1 Introduction

Multi-label learning has been investigated widely by the machine learning community in recent years (de Carvalho and Freitas 2009; Tsoumakas et al. 2010; Gibaja and Ventura 2014). It deals with classification tasks where an instance can be simultaneously classified into more than one class. Each class is represented by one label. Several domains, such as text (Klimt and Yang 2004; Pestian et al. 2007), multimedia (Duygulu et al. 2002; Zhou and Zhang 2006; Briggs et al. 2013) and biology (Elisseeff and Weston 2001), are intrinsically multi-label.

A common approach to dealing with multi-label classification tasks is to transform the original data set into one or more single-label data sets. A conventional binary classification algorithm, called base algorithm here, is used to induce predictive models for each one of them. As such, a transformation strategy defines how to decompose the original task into a set of single-label tasks and how to combine the results obtained from these tasks to solve the original task (Tsoumakas et al. 2010). Many strategies have been proposed to transform the data and address multi-label tasks, exploring different aspects, such as label correlation (Read et al. 2011; Cherman et al. 2012; Montañes et al. 2014), dimensionality reduction (Tsoumakas et al. 2008; Zhang and Wu 2015) and class imbalance (Zhang and Wu 2015; Tsoumakas et al. 2011b).

Although the base algorithm can be seen as a hyperparameter of the transformation strategies, it is generally fixed for all strategies, so that only a single base algorithm is considered in the whole experiment (Read et al. 2011; Montañes et al. 2014; Madjarov et al. 2012). Given that a comprehensive comparison of the binary transformation strategies using different base algorithms has not yet been performed, this study assesses the hypothesis that the base algorithms can have a stronger influence than the binary transformation strategies on the predictive performance of multi-label models. At first glance, this may seem a trivial question to investigate; however, if the choice of the base algorithm matters more for the quality of the results than the choice of the strategy, then several base algorithms should be considered in empirical studies evaluating these strategies.

In the multi-label literature, the most similar comparative study was performed by Madjarov et al. (2012), where 12 strategies (including 3 binary transformation strategies) were evaluated under several measures, using the original train and test partitions of 11 benchmark data sets. Even though a variety of different algorithms were considered, the transformation strategies were evaluated with a single base algorithm, Support Vector Machine (SVM). Another large empirical study covering multiple ensemble strategies (Moyano et al. 2018) used only the C4.5 decision tree as the base algorithm. Nevertheless, a few studies have considered more than one base algorithm. These include Tsoumakas and Katakis (2007) and Cherman et al. (2012), who did not compare strategies using different base algorithms, and Zufferey et al. (2015), who compared strategies with distinct base algorithms, but only on a single data set.

Methods using Automatic Machine Learning (Auto-ML) to address multi-label classification tasks also consider multiple base algorithms (de Sá et al. 2017, 2018; Wever et al. 2018, 2019). During the search for a solution, the Auto-ML method may find a suitable combination of strategy and base algorithm that optimizes a fitness function. In these cases, choosing the base algorithm is seen as part of the solution, and the comparison of the strategies does not fix a base algorithm, as is done in other studies.

Since the most common strategies are based on binary transformations, this paper focuses on these strategies. Hence, 10 binary transformation strategies and 5 different base algorithms (plus one with its hyperparameters tuned) were evaluated using \(5\times 2\)-fold cross-validation on 20 benchmark data sets. In contrast to previous studies, which used null hypothesis significance testing, we ran Bayesian statistical tests (Benavoli et al. 2017) to assess the statistical significance of the differences in the predictive performance of the assessed strategies over different evaluation measures. To the best of our knowledge, this is the most extensive multi-label empirical study carried out so far.

The results reported reinforce the claim that the predictive performance obtained by transformation strategies is affected by the base algorithm used. Thus, experimental studies in multi-label learning must take into account experiments with several different base algorithms. In particular, many of the binary transformation strategies obtained very similar results, with differences mainly due to the choice of the base algorithm. Therefore, previous comparative studies (Madjarov et al. 2012; Moyano et al. 2018) might have reached different conclusions if other base algorithms had been employed. Additionally, for many data sets, the investigated strategies consistently predicted only a subset of the existing labels, never assigning the remaining labels to any instance. This problem was previously observed in the food truck data set (Rivolli et al. 2018); however, as far as we know, it has never been widely investigated.

The rest of the paper is organized as follows: Sect. 2 formally defines the main concepts relevant for multi-label learning. Section 3 details the investigated strategies. Section 4 describes the experimental design, including data sets, evaluation procedures, base classifiers, tools, and hyperparameter values adopted. Section 5 presents, analyzes and discusses the empirical results. In the last section, conclusions are drawn concerning relevant findings from the experimental study and future work directions.

2 Multi-label learning

In multi-label learning, an instance can be simultaneously associated with more than one label. The main tasks in this field are Multi-Label Classification and Label Ranking.

Multi-Label Classification (MLC), the most common task (Tsoumakas et al. 2010), induces a predictive model \(h: {\mathcal {X}} \rightarrow {\mathcal {Y}}\) from a set of training data \({\mathcal {D}}\), which later assigns labels to new examples. This task can be formally defined as follows. Let \({\mathcal {D}}\) be a set of labeled instances, such that \({\mathcal {D}} = \left\{ (x_1, Y_1), ..., (x_n, Y_n)\right\}\). Every labeled instance is composed of \(x_i = (x_{i1}, x_{i2}, ..., x_{id}) \in {\mathbb {R}}^d\), and \(Y_i \subseteq {\mathcal {L}}\), such that \({\mathcal {L}} = \left\{ \lambda _1, \lambda _2, ..., \lambda _q\right\}\) is the set of all q labels \(\lambda _i\). For the sake of convenience, the labels associated with the \(i^{th}\) instance, also called label set, can be seen as a binary vector \(y_i = (y_{i1}, y_{i2}, \ldots , y_{iq}) \in \{0,1\}^q\), where \(y_{ij} = 1 \,\textit{iff}\, \lambda _j \in Y_i\) and \(y_{ij} = 0 \,\textit{iff}\, \lambda _j \not \in Y_i\). Finally, model h is used to predict, for a test instance \((x_i, ?)\), the set of relevant labels \(\hat{Y}_i\) (or \(\hat{y}_i\) as a binarized prediction).

In the Label Ranking (LRK) task, a model outputs a ranking of the labels for each test instance. This ranking can easily be computed by any model that provides a score indicating the probability of each label being relevant to a given instance: the higher the score, the better the ranking position of the corresponding label. In turn, MLC can be derived from the LRK formulation (Gibaja and Ventura 2015).

A multi-label model can be obtained by using two approaches (Tsoumakas and Katakis 2007), problem transformation and algorithm adaptation. The former converts the original multi-labeled data into a set of binary or multi-class data sets, whereas for the latter, the multi-label support is embedded into the algorithm’s structure. Thus, the transformation approach fits the data to the algorithms, and the adaptation approach fits the algorithms to the data (Zhang and Zhou 2014).

A straightforward transformation is to build a binary classifier for each label individually. This is known as the Binary approach. On the other hand, a multi-class transformation can be considered, in which each label set (combination of labels) is mapped to one class. Both approaches are algorithm independent (de Carvalho and Freitas 2009), in the sense that any traditional classification algorithm that is capable of handling such problems can be used as the base algorithm.

We want to emphasize that the binary transformation approach implies that algorithms are trained separately, but not necessarily independently; this will become apparent in the following section. In addition, many hybrid approaches exist, such as Pairwise, which models pairwise combinations (a one-vs-one approach), and subset approaches, which include the well-known RAkEL strategy (Tsoumakas et al. 2011a).

Binary transformation generates at least one data set per label. Each binary data set \({\mathcal {D}}_j'\) is related to the label \(\lambda _j\). The instances associated with \(\lambda _j\) are labeled with a class value of “1”, and all others are labeled with a class value of “0”. The number of binary data sets generated is defined by \(|{\mathcal {D}}'| = mq\), where m is the number of data sets per label. Therefore, the complexity of this family of strategies is linear in the number of labels q. Negative aspects of this approach include the tendency to generate rather imbalanced data sets and the fact that some of these strategies ignore the relationships between labels (Zhou et al. 2012).

The binary transformation strategies are organized into three groups, one-round, stacking, and ensemble, according to the value of m. One-round strategies are the simplest, with \(m = 1\). A special case of one-round is chaining, which increases the input space by adding already predicted labels as features to predict the others, in a chain. In stacking strategies, two rounds of training and prediction steps are performed, thus \(m = 2\). They augment the input space in the second round by using the values of the labels predicted in the first round as features. When all the labels are used, they are called full-stacking; when only a subset of the labels is used, they are called pruned-stacking. Finally, ensemble strategies use more than two models for each label (\(m>2\)), and the value of m is usually a hyperparameter defined by the user. When the same instances and attributes are shared by all internal models, the ensemble is homogeneous. However, when each member and label use distinct data sets as training data, the ensemble is heterogeneous. The former can be seen as an ensemble of multi-labeled data, whereas the latter corresponds to multiple ensembles of single-label data (Gibaja and Ventura 2015). These groups and their strategies are detailed in Sect. 3.

A base classification algorithm must always be chosen to induce predictive models for each transformed data set \({\mathcal {D}}'\). Later, these models are used to predict the relevance of each label for new instances. If the models predict a score instead of a class, the strategies support both tasks, MLC and LRK (Gibaja and Ventura 2015). Since producing the scores is the responsibility of the base algorithm and the binary transformation strategies are independent of it, any transformation strategy can be used to address both tasks. Distinctions between them will not be considered in the rest of this paper.

As previously mentioned, this study is restricted to analyzing strategies based on binary transformation, which are relevant for a broad group of researchers and practitioners. Besides, for most of them, their individual models can be trained separately (thus, allowing for parallelism), they are simple to interpret, they have been successfully used in many state-of-the-art comparisons in the literature, and they usually exhibit acceptable time complexity, almost linear with the number of labels. Using separate classifiers, each focused on only one label, allows for higher flexibility, choosing potentially different approaches on a per-label basis. Furthermore, new labels can usually be added to the problem without retraining the models built for existing labels. In general, as some of the strategies are conceptually quite similar to each other, their practical differences may be highlighted by comparing their performances using different base algorithms, an approach we put forward in this paper.

3 Strategies

In this section, the 10 binary transformation strategies considered are described. Table 1 presents the strategies organized into groups, defined by the number of binary models generated per label, and the subgroups according to their main characteristic.

Table 1 Binary transformation strategies organized into groups/subgroups according to the number of binary models per label and their main characteristic

3.1 One-round

The one-round strategies are characterized by generating only a single binary data set for each label. Binary models are induced from these data sets and used for multi-label prediction. The strategies from this group differ mainly by how they transform the data sets.

Binary Relevance (BR) (Boutell et al. 2004) is the simplest and most popular multi-label strategy (Luaces et al. 2012; Montañes et al. 2014). For each label \(\lambda _j\), an independent binary data set is generated according to

$$\begin{aligned} {\mathcal {D}}_j' = \left\{ (x_i, y_{ij}) \mid 1 \le i \le n\right\} , \end{aligned}$$
(1)

and will be used to induce a binary model \(\theta _j\). The prediction is performed using the values of all binary models as follows:

$$\begin{aligned} h_{br} = \{\lambda _j \,|\, \theta _j(x) = 1, \,1 \le j \le q\}. \end{aligned}$$
(2)
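To make the transformation concrete, the following is a minimal sketch of BR in R, the environment used in this work (Sect. 4.5). It assumes X is a data.frame of predictive attributes and Y an n x q binary label matrix; logistic regression (glm) stands in for an arbitrary base algorithm, and the function names br_train and br_predict are illustrative, not the utiml API.

```r
# Minimal BR sketch (Eqs. 1 and 2); glm is an arbitrary stand-in base algorithm.
br_train <- function(X, Y) {
  lapply(seq_len(ncol(Y)), function(j) {
    Dj <- cbind(X, class = factor(Y[, j]))        # D'_j = {(x_i, y_ij)}
    glm(class ~ ., data = Dj, family = binomial)  # binary model theta_j
  })
}

br_predict <- function(models, Xnew, threshold = 0.5) {
  scores <- sapply(models, predict, newdata = Xnew, type = "response")
  (scores >= threshold) * 1                       # bipartition h_br
}
```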

3.1.1 Chaining

The Classifier Chains (CC) strategy (Read et al. 2009, 2011) organizes the labels in a chain and increases the original input space of the transformed data set for a given label with the values of all previous labels in the chain. Thus, the data set is transformed as follows:

$$\begin{aligned} {\mathcal {D}}_j' = \left\{ ([x_i, y_{i1}, y_{i2}, \ldots , y_{i(j-2)}, y_{i(j-1)}], y_{ij}) \mid 1 \le i \le n\right\} . \end{aligned}$$
(3)

The model related to the first label in the chain is obtained exclusively from the original input data, without adding any predictive attributes, as shown in Eq. 1. The other models increase their input space by adding \(j - 1\) new attributes, where j is the position of the respective label in the chain. During the prediction phase, as the labels are predicted, their values are used to increase the input space, as shown next

$$\begin{aligned}&h_{cc} = \{\lambda _j \,|\, \hat{y}_{j} = 1, \,1 \le j \le q\}, \text {where} \nonumber \\&\hat{y}_{j} = \theta _j([x, \hat{y}_{1}, \hat{y}_{2}, \ldots , \hat{y}_{(j-2)}, \hat{y}_{(j-1)}]). \end{aligned}$$
(4)
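The chained prediction of Eq. 4 can be sketched as follows, under the assumption that the chain models were trained (e.g., with glm, as in the BR sketch above) on the original attributes plus columns named "y1", ..., "y(j-1)" holding the previous labels in the chain; cc_predict is an illustrative name.

```r
# Minimal sketch of CC prediction for a single instance (Eq. 4).
cc_predict <- function(models, xnew, threshold = 0.5) {
  q <- length(models)
  yhat <- integer(q)
  aug <- xnew                                        # one-row data.frame
  for (j in seq_len(q)) {
    score <- predict(models[[j]], newdata = aug, type = "response")
    yhat[j] <- as.integer(score >= threshold)
    aug[[paste0("y", j)]] <- yhat[j]                 # augment the input space for the next link
  }
  yhat
}
```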

Nested Stacking (NS) (Senge et al. 2013) brings two modifications to CC. In the training phase, it uses the predicted labels instead of the real labels. Furthermore, in the prediction phase, it makes a subset correction, in order to predict only preexisting label sets.

The transformation step is similar to Eq. 3. However, the original label values y are replaced by the predicted values \(\hat{y}\), such that

$$\begin{aligned} {\mathcal {D}}_j' = \left\{ ([x_i, \hat{y}_{i1}, \hat{y}_{i2}, \ldots , \hat{y}_{i(j-2)}, \hat{y}_{i(j-1)}], y_{ij}) \mid 1 \le i \le n\right\} , \end{aligned}$$

where \(\hat{y}_{ij}\) is the prediction of the binary model \(\theta _j\) for the instance \(x_i\) presented in the training data. The prediction is obtained similarly to Eq. 4 followed by the subset correction. The \(\hat{y}\) is replaced by \(y^* \in Y\), which is the vector in Y that is most similar to \(\hat{y}\), such that

$$\begin{aligned}&h_{ns} = \{\lambda _j \,|\, y^*_{j} = 1, \,1 \le j \le q\}, \text {where} \\&y^* = \mathop {\mathrm{arg}\,\mathrm{min}}\limits _{y \,\in \, Y} \, dist(\hat{y}, y), \end{aligned}$$

and dist is the Hamming distance, which corresponds to the number of differences between two binary vectors. When more than one minimum is found, the label set with the highest frequency in the training data is selected.
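The subset correction step can be sketched as below, assuming yhat is the predicted binary vector and Ytrain the n x q binary label matrix of the training data; because every training instance contributes one row, the frequency-based tie-breaking falls out of counting the rows at minimum distance. The function name is illustrative.

```r
# Minimal sketch of the NS subset correction.
subset_correction <- function(yhat, Ytrain) {
  dists <- rowSums(abs(sweep(Ytrain, 2, yhat)))       # Hamming distance to each training label set
  candidates <- which(dists == min(dists))
  keys <- apply(Ytrain[candidates, , drop = FALSE], 1, paste, collapse = "")
  best <- names(which.max(table(keys)))               # most frequent label set among the minima
  as.integer(strsplit(best, "")[[1]])
}
```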

3.2 Stacking

The stacking strategies are characterized by using the stacked generalization learning paradigm (Wolpert 1992). In the multi-label context, they use two rounds of binary transformation, where, in the second round, the input space is augmented with information about the labels obtained in the first round. The main difference among the stacked strategies is how they choose the labels that augment the input space. Some of them use all labels (full stacking), while others use only a subset of the labels (pruned stacking).

3.2.1 Full stacking

BR+ (Cherman et al. 2012) and Dependent Binary Relevance (DBR) (Montañes et al. 2014) are very similar to each other. In the training phase, they perform exactly the same procedure. The first round is characterized by the induction of a BR model, according to Eqs. 1 and 2. In the second round, the transformation is performed by increasing the input space using the original labels. To illustrate how it works, let \(\phi _j(y)\) be a function that removes the label \(\lambda _j\) from the vector y, such that

$$\begin{aligned}&{\mathcal {D}}_j^{''} = \left\{ ([x_i, \phi _j(y_i)], y_{ij}) \mid 1 \le i \le n\right\} ,\ \text {where}\nonumber \\&\phi _j(y) = (y_1, \ldots , y_{(j-1)}, y_{(j+1)}, \ldots , y_{q}). \end{aligned}$$
(5)

It should be noted, though, that there is a subtle difference in the prediction phase, specifically in the second round. DBR predicts the labels using the second-round binary models, which take as input the labels obtained from the first-round binary models. Using the \(\phi\) function presented in Eq. 5, the prediction is obtained as follows:

$$\begin{aligned} h_{dbr} = \{\lambda _j \,|\, \theta _j^{''} ([x,\phi _j(h_{br}(x))]) = 1, \,1 \le j \le q\}. \end{aligned}$$

In contrast, BR+ updates the labels obtained from the first-round binary models while the second-round prediction is taking place. Given a chain of labels (for example, \(\lambda _1 \prec \lambda _2 \prec \cdots \prec \lambda _q\)), the prediction is obtained in the following way:

$$\begin{aligned}&h_{br+} = \{\lambda _j \,|\, \theta _j^{''}([x, \phi _j(\hat{y})]) = 1, \,1 \le j \le q\},\nonumber \\&\text {for each } j\text {, }\quad \hat{y} = (\hat{y}_1, \ldots , \hat{y}_{(j-1)}, \theta _j^{''}([x, \phi _j(\hat{y})]), \hat{y}_{(j+1)}, \ldots , \hat{y}_q). \end{aligned}$$
(6)

Recursive Dependent Binary Relevance (RDBR) (Rauber et al. 2014) induces two rounds of models, as DBR does, but it applies the second-round models several times, in a recursive way. The labels predicted by the second-round models are used to update the input space, and the second round is executed again until either the result converges or a fixed number of iterations is reached. In practice, it is the same process as in Eq. 6, but while BR+ performs a single update, RDBR updates recursively until a stopping criterion is reached.
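The second-round transformation shared by BR+, DBR and RDBR (Eq. 5) can be sketched as follows, reusing br_train from the BR sketch; Y is assumed to be a binary label matrix with column names, so that the same feature names can be reproduced from the first-round predictions at prediction time. The name dbr_train and the returned list structure are illustrative.

```r
# Minimal sketch of the full-stacking training phase (Eq. 5).
dbr_train <- function(X, Y) {
  second <- lapply(seq_len(ncol(Y)), function(j) {
    aug <- cbind(X, Y[, -j, drop = FALSE])        # phi_j(y): every label except lambda_j
    Dj2 <- cbind(aug, class = factor(Y[, j]))     # D''_j
    glm(class ~ ., data = Dj2, family = binomial)
  })
  list(first = br_train(X, Y), second = second, labels = colnames(Y))
}
```

At prediction time, the strategies differ only in how the label features are filled in: DBR uses the first-round predictions directly, BR+ updates them label by label along a chain, and RDBR repeats the BR+ update until convergence or a maximum number of iterations.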

3.2.2 Pruned stacking

The Meta-BR (MBR) strategy (Godbole and Sarawagi 2004; Read et al. 2011) augments the input space using the values of the most correlated labels (Tsoumakas et al. 2009). The Pearson product moment correlation coefficient for categorical variables \(\rho\) is computed for each pair of labels and a threshold \(\tau\) is used to define which labels should augment the space of attributes. The data set in the second round is obtained in the following way:

$$\begin{aligned}&{\mathcal {D}}_j^{''} = \left\{ ([x_i, \phi _j(\hat{y}_i)], y_{ij}) \mid 1 \le i \le n\right\} \text {, where}\\&\phi _j(\hat{y}) = \{\hat{y}_l \,|\, \rho (\lambda _j, \lambda _l) \ge \tau , \,1 \le l \le q\}, \end{aligned}$$

and \(\phi _j(\hat{y}_i)\) returns only the most related labels. Unlike the other stacked strategies, instead of using the original labels in the second transformation, it uses the predicted labels obtained in the first round.

The final prediction is the result of the binary models in the second step, such that:

$$\begin{aligned} h_{mbr} = \{\lambda _j \,|\, \theta _j^{''} ([x,\phi _j(h_{br}(x))]) = 1, \,1 \le j \le q\}. \end{aligned}$$
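The pruning step of \(\phi _j\) can be sketched as below, assuming Y is an n x q binary label matrix; the threshold value tau = 0.3 and the function name prune_labels are illustrative (the value actually used in the experiments is given in Table 3). Following the definition of \(\phi _j\) above, a label's own column is not explicitly excluded.

```r
# Minimal sketch of MBR's correlation-based label pruning.
prune_labels <- function(Y, tau = 0.3) {
  rho <- cor(Y)                                   # pairwise label correlations
  lapply(seq_len(ncol(Y)), function(j) which(rho[j, ] >= tau))
}
```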

The Pruned and confiDent (PruDent) strategy (Alali and Kubat 2015) uses only the most relevant labels, as MBR does, and the original values to augment the second input space, as BR+ and DBR do. The Information Gain (IG) measure is used to prune the irrelevant labels based on a threshold \(\tau\). The PruDent transformation is the same as Eq. 5, with the exception of the \(\phi\) function:

$$\begin{aligned} \phi _j(y) = \{y_l \,|\, \textit{IG}(\lambda _j, \lambda _l) \ge \tau , \,1 \le l \le q, \,l \ne j\}. \end{aligned}$$

Unlike the others, PruDent assigns a label to an example if either of the corresponding models, from the first or the second round, predicts it. The predictions are obtained in the following way:

$$\begin{aligned} h_{prud} = \{\lambda _j \,|\, \theta _j(x) = 1 \,\vee \, \theta _j^{''} ([x, \phi _j(h_{br}(x))]) = 1, \,1 \le j \le q\}. \end{aligned}$$

3.3 Ensemble

Ensemble of Binary Relevance (EBR) and Ensemble of Classifier Chains (ECC) (Read et al. 2011) are simply ensembles of models induced by the BR strategy and by the CC strategy, respectively. Both ensembles use bagging and choose a different random subset of the attributes at each bagging iteration. To illustrate how EBR computes predictions, let m be the number of models in the ensemble and \(\phi _l\) a function that selects the random subset of attributes used by the l-th model:

$$\begin{aligned}&h_{ebr} = \{\lambda _j \,|\, \left( \frac{1}{m} \sum ^m_{l=1} \hat{y}_{lj} \right) > \tau , \,1 \le j \le q\},\ \text {where}\\&\hat{y}_l = h_{br}^l(\phi _l(x)), \end{aligned}$$

\(\hat{y}_{lj}\) is the predicted value of the BR model l for the label \(\lambda _j\) and \(\tau\) is a threshold value. For the ECC strategy, internal models are built using \(h_{cc}\) with different chains, avoiding the influence that choosing an inappropriate chain could have on the results.
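The vote aggregation of EBR can be sketched as follows, assuming predictions is a list with the m score (or 0/1) matrices produced by the internal BR models, each of size n_test x q; the default tau and the function name are illustrative.

```r
# Minimal sketch of the EBR/ECC vote aggregation.
ebr_predict <- function(predictions, tau = 0.5) {
  avg <- Reduce(`+`, predictions) / length(predictions)  # (1/m) * sum of the votes
  (avg > tau) * 1                                         # thresholded bipartition
}
```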

4 Experimental design

This section presents an experimental comparison across the binary transformation strategies and base algorithms. It describes the multi-label data sets, followed by a short overview of evaluation measures and procedures. Next, it explains the methodology adopted and the experimental setup.

4.1 Data sets

Table 2 lists the 20 multi-label data sets used for the experiments. They are from distinct domains (column Domain) and have a wide diversity in their characteristics. The columns Inst, Attr and Lbl are respectively the number of instances, attributes and labels. Label sets (lSets) is the number of distinct label combinations; the proportion of unique label sets (PUL) indicates the proportion of label sets associated with a single instance; label cardinality (lCard) measures the average number of labels per instance; label density (lDen) describes the average frequency of the labels; dependency (Dep) shows the average unconditional label dependency (Luaces et al. 2012); inner imbalance degree (IID) measures the average label imbalance in the binary data sets (Raez et al. 2004); and, finally, correlation (Corr) indicates the average correlation between the predictive attributes and the labels.

Letting \(\rho _{jk}\) be the Pearson correlation coefficient between the \(j^{th}\) attribute and the label \(\lambda _k\), the correlation is computed as

$$\begin{aligned} \textit{Corr} = \frac{1}{q} \sum ^q_{k=1} max(|\rho _{1k}|, | \rho _{2k}|, ..., |\rho _{dk}|), \end{aligned}$$

where d is the number of attributes. A high value for this measure means that there is at least one attribute which is strongly correlated to each label, while a low value indicates the opposite.
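A minimal sketch of this measure, assuming X is a numeric attribute matrix (n x d) and Y a binary label matrix (n x q); the function name is illustrative:

```r
# Average, over the labels, of the best absolute attribute-label correlation.
corr_measure <- function(X, Y) {
  rho <- abs(cor(X, Y))        # d x q matrix of |rho_jk|
  mean(apply(rho, 2, max))     # Corr
}
```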

Table 2 Characteristics of the multi-label data sets

These data sets are frequently used as benchmarks for multi-label experiments. They come from different domains, organized here as text, image, audio, biology and other. The text-domain data sets are related to aviation safety reports (tmc2007-500, Srivastava and Zane-Ulman 2005), medical documents (medical, Pestian et al. 2007), emails (enron, Klimt and Yang 2004), newsgroups (20ng, Lang 1995), scientific literature (fapesp, Cherman et al. 2014; ohsumed, Joachims 1998), web forums (stackex_chess, Charte et al. 2015), and web content (langlog and slashdot, Read et al. 2011). Text data sets have a higher number of attributes than most of the data sets from the other domains and also contain the largest average value of correlation between attributes and labels.

The image-domain data sets are related to food (yelp), images extracted from videos (mediamill, Snoek et al. 2006), scene classification (image, Zhou and Zhang 2006; scene, Boutell et al. 2004), and vector graphics (corel5k, Duygulu et al. 2002). They have the highest average number of labels and label sets of all domains. The data sets with the highest average dependency degree among the labels are from the audio domain. They are related to detecting emotions in songs (emotion, Trohidis et al. 2011), the identification of music styles (msd-195, Bernardini et al. 2014), music effects classification (cal500, Turnbull et al. 2008) and sounds of birds (birds, Briggs et al. 2013).

The last two data sets are yeast (Elisseeff and Weston 2001), a data set from the biology domain that associates gene expressions with biological functions, and flags (Gonçalves et al. 2013), a data set of countries in which the colors of their flags are the labels.

The data sets come from the Cometa repository (Charte et al. 2018), an exhaustive collection of MLC data sets, integrated with the tools used in this work. The exceptions are the data sets fapesp and msd-195, obtained from their respective authors, and yelp, obtained from the Kaggle website. The data sets were preprocessed with three operations. First, labels with fewer than 10 instances were removed, to ensure a minimum number of instances with each label in the training and test folds. Next, instances with no labels were also removed. Finally, predictive attributes with constant values were removed.

Concerning the characteristics shown in Table 2, the density (lDen) and the inner imbalance degree (IID) are inversely correlated. As the density increases, the imbalance degree decreases, and vice-versa. We did not find high correlation among the other characteristics.

4.2 Evaluation measures

The evaluation of the predictive performance of multi-label strategies requires using different measures to assess different dimensions (Tsoumakas et al. 2010). They are organized here into example-based, label-based and ranking measures. The example-based measures summarize the predictive performance over all instances, whereas the label-based measures summarize the performance over all labels. The ranking measures are a specialization of the former, using the prediction scores instead of the crisp values. As many evaluation measures are highly correlated with each other (Pereira et al. 2018), only a subset was used.

4.2.1 Example-based measures

Hamming-loss (HL) is an error measure that evaluates the misclassification rate for each label of every instance (Schapire and Singer 1999). This measure does not distinguish between false positive and false negative errors, giving the same weight for both, as shown next

$$\begin{aligned} \textit{HL}&= \frac{1}{n} \sum ^n_{i = 1} \frac{1}{q} {\mid }h(x_i) \, \Delta \, Y_i{\mid },\ \text {where}\nonumber \\&\quad A \, \Delta \, B = (A - B) \cup (B - A). \end{aligned}$$
(7)

While Hamming-loss is the most relaxed measure, Subset-accuracy (SA) is the strictest (Gibaja and Ventura 2015). It accounts only for correctly predicted label sets, ignoring partial hits. A partially correct prediction is valued the same way as a completely incorrect one, as if the set of predicted or observed labels were treated as a class value in single-label classification (Zhang and Zhou 2014). It is computed as

$$\begin{aligned} SA&= \frac{1}{n} \sum ^n_{i = 1} I(h(x_i) = Y_i) ,\ \text {where} \nonumber \\&\quad I(\cdot ) = {\left\{ \begin{array}{ll} 1 &{} \text {if the predicate is true,}\\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(8)

Let us call the labels associated with an instance its relevant labels. We can use them to define the following measures. Precision is the fraction of relevant labels among those predicted. A high precision indicates a high ability of a model to correctly predict labels, although not necessarily all of them. Recall is the fraction of relevant labels that have been predicted out of all relevant labels. A high recall indicates that a model predicts many labels correctly, but not necessarily only the relevant ones. The \(F_1\) measure (F1) computes the harmonic mean between precision and recall. A model with a high value for this measure can predict the relevant labels accurately and only them. It does not take the true negatives into account, combining just the rate of relevant labels among the predicted ones and the rate of predicted relevant labels over all relevant labels. F1 is computed as

$$\begin{aligned} F1 = \frac{1}{n} \sum ^n_{i = 1} \frac{2{\mid }h(x_i) \cap Y_i{\mid }}{{\mid }h(x_i){\mid } + {\mid }Y_i{\mid }}. \end{aligned}$$
(9)
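The three example-based measures can be computed directly from the bipartitions, as in the sketch below, where Yhat and Y are assumed to be n x q binary matrices of predictions and true labels (instances with empty predicted or true label sets are assumed not to occur, cf. Sect. 4.5); the function names are illustrative.

```r
# Minimal sketches of the example-based measures (Eqs. 7-9).
hamming_loss    <- function(Yhat, Y) mean(Yhat != Y)
subset_accuracy <- function(Yhat, Y) mean(apply(Yhat == Y, 1, all))
f1_example <- function(Yhat, Y) {
  inter <- rowSums(Yhat * Y)                       # |h(x_i) intersect Y_i|
  mean(2 * inter / (rowSums(Yhat) + rowSums(Y)))   # per-instance F1, averaged
}
```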

4.2.2 Label-based measures

Label-based measures usually come in two variants: micro-averaged and macro-averaged. The macro-averaged measures summarize the label distribution by giving the same weight to all labels (Yang 1999). They assess the consistency across all labels and are thus highly sensitive to the performance on the least common labels, which is usually low (Jackson and Moulinier 2002).

To illustrate how these measures work, let \(\textit{TP}\), \(\textit{FP}\), \(\textit{TN}\) and \(\textit{FN}\) be respectively the true positive, false positive, true negative and false negative values from a confusion matrix, such that

$$\begin{aligned} \textit{Precision}_b= {} \frac{\textit{TP}}{\textit{TP} + \textit{FP}}, \end{aligned}$$
(10)
$$\begin{aligned} \textit{Recall}_b= {} \frac{\textit{TP}}{\textit{TP} + \textit{FN}}, \end{aligned}$$
(11)
$$\begin{aligned} \textit{F1}_b= {} \frac{2\textit{TP}}{2\textit{TP} + \textit{FP} + \textit{FN}}. \end{aligned}$$
(12)

The macro label-based version computes the previous measures for each label and returns their average value, such that

$$\begin{aligned} \text {macro-}\beta =\frac{1}{q}\sum ^q_{j=1} \beta (\textit{TP}_j, \textit{FP}_j, \textit{TN}_j, \textit{FN}_j), \end{aligned}$$

where \(\beta = \{\textit{Precision}_{b}\,| \,\textit{Recall}_{b}\,|\,\textit{F1}_b\}\), from Eqs. 10, 11 and 12, respectively.

The label problem measures MLP and WLP (Rivolli et al. 2018) will also be considered. The Missing Label Prediction (MLP) measure indicates the proportion of labels that are never predicted by a strategy. The Wrong Label Prediction (WLP) measure, which can be seen as a generalization or relaxation of MLP, represents the case where a label might be predicted for some instances, but these predictions are always wrong. Equations 13 and 14 formalize these measures, respectively. In an ideal scenario, their expected value is zero.

$$\begin{aligned} \textit{MLP}= {} \frac{1}{q} \sum ^q_{j=1} I(\textit{TP}_j + \textit{FP}_j == 0) \end{aligned}$$
(13)
$$\begin{aligned} \textit{WLP}= {} \frac{1}{q} \sum ^q_{j=1} I(\textit{TP}_j == 0) \end{aligned}$$
(14)
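The macro-averaged F1 and the two label problem measures can be sketched as follows, again assuming n x q binary matrices Yhat and Y; labels that are neither observed nor predicted in a test fold would yield undefined per-label values and are assumed not to occur. Function names are illustrative.

```r
# Minimal sketches of macro-F1 (Eq. 12 averaged over labels) and MLP/WLP (Eqs. 13-14).
macro_f1 <- function(Yhat, Y) {
  mean(sapply(seq_len(ncol(Y)), function(j) {
    tp <- sum(Yhat[, j] == 1 & Y[, j] == 1)
    fp <- sum(Yhat[, j] == 1 & Y[, j] == 0)
    fn <- sum(Yhat[, j] == 0 & Y[, j] == 1)
    2 * tp / (2 * tp + fp + fn)
  }))
}
mlp <- function(Yhat)    mean(colSums(Yhat) == 0)      # labels never predicted
wlp <- function(Yhat, Y) mean(colSums(Yhat * Y) == 0)  # labels never predicted correctly
```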

4.2.3 Ranking measures

Ranking measures consider the ranking of the labels instead of the quality of the bipartitions that define the predicted labels. One-error (OE) is an extreme measure that assesses only the error of the label predicted with the most confidence. This measure is computed as follows:

$$\begin{aligned} \textit{OE} = \frac{1}{n} \sum ^n_{i = 1} I(\mathrm{arg} \, \underset{\lambda _j \in {\mathcal {L}}}{\mathrm{max}} \, f(x_i, \lambda _j) \not \in Y_i ) \end{aligned}$$

Ranking-loss (RL) computes the average rate of label pairs that are incorrectly sorted when using their predicted probabilities. It is calculated as follows:

$$\begin{aligned} RL= &\, {} \frac{1}{n} \sum ^n_{i = 1} \frac{{\mid }\{(\lambda _j, \lambda _k)| f(x_i, \lambda _j) \le f(x_i, \lambda _k),(\lambda _j, \lambda _k) \in Y_i \times \overline{Y}_i \}{\mid }}{{\mid }Y_i{\mid } {\mid }\overline{Y}_i{\mid }}, \\& {\text {where }} \overline{Y_i} = {\mathcal {L}} \setminus Y_i. \end{aligned}$$
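Both ranking measures can be sketched from a score matrix, assuming scores is an n x q matrix with the values \(f(x_i, \lambda _j)\) and Y the corresponding binary label matrix, and that every instance has at least one relevant and one irrelevant label; the function names are illustrative.

```r
# Minimal sketches of one-error and ranking-loss.
one_error <- function(scores, Y) {
  mean(sapply(seq_len(nrow(scores)), function(i) Y[i, which.max(scores[i, ])] == 0))
}
ranking_loss <- function(scores, Y) {
  mean(sapply(seq_len(nrow(scores)), function(i) {
    rel <- scores[i, Y[i, ] == 1]
    irr <- scores[i, Y[i, ] == 0]
    mean(outer(rel, irr, "<="))       # fraction of misordered (relevant, irrelevant) pairs
  }))
}
```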

4.3 Multi-label baselines

Different baselines were adopted, optimizing different measures. With the exception of the \(\hbox {baseline}_{{ RL}}\), they were proposed by Metz et al. (2012). The \(\hbox {baseline}_{{ F1}}\) literally predicts the label set that maximizes the F1 measure (Eq. 9) for the training data, such that

$$\begin{aligned} \textit{baseline}_{{ F1}} = \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\hat{Y} \,\subseteq \, {\mathcal {L}}} \, F1(Y, \hat{Y}), \end{aligned}$$

where \(\hat{Y}\) is the predicted label set. This baseline is also used for the label-based measures macro-F1, macro-precision and macro-recall.

The \(\hbox {baseline}_{{ HL}}\) predicts the labels present in more than 50% of the training instances, such that

$$\begin{aligned} \textit{baseline}_{{ HL}} = \{\lambda _j \,|\, \textit{freq}(\lambda _j) > 0.5, 1 \le j \le q \}, \end{aligned}$$

where \(\textit{freq}(\lambda _j)\) is the frequency of the label \(\lambda _j\) in the training data. In turn, baseline\(_{SA}\) predicts the most frequent label set in the training data, such that

$$\begin{aligned} \textit{baseline}_{{ SA}} = \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\hat{Y} \,\subseteq \, {\mathcal {L}}} \, \sum ^n_{i=1} I(Y_i = \hat{Y}) \end{aligned}$$

where I is the indicator function defined in Eq. 8.

Finally, the \(\hbox {baseline}_{{ RL}}\) (Rivolli et al. 2018), an adaptation of the \({ General}_B\) baseline (Metz et al. 2012), predicts a ranking of labels according to their frequency, such that

$$\begin{aligned} \textit{rank}(\lambda _j) = |{\mathcal {L}}| - |\left\{ \lambda _k \mid \lambda _k \in {\mathcal {L}}, \, \textit{freq}(\lambda _j) > \textit{freq}(\lambda _k) \right\} |, \end{aligned}$$

and

$$\begin{aligned} \textit{baseline}_{{ RL}} = \{\lambda _j \,|\, \textit{rank}(\lambda _j) \le \textit{lcard}, \, 1 \le j \le q \}, \end{aligned}$$

where lcard is the label cardinality of the training data. This baseline is used for the ranking measures: one-error and ranking-loss.
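Two of these baselines can be sketched directly from the label frequencies of the training data, assuming Ytrain is its n x q binary label matrix; the function names are illustrative.

```r
# Minimal sketches of baseline_HL and baseline_RL.
baseline_hl <- function(Ytrain) as.integer(colMeans(Ytrain) > 0.5)    # labels with frequency > 50%

baseline_rl <- function(Ytrain) {
  freq  <- colMeans(Ytrain)                                           # freq(lambda_j)
  lcard <- mean(rowSums(Ytrain))                                      # label cardinality
  rnk   <- sapply(freq, function(fj) length(freq) - sum(freq < fj))   # rank as defined above
  list(rank = rnk, bipartition = as.integer(rnk <= lcard))
}
```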

4.4 Base algorithms

The strategies described in Sect. 3 require using a base algorithm to induce binary models. Algorithms that are frequently used as the base algorithm in multi-label experiments are Decision Tree Induction Algorithms (Cherman et al. 2012; Alali and Kubat 2015; Tsoumakas et al. 2009), Logistic Regression (LR) (Montañes et al. 2014; Rauber et al. 2014; Senge et al. 2013; Tsoumakas et al. 2009) and Support Vector Machines (SVM) (Read et al. 2011; Cherman et al. 2012; Li and Zhang 2014; Luaces et al. 2012; Madjarov et al. 2012; Tsoumakas et al. 2009).

Two classification algorithms that have been very successful in classification tasks, but not commonly used for multi-label classification, Random Forest (RF) and eXtreme Gradient Boosting (XGB), complete the set of base algorithms used in our experiments.

The k-Nearest Neighbors and Naive Bayes algorithms were initially considered. They were discarded because they did not show competitive results when compared with the others. Although other base algorithms, such as Multilayer Perceptron, could also be investigated, they were not considered because those selected were able to support the claims addressed in this paper.

4.5 Experimental setup

The experiments were carried out using the R environment. The data sets were handled using code from the mldr package (Charte and Charte 2015). The strategies used R code from the utiml package (Rivolli and de Carvalho 2018). By default, utiml prevents empty predictions (Liu and Chen 2015): whenever a strategy would otherwise predict no label for an example, the label with the highest probability/score is output instead.

Most strategies and base algorithms used in the experiments require the definition of hyperparameter values. Table 3 shows, for each strategy used, the default values recommended by the packages for the main hyperparameters.

Table 3 Hyperparameters values for the strategies used in the experiments

The implementations of the base algorithms used in the experiments come from the packages C50, stats, randomForest, e1071 and xgboost, for C5.0, LR, RF, SVM and XGB, respectively. Table 4 shows the values used for the hyperparameters of each base algorithm, which were those recommended in the corresponding package. SVMt is a tuned version of SVM for the macro-F1 measure, for which the range of values used in a Grid Search procedure is reported. To validate the hyperparameter values, holdout with 70% of the data for training and 30% for validation was adopted for all data sets. SVM was singled out for tuning due to the high effect of hyperparameter values on its performance (Mantovani et al. 2015).

Table 4 Hyperparameter values of the base algorithms used in the experiments

All results were obtained using \(5 \times 2\)-fold cross-validation with paired folds across all combinations of strategies and base algorithms. An iterative algorithm for the stratification of multi-labeled data (Sechidis et al. 2011) was applied to ensure similar label distributions between training and test data.

Unlike previous comparative studies in the multi-label domain, two Bayesian statistical tests were used (Benavoli et al. 2017). The Bayesian hierarchical correlated t-test was used to compare two strategies over multiple data sets, whereas the Bayesian correlated t-test was used for a single data set. When comparing two strategies, the Bayesian statistical test outputs the probability of three situations: strategy 1 is better (left); strategy 2 is better (right); and there is a draw between them (rope), a region of practical equivalence indicating an insignificant difference in performance between the strategies. Benavoli et al. (2017) suggest the interval \([-0.01, 0.01]\), which represents a difference of 1% for a measure whose range is [0, 1]. This interval was used for all evaluation measures, with the exception of hamming-loss, for which the interval was narrowed to \([-0.001,0.001]\) due to its finer granularity. Without this adjustment, no statistical differences would be observed, since hamming-loss divides the number of mistakes made by a strategy by the number of test instances times the number of labels; thus, the larger the data set, the smaller the differences between the strategies.

5 Experimental results

This section presents the experimental results and the main findings from this study. The complete set of experimental results is publicly available online at https://rivolli.github.io/ml-binary-transformation/.

This section first compares the results with the multi-label baselines, followed by a comparison of the most similar strategies. Next, the strategies are compared using fixed base algorithms, which is the traditional approach used in the multi-label literature. Afterwards, the base algorithms are compared by fixing the strategies. In the last set of comparisons, strategies and base algorithms are combined without distinction. Finally, the main findings are highlighted.

5.1 Comparison with the baselines

Despite their importance for evaluating predictive performance, baselines have not been frequently used in multi-label experiments (Metz et al. 2012). As a result, there are no clear standards for selecting baselines for evaluation. Table 5 presents a comprehensive set of results for the different baselines (Sect. 4.3) used in the experiments.

Table 5 Baseline values obtained for each data set and measure

The \(\hbox {baseline}_{{ F1}}\) obtained the highest results for all measures in data sets with high average label frequency and low imbalance degree. The \(\hbox {baseline}_{{ HL}}\), in contrast, had its best results in data sets with low average label frequency and high imbalance degree. Regarding the \(\hbox {baseline}_{{ RL}}\), used to evaluate the ranking measures, the results obtained are inversely correlated with the label cardinality, i.e., the lowest ranking-loss values were observed in data sets with high lCard. Finally, as the number of labels and label sets increases, the results obtained for the \(\hbox {baseline}_{{ SA}}\) decrease.

Figure 1 summarizes the number of strategy/base-algorithm pairs that did not perform statistically significantly better than the baselines for each data set and evaluation measure. With the exception of macro-recall, which can easily be maximized by predicting all labels, and of some other measures in the case of the cal500 data set, at least one strategy/base-algorithm combination was always able to outperform the baselines for all measures and data sets. However, the considerable number of non-zero entries in Fig. 1 corroborates the claim of Metz et al. (2012) that any new strategy should be compared with others using appropriate multi-label baselines.

Fig. 1
figure 1

Number of pairs strategy/base-algorithm that did not perform statistically significantly better than the baselines according to different evaluation measures

5.2 Similarity of strategies

How the base algorithms affect the behavior of the binary transformation strategies is one of the questions investigated in this paper. According to Table 1, it is reasonable to assume that strategies within a group/subgroup are more similar to each other than to the rest. However, the transformation strategies work with a base algorithm, which is used to induce the learning models from the transformed data, and its effect on the strategies has so far been unknown. Following this rationale, the similarity of strategies using different base algorithms is analyzed in two distinct ways: first, by comparing their predictions, which removes the bias of a specific evaluation measure; second, by comparing their predictive performance statistically over distinct evaluation measures, which considers particularities of the learning process.

To compare the predictions obtained by the strategies, the Hamming distance (defined in Eq. 7) is computed for each pair of strategies. The result indicates the difference between the predictions, and therefore, the average value over all data sets and repetitions can indicate how similar or distinct any two given strategies are.

Initially, the strategies were compared by fixing the base algorithm. To do so, they were organized according to their similarity using the Averaged-Linkage hierarchical clustering algorithm (Jain and Dubes 1988). Figure 2 shows the hierarchy of strategies for each base algorithm. Similar results are observed regardless of the base algorithm, with some exceptions. In summary, the similarity of the predictions follows the intuition behind the groups of strategies presented in Table 1.
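A minimal sketch of this analysis is given below, assuming preds is a named list with one binary prediction matrix per strategy, all over the same instances and labels; the Manhattan distance between the flattened predictions equals the Hamming distance up to a constant factor, which does not affect the clustering. The function name is illustrative.

```r
# Minimal sketch of the similarity analysis behind Fig. 2.
strategy_dendrogram <- function(preds) {
  flat <- t(sapply(preds, as.vector))       # one row of 0/1 predictions per strategy
  d <- dist(flat, method = "manhattan")     # proportional to the Hamming distance
  hclust(d, method = "average")             # Averaged-Linkage clustering
}
# plot(strategy_dendrogram(preds)) yields a dendrogram analogous to Fig. 2.
```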

Fig. 2
figure 2

Similarity of strategies according to their bipartition predictions

For all base algorithms, the ensembles EBR and ECC presented the largest difference from all others. The full stacking strategies BR+, DBR and RDBR were grouped together, following different paths depending on the base algorithm. These were the only consistent groupings across all results. Other strategy pairs, such as the chaining strategies CC and NS, were the closest strategies only for the base algorithms C5.0, RF and XGB. Similarly, the pruned stacking strategies MBR and PruDent were not always in the same group.

Regarding the subgroups, the chaining strategies were more similar to full stacking for some base algorithms, and to pruned stacking for others. Pruned stacking was more closely related to BR than full stacking was, which may indicate that, for these strategies, the pruning approach impacted the results more than the use of stacking.

Looking at the base algorithms, the use of C5.0 leads to a larger difference among the results obtained by the strategies, and, on the other hand, RF leads to a higher similarity.

Next, when all the strategy/base-algorithm pairs were compared together (Fig. 3), the similarity between the base algorithms could also be assessed. The base algorithms RF and XGB produced similar results, and likewise SVM and LR. In the latter case, the observed similarity was even stronger, since the same strategies using these two distinct base algorithms were clustered together. On the other hand, SVM and SVMt, despite being the same base algorithm with different hyperparameter values, were not as closely related as SVM and LR were.

Fig. 3
figure 3

Similarity of strategies and base algorithms according to their bipartition predictions

With the exception of the ensembles and of the SVM and LR base algorithms, the strategies are clustered according to the base algorithm, instead of the opposite, i.e., different variants of the same strategy being grouped together. For instance, in this comparison, \(\hbox {BR}_{{ RF}}\) is more similar to \(\hbox {DBR}_{{ RF}}\), a full stacking approach, than to \(\hbox {BR}_{{ XGB}}\). This shows that, for these strategies, their differences might not be strong enough to always be apparent, regardless of the choice of base algorithm.

To identify when small differences in the predictions are significant, the pairs of strategies within a group/subgroup were statistically compared. The hypothesis investigated is that the two performance distributions are equivalent, so that a high rope probability means that the two strategies are similar, whereas a low value indicates that they are indeed dissimilar. Figure 4 presents the rope probability for different pairs of strategies. The pairs are sorted according to their average values, from the most similar to the most distinct (from bottom to top). Likewise, the base algorithms are sorted from left to right.

Fig. 4
figure 4

Rope probabilities from the Bayesian hierarchical test in the comparison of related strategies (y axis) for different base algorithms (x axis). The symbol ‘=’ is used for probabilities greater than 0.95

As previously observed, C5.0 was the base algorithm with the largest number of differences between strategies, whereas RF was the base algorithm with the lowest number of differences. Regardless of the evaluation measure, all pairs were considered similar to each other when RF was used. Additionally, the differences between the strategies were captured in different ways by the evaluation measures. For instance, no differences in F1 results were observed; the ranking measures were more sensitive when comparing the pruned stacking strategies; and hamming-loss and subset-accuracy produced clear differences for the ensemble and full stacking strategies.

In summary, the results presented in this section showed that the base algorithms impact the strategies in different ways. Although all the investigated strategies use the same paradigm (binary transformation), their small differences were captured by the evaluation measures for some of the base algorithms. By varying the base algorithm, a pair of closely related strategies can appear more similar to, or more distinct from, each other, given a specific evaluation measure. Therefore, it can be concluded that some base algorithms are more dominant than others.

5.3 Analysis of strategies

Following the procedure used in many multi-label studies, the strategies are compared with each other by fixing the base algorithm. As distinct base algorithms are considered, the differences between them can be contrasted. Using the Bayesian hierarchical statistical test, each pair of strategies with the same base algorithm is compared. Figure 5 presents the results of the paired test, varying the base algorithms. For each base algorithm, the strategy whose probability of statistically outperforming the other is higher than or equal to 95% is highlighted. Similar strategies (rope \(\ge 95\%\)) are represented with an “=” character, and an empty value indicates inconclusive results (probabilities \(< 95\%\)). The pairs of strategies with similar or inconclusive results for all base algorithms were removed from the chart.

Fig. 5
figure 5

Best strategy according to the results of the Bayesian hierarchical statistical test. The symbol ‘=’ indicates they are similar with statistical significance

The main discrepancies in the results are observed in relation to the ensemble strategies and the base algorithm C5.0. For C5.0, EBR and ECC outperformed all other strategies for most evaluation measures, whereas for other base algorithms, ensembles were outperformed by different strategies. For the measures F1, macro-F1 and macro-recall a more homogeneous result is observed across the base algorithms. In this case, the ensembles are clearly the best choice, probably due to the fact that they internally perform a thresholding calibration that allows them to obtain more balanced precision and recall results regardless of the base algorithm.

To detail the contradictions, Table 6 presents the cases where conflicting probabilities from the statistical test were found across distinct base algorithms. Probabilities indicating that the strategies are similar (rope > 50%) and inconclusive results (all probabilities < 50%) were omitted from the table, which led to the elimination of the columns relative to the base algorithms RF and SVM. For each base algorithm, the highest value is shown in bold, and the cases where the probability is greater than or equal to 95% are underlined.

Table 6 Divergent probabilities found across the base algorithms in the comparison of the strategies

Many observations showed low probabilities for at least one of the base algorithms. This indicates that, although still conflicting, the differences were not so evident according to the statistical test. The most noticeable differences were observed for the ranking-loss measure, probably because the scores produced by the binary models are more sensitive to variation than the bipartitions.

Regarding the base algorithm, C5.0 shows many strongly significant differences, which reinforces the previous conclusions concerning C5.0 behaving very differently from the other base algorithms. Regarding the strategies, all observed differences are related to pairs of strategies where each comes from a different subgroup, e.g., a chaining strategy against a full stacking strategy.

In conclusion, the comparison of the transformation strategies showed different results, for some measures, according to the base algorithm used. In this particular case, all strategies use a binary transformation, which makes them very similar to each other. Given that differences were still observed, it is reasonable to assume that when different transformation strategies are evaluated, it is important to investigate distinct base algorithms.

5.4 Analysis of base algorithms

Exploring a different perspective, the base algorithms are compared by fixing the strategies. The hypothesis investigated is that, for each strategy, some specific base algorithms perform better than the rest. Analogous to the previous section, Fig. 6 presents the results of the paired test for the base algorithms, in which all base algorithms were compared against each other for each one of the strategies. In this test, for each strategy, the base algorithm whose probability of statistically outperforming the other is higher than or equal to 95% is highlighted. Similar base algorithms (rope \(\ge\) 95%) are represented with an “=” character, and an empty value indicates inconclusive results (probabilities < 95%).

Fig. 6
figure 6

Best base algorithm according to the results of the Bayesian hierarchical statistical test. The best option for each pair and strategy is indicated by the first letter of the base algorithm, such that C, L, R, S, St and X indicate C5.0, LR, RF, SVM, SVMt and XGB, respectively. The symbol ‘=’ indicates they are similar with statistical significance

At a glance, RF and XGB were the dominant base algorithms, regardless of the evaluation measure used. However, they have not been used as the base algorithm in previous studies. In contrast, C5.0, followed by LR, obtained the worst results, despite their popularity in multi-label studies.

Probably due to the lack of diversity in the strategies considered, few variations concerning the best base algorithm were observed. Nevertheless, they are related to the ensembles, the most distinctive strategies among the ones investigated, as noticed in Sect. 5.2. An illustrative example that reinforces the investigated hypothesis is related to the ranking-loss measure. For many strategies, RF was the best base algorithm; for the ensembles, however, it was the worst. On the other hand, C5.0, which is not a good choice for many strategies, is a suitable alternative for the ensembles. This is plausible: ensemble strategies, like RF itself, perform better when their base learners are unstable, which is why decision tree induction algorithms (e.g., C5.0) are popular choices inside machine learning ensembles. Since ensemble-based base algorithms such as RF already reduce the variance of their predictions, they are less suitable as members of ensemble strategies.

For some comparisons and evaluation measures, one of the base algorithms was statistically better than the other regardless of the strategy, mainly when C5.0, typically the worse of the two, was involved. In spite of this regularity, the results reinforce the conjecture that the performance of the strategies depends on the base algorithm. In particular, the results of the ensemble strategies presented greater variation concerning the best base algorithm than those of the other strategies. However, additional tests, including a more varied set of strategies, could increase support for this claim.

Some pairs of base algorithms, in particular LR/SVM and SVM/SVMt, presented similar results, with statistical significance, for different evaluation measures. Between LR and SVM, the latter was the best option only for the ensembles, and not for all measures. Comparing SVM and its tuned version, SVMt, although the latter apparently performed better than the former in terms of F1, macro-F1 and macro-recall, the probabilities obtained in the Bayesian test did not reach 95%. Regarding C5.0 and LR, the latter shows clear advantages over the former. Finally, between RF and XGB, the most dominant base algorithms according to the experimental results, the choice depends on the evaluation measure: XGB was the best option for macro-F1 and macro-recall, while RF was the best for hamming-loss, one-error and ranking-loss.

In summary, the results presented in this section provide some support for the claim that the choice of base algorithm can strongly influence a strategy’s performance. Furthermore, some base algorithms performed better on average than others, which again can influence and even distort comparisons of multi-label learning strategies.

5.5 Combining strategies and base algorithms

The previous analyses showed that the ranking of the best strategies varies according to the base algorithm used. To further investigate this issue, all strategy/base-algorithm pairs are evaluated against each other without distinctions. In order to summarize the 60 pairs (strategy/base-algorithm), Annex A presents the ranking for each pair considering all data sets and the strategies’ results using the best base algorithm. The statistical results comparing those strategies are presented in Annex B.

Considering the BR strategy as a more robust baseline, its performance is analyzed in relation to the other strategies. For the measures F1, macro-F1 and macro-recall, the ensembles outperform BR with statistical significance, regardless of the base algorithm. By contrast, BR outperforms them for the measures hamming-loss, macro-precision, ranking-loss and subset-accuracy. In relation to the other strategies, there is no case in which BR is completely outperformed by another strategy, or vice versa. Specifically for the one-error measure, \(\hbox {BR}_{RF}\) achieved the best ranking over all combinations and outperformed the other strategies for 4 or 5 base algorithms.

To complement these results, Table 7 presents, for all the selected pairs, the number and percentage of other pairs that were statistically outperformed with a probability greater than or equal to 95%, according to the Bayesian statistical test. The strategies are sorted from top to bottom based on the number of pairs outperformed.

Table 7 Selected pairs of strategy/base-algorithm and the percentage of other pairs that were statistically outperformed by them
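
The Bayesian test itself is not detailed here. Purely as an illustration, the sketch below implements a simple Bayesian sign test with a region of practical equivalence, one common choice for this kind of pairwise comparison; the function name, the ROPE width and the 95% decision rule are assumptions, not the paper's exact procedure.

```python
import numpy as np

def bayesian_sign_test(scores_a, scores_b, rope=0.01, prior=1.0,
                       n_samples=50_000, seed=0):
    """Posterior probability that pair A outperforms pair B (Bayesian sign test).

    scores_a, scores_b: per-data-set scores of two strategy/base-algorithm pairs
    (higher is better). Differences inside [-rope, +rope] count as ties.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    counts = np.array([
        np.sum(diff > rope),           # A wins
        np.sum(np.abs(diff) <= rope),  # practically equivalent
        np.sum(diff < -rope),          # B wins
    ]) + prior                         # symmetric Dirichlet prior
    theta = rng.dirichlet(counts, size=n_samples)
    return np.mean(theta[:, 0] > theta[:, 2])

# Example decision rule: declare A better than B when the probability >= 0.95.
# bayesian_sign_test(f1_pair_a, f1_pair_b) >= 0.95   (array names are hypothetical)
```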

None of the strategies achieved a consistently good performance over all evaluation measures. The highest results were observed for the ensembles using XGB, which outperformed more than 90% of the other pairs in terms of F1, macro-F1 and macro-recall; consequently, they are the best-ranked strategy/base-algorithm pairs according to the number of outperformed pairs. The lack of a dominant combination for the other measures shows that every strategy obtained a good performance for some base algorithm.

Concerning the base algorithms, the best results were obtained mainly by either RF or XGB; both appear in the table combined with all strategies. In terms of strategies, despite being the simplest, BR presented a good performance for hamming-loss, one-error and ranking-loss.

To sum up, when all strategy/base-algorithm pairs are compared, some strategies are dominant for certain measures regardless of the choice of base algorithm, such as EBR and ECC for macro-F1. Conversely, for some evaluation measures, the choice of base algorithm dominates the results regardless of the chosen strategy, such as RF for ranking-loss. Even though all strategies use binary transformation, and are consequently very similar to each other, statistical differences were observed between them. In conclusion, any future study proposing new transformations should consider an empirical comparison of multiple transformation strategies together with multiple base algorithms.

5.6 Label problems

It can be observed in Fig. 7 that the values of F1 are substantially higher than the values of macro-F1 for many data sets. This occurs when the per-label F1 is very low for one or more labels; in practice, the least frequent labels are often behind these differences. As the previously defined label problems MLP and WLP (Eqs. 13 and 14) provide a possible explanation, their average proportions over all strategy/base-algorithm pairs are presented in Table 8.
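
The exact definitions of MLP and WLP are given by Eqs. 13 and 14. As an illustration only, the sketch below computes proportions along those lines from binary ground-truth and prediction matrices; the function name and the precise reading of the two definitions are assumptions.

```python
import numpy as np

def label_problems(Y_true, Y_pred):
    """Sketch of the MLP/WLP proportions (cf. Eqs. 13 and 14).

    Y_true, Y_pred: binary matrices of shape (n_instances, n_labels).
    MLP: fraction of labels never predicted as relevant for any test instance.
    WLP: fraction of labels that are predicted, but never correctly
         (no true positive for that label).
    """
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)

    predicted     = Y_pred.any(axis=0)             # label predicted at least once
    true_positive = (Y_true & Y_pred).any(axis=0)  # label ever predicted correctly

    mlp = np.mean(~predicted)                  # never predicted
    wlp = np.mean(predicted & ~true_positive)  # predicted, but always wrongly
    return mlp, wlp
```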

Fig. 7 Comparative results of the measures F1 and macro-F1 for all data sets and strategy/base-algorithm pairs

Table 8 Average label problems results over all strategy/base-algorithm pairs

For the sake of clarity, the data sets without problems were removed from the table. For many data sets, the values obtained paint a clear picture, indicating that many labels were wrongly predicted or even never predicted at all. For example, in the worst case, on average 73% of the labels of corel5k (\(\approx\) 159 labels) were wrongly predicted for all test instances, and 55% (\(\approx\) 120 labels) were never predicted. The high values observed for many data sets reveal a previously undetected problem generated by the binary transformation strategies.

This also explains the high macro-precision values compared with the macro-recall values (Fig. 8). The best results for the measures F1, macro-F1 and macro-recall were achieved by the ensemble strategies. Since they use an internal threshold technique for selecting relevant labels, their recall is enhanced and, consequently, their F1 results are also higher. Additional studies are needed to test whether this behaviour is mainly due to this post-processing step used by the ensembles.
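
The internal threshold technique used by the ensembles is not reproduced here. As an illustrative sketch only, the snippet below shows one common form of such post-processing, cardinality-based threshold calibration, in which a global threshold is chosen so that the predicted label cardinality approximates the one observed in the training set; the function and variable names are hypothetical.

```python
import numpy as np

def calibrate_threshold(scores, train_cardinality, grid=None):
    """Pick a global threshold so that the average number of predicted labels
    approximates the label cardinality observed in the training set.

    scores: (n_instances, n_labels) averaged votes/probabilities from the ensemble.
    train_cardinality: mean number of relevant labels per training instance.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    # Choose the threshold whose predicted cardinality is closest to the training one.
    cardinalities = [(scores >= t).sum(axis=1).mean() for t in grid]
    best = int(np.argmin([abs(c - train_cardinality) for c in cardinalities]))
    return grid[best]

# Usage sketch: threshold the ensemble's averaged scores.
# t = calibrate_threshold(ensemble_scores, Y_train.sum(axis=1).mean())
# Y_pred = (ensemble_scores >= t).astype(int)
```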

Fig. 8 Comparative results of the measures macro-precision and macro-recall for all data sets and strategy/base-algorithm pairs

5.7 Summary

The main motivation for this study was to obtain a better understanding of how the base algorithm affects binary transformation strategies. The results presented in the previous sections show that the choice of base algorithm can change the behaviour of binary transformation strategies. Thus, by considering distinct base algorithms, an empirical study involving transformation strategies becomes less biased.

Different rankings of strategies and different statistical results were obtained when different base algorithms were used. Using several base algorithms, however, is not common practice in multi-label research: transformation strategies are usually proposed and compared using a single base algorithm (Read et al. 2011; Madjarov et al. 2012; Montañes et al. 2014; Moyano et al. 2018). The claim that segmenting the comparison by base algorithm yields more consistent results (Moyano et al. 2018) might actually be misleading. In addition, across all assessed measures, no single base algorithm obtained the best results for all strategies. Consequently, comparing strategies using only one fixed base algorithm should be avoided.

Nevertheless, it is still valid to compare strategies using a fixed base algorithm, since this can help to understand the scenarios in which a strategy is improved. For instance, a clear superiority of the ensembles EBR and ECC, regardless of the evaluation measure, was observed when C5.0 was used as the base algorithm. On the other hand, when using LR and RF, the ensemble strategies did not perform as well, showing that some strategies might not be suitable for a given base algorithm. Even though some base algorithms may obtain a better overall performance than others, a diverse set of base algorithms is valuable for determining the conditions under which each strategy is appropriate. Furthermore, although predictive performance is very important, there are other reasons to consider different base classifiers: for example, decision trees provide good interpretability, while logistic regression provides good probability estimates. Therefore, it is useful to consider the relative performance difference rather than simply the top performance.

Considering the large experimental scenario evaluated, the hyperparameter tuning procedure adopted was simple and did not achieve the best results for the optimized measure. The SVMt base algorithm produced distinct results when compared to SVM, but when compared to others, such as RF and XGB, the SVMt results remained closer to those of SVM. Therefore, in this context, hyperparameter tuning can be seen as secondary to base algorithm selection, provided reasonable default parameter settings can be identified for the selected base algorithm. We remark, however, that this depends on the model class in question, since some models are more sensitive to their hyperparameter settings than others. Ideally, if computational resources allow, the base algorithms should be tuned as part of the base-algorithm selection process, especially when the performance difference between them is small. Of course, for large-scale experimental comparisons this may not be feasible, due to the extra degree of complexity involved.
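
For readers who want to reproduce a simple tuning setup, the sketch below shows one possible way to tune an SVM inside a binary-relevance-style wrapper with scikit-learn. The synthetic data, parameter grid and scoring choice are illustrative only and do not correspond to the paper's actual tuning procedure.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Illustrative multi-label data; in the paper's setting this would be a benchmark data set.
X, Y = make_multilabel_classification(n_samples=300, n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Binary-relevance-style wrapper around an SVM, tuning C and gamma of every binary model.
br_svm = MultiOutputClassifier(SVC())
grid = {"estimator__C": [0.1, 1, 10], "estimator__gamma": ["scale", 0.01]}
search = GridSearchCV(br_svm, grid, scoring="f1_micro", cv=3)
search.fit(X_tr, Y_tr)

print(search.best_params_)
print(f1_score(Y_te, search.predict(X_te), average="micro"))
```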

Auto-ML for MLC (de Sá et al. 2017; Wever et al. 2019) can be used to find the best combination of strategy and base algorithm. Furthermore, it can tune the hyperparameters of both, as well as the pipeline of the solution, to obtain the best results for a given problem. Thus, Auto-ML tools offer one answer to the question of which multi-label strategy and base algorithm to use. However, they demand considerable computational resources, which may limit their use.

Regarding the closely-related strategies (BR and pruned stacking; chaining; full stacking; and the ensembles investigated here), their differences were shown to be subtle and circumstantial. Given the relatively small number of data sets considered in empirical studies, finding characteristics of a problem that distinguish strategies is not a trivial task. Thus, the choice among these closely-related strategies might also be seen as merely a matter of convenience, potentially influenced by other practical considerations, such as memory or runtime cost.

The differences between strategies from distinct groups are very consistent across the different evaluation criteria. Therefore, for empirical studies involving binary transformation strategies in MLC, we strongly recommend using strategies from different groups, as well as various base algorithms. Selecting among strategies within the same group is not an easy task; however, it is important to provide some guidance concerning which one to use. We decided to use the average ranking considering all base algorithms (“Appendix 1”).

Table 9 summarises the experimental results, indicating good strategies for different evaluation measures. In practical applications, RF and XGB should be considered as base algorithms, in addition to the usual favorites, which include C5.0, LR and SVM. We note that if the median rank of each base algorithm, or another criterion, were adopted, different recommendations would probably emerge, but the resulting predictive performance would not be expected to differ much.

Table 9 Suggestion of binary transformation strategies to be picked in empirical experiments

6 Conclusion

This paper presented an extensive experimental evaluation of binary transformation strategies for multi-label classification. Different perspectives were considered in addition to the traditional approach of selecting a single base algorithm when comparing multi-label strategies. Thus, bipartition predictions were compared, strategies were compared for fixed base algorithms, base algorithms were compared for fixed strategies, and all possible pairs of strategy and base algorithm were compared with each other.

The main conclusions to draw from this study are:

  • Binary transformation strategies are strongly influenced by the base algorithm used. Consequently, empirical studies should always consider a diverse set of distinct base algorithms.

  • RF and XGB, which showed high predictive performance across a number of strategies, should be considered in the subset of base algorithms selected to perform an empirical study in MLC.

  • The investigated strategies and base algorithms always either misclassified some labels or never predicted them at all. So far this problem has been ignored, mainly because traditional evaluation measures cannot capture it. Nevertheless, it deserves more attention in future studies.

More specific conclusions for multi-label strategies and evaluation measures include:

  • Ensembles using internal threshold selection obtained good results for F1, macro-F1 and macro-recall.

  • Despite being considered a baseline in many studies, BR obtained the best predictive performance for the ranking measures, one-error and ranking-loss. In addition, BR obtained good results for the macro-precision and hamming-loss measures, depending on the choice of base algorithm.

  • The full stacking strategies and the NS strategy, which uses a subset correction procedure, obtained the best results for the subset-accuracy measure.

Future work includes investigating the impact of the base algorithm on other transformations, such as the label-powerset method. Recommending combinations of strategy and base algorithm based on a desired measure and on data set characteristics is another promising direction. Finally, the two types of label prediction failure, MLP and WLP, need to be researched in more depth.