An approximated decision-theoretic algorithm for minimization of the Tversky loss under the multi-label framework

Trajdos, Pawel; Kurzynski, Marek

doi:10.1007/s10044-017-0651-6

Theoretical Advances
Open access
Published: 13 September 2017

Volume 22, pages 389–416, (2019)

Pattern Analysis and Applications

Abstract

In this paper, we address the problem of building a decision-theoretic classifier tailored to minimizing the Tversky loss under the framework of multi-label classification. The proposed approach is a generalization of the Dembczyński $F_{\beta }$ measure optimization algorithm. The introduced technique is based on a series of discrete linear approximations of the Tversky measure. The approximated criterion is then optimized using an original optimization algorithm. To assess the quality of the classification results produced by the designed strategy and to compare its outcome with the results obtained by state-of-the-art approaches, we conducted an experimental study on 24 benchmark datasets. The investigated methods were compared with respect to eleven different quality criteria belonging to three main groups, i.e., example-based, micro-averaged and macro-averaged. During the experimental study, we considered four testing scenarios. Two of them deal with a symmetric variant of the Tversky loss. The remaining scenarios examine the asymmetric Tversky loss. The study shows that, in general, the proposed method is comparable to the Dembczyński approach. However, for both symmetric scenarios and one asymmetric scenario, the average ranks suggest that the proposed approach achieves better classification quality in terms of the example-based Tversky measure. This is an important result because the proposed method was designed to optimize this quality indicator. Additionally, the introduced procedure can outperform the reference methods with respect to the zero-one loss under all testing scenarios.

1 Introduction

Traditional single-label classification is concerned with learning from a set of examples, each of which is assigned a single label (class) from a set of disjoint labels. In other words, only one label is relevant to a given object. Nevertheless, the assumption that labels are disjoint does not always hold. This issue emerges in many real-life recognition tasks: for example, a photograph may be tagged with such labels as lake, forest, sky and mountains. This label set constitutes a complete description of the object, and omitting one of the labels in the classifier outcome must be considered a classification error. Consequently, traditional single-label classification methods cannot be directly employed to solve a problem which violates the assumption that classes are disjoint. Strictly speaking, they are capable of predicting only a single category per object, which is insufficient in the case of multi-label data. A solution to this problem is to employ the multi-label (ML) classification framework, which can be seen as a generalization of the classical recognition task [20, 49]. In multi-label recognition, it is assumed that an object can be simultaneously assigned to more than one class. Moreover, multi-label learning also covers two extreme cases: an object is assigned all possible labels or none of them (the interpretation of these cases is specific to the domain of the considered task).

Multi-label learning is employed in a variety of practical applications, the most widespread of which are text classification [29, 30] and multimedia classification, including classification of video objects [12], images [4, 57] and music [43]. Another important field of application is bioinformatics, where multi-label classification is a powerful tool for the prediction of gene functions [44], protein functions [55, 56] or drug resistance [24], to name only a few. Nowadays, multi-label classification is becoming more and more common in the machine learning community. This growth is mainly caused by the increasing amount of data which can be effectively modelled using the multi-label framework [19]. A good example of this phenomenon is the growth in the number of protein and nucleotide sequences stored in EMBL databases [58].

This paper aims to provide a flexible, effective and efficient classification procedure that is tailored to optimize the Tversky measure. We focus on the Tversky measure because it is a far more general quality indicator than commonly used loss functions such as the $F_{\beta }$ measure, the Jaccard measure, the zero-one loss, the false discovery rate or the false negative rate. Namely, all of the aforesaid quality indicators can be expressed in terms of the Tversky loss by setting proper values of its parameters. Consequently, building an effective classifier aimed at optimization of this measure can provide a general tool that can cover many user-specific loss criteria. To achieve our goal, we propose a generalization of the method described by Dembczyński in [8]. The introduced technique approximates the Tversky measure using a set of discrete linear functions. The task described by the linear approximation is solved using the inner–outer minimization approach in a way analogous to the solution proposed in the original approach.

The proposed method was also experimentally compared with the reference methods. The experimental procedure employs 24 benchmark datasets and 11 quality criteria. We considered quality criteria belonging to three main groups, i.e., example-based, micro-averaged and macro-averaged. During the experimental study, we considered four testing scenarios. Two of them deal with a symmetric variant of the Tversky loss. The remaining scenarios examine the asymmetric Tversky loss. The conducted experimental study provides empirical evidence that the proposed method can outperform the reference algorithms in terms of the example-based Tversky measure and the zero-one loss.

The paper is organized as follows. Section 2 provides a description of the work related to the topic of this paper. Section 3 introduces a formal description of the proposed method. Section 4 describes the experimental setup. The obtained results are presented and discussed in Sect. 5. The paper is concluded in Sect. 6.

2 Related work

Multi-label classifiers predict a vector response. Due to the complexity of the output structure of multi-label models, it is possible to evaluate the quality of multi-label classification using many different criteria such as the Hamming distance, the zero-one subset loss [10] or the $F_{\beta }$ measure [41]. The criteria also differ in the method of combining well-known single-label quality measures to produce a multi-label quality criterion. Namely, we can distinguish three possible ways: example-based, macro-averaged and micro-averaged [33]. Algorithms are usually designed to optimize a chosen quality measure, and a classifier designed to optimize one quality criterion is usually suboptimal under another quality criterion [10]. In this study, we focus on algorithms which are tailored to maximize classification quality expressed in terms of an asymmetric information retrieval measure known as the Tversky measure [53]. The Tversky measure is more general than the $F_{\beta }$ measure, so building an effective algorithm dedicated to this function allows us to express and optimize a wider range of quality criteria such as the $F_{\beta }$ measure, the Jaccard measure [26] or the zero-one subset loss [10] (relations between the aforesaid measures are discussed in Sect. 3.1). Multi-label approaches aimed at optimizing the aforesaid quality criteria (including the Tversky measure) can be divided into empirical utility maximization methods and decision-theoretic methods [35].

The empirical utility maximization approaches build classifiers which are designed to obtain the optimal value of a quality measure defined on the learning set. Learning these models is usually done by determining the values of the model parameters that optimize the quality criterion. After that, the model with the determined parameters is used to calculate the classifier output for a test instance. Algorithms from this group are commonly based on structured SVMs [14, 37], thresholding strategies [38,39,40] or regression [27]. The structured output SVM is a generalization of the classical SVM algorithm [48]. The procedure is tailored to deal with classification problems whose output is more complex than a single class. The approach can be adapted to the task of multi-label classification in a straightforward way [14]. As in the classical SVM approach, it is also possible to utilize different kernel functions under the considered approach [60]. The thresholding procedures mainly employ a state-of-the-art multi-label classifier that returns a set of label supports. The outcome of the classifier is then converted to a binary prediction using a set of dynamically determined thresholds [39, 40]. The aforementioned approaches were originally harnessed to optimize the $F_{\beta }$ measure, but they can also be employed to optimize the Tversky measure.

On the other hand, the decision-theoretic methods use the learning set to estimate the parameters of the underlying probability model. In the inference phase, the values of the probability distributions are calculated according to the estimated model. A few methods based on this framework have been proposed [6, 8, 28, 35]. Chai [6] tackled the problem by expressing the expected loss as a recursive function and then solved the resulting optimization task using dynamic programming. Another approach was provided by Jansche [28], who proved that the problem can be effectively solved via inner and outer optimization. To perform the optimization task, the space of all possible solutions is divided into non-overlapping equivalence classes. Then, the optimal solution is found for each of the equivalence classes separately. Finally, the outer optimization is performed in order to determine the globally optimal solution. The author designed a method based on a Lebesgue integral and a two-tape automaton. Inspired by this methodology, Dembczyński et al. [8] proposed an alternative inner–outer optimization scheme which, in contrast to the formerly mentioned methods, does not make any assumptions about the underlying probability distribution. Unfortunately, the Dembczyński method, contrary to the remaining algorithms, cannot be directly employed to minimize the loss function based on the Tversky measure. An alternative set of equivalence classes was described by Nan et al. [35]. Additionally, the authors presented a heuristic procedure that allows them to reduce the computational burden.

Cheng et al. analysed the Classifier Chain approach [40], which extends the basic binary relevance approach [1], under a probabilistic formalism. Their work showed that the original method is a simplified strategy of a more general framework [7] of conditional joint mode estimation. During the inference phase, the simplified approach performs a greedy search procedure that follows only a single path in the tree of all possible solutions. The authors proposed to employ an inference algorithm that performs an exhaustive search in order to determine the optimal solution. Although the routine enables us to find the optimal solution in terms of any loss function (including $F_{\beta }$ and Tversky), its computational complexity grows exponentially with the number of labels ($2^{L}$, where $L$ is the number of labels). As a consequence, the computational burden of the approach is extremely high, and it can be directly employed only when the number of labels is relatively low. However, this drawback has been addressed by applying heuristic methods for finding the optimal path in the tree of possible solutions [10, 32].

3 Proposed method

3.1 Preliminaries

In the introductory section, we outlined the basic description of the multi-label classification task. Now, let us give a more formal description of the investigated issue. An object $x$ is now interpreted as a vector $x=\left[ {x}_{1},{x}_{2},\ldots ,{x}_{d} \right]$ that comes from the $d$-dimensional input space ${\mathcal {X}}$. The set of labels related to the object is indicated by a binary vector of length $L$: $y=\left[ {y}_{1},{y}_{2},\ldots ,{y}_{L} \right]$, where $y_{i}=1$ ($y_{i}=0$) denotes that the $i$-th label is relevant (irrelevant) to the object $x$. As a consequence, the output space is defined as ${\mathcal {Y}}=\{0,1\}^{L}$, which denotes the set of all possible binary vectors of length $L$. Additionally, it is assumed that the object $x$ and its set of labels $y$ are realizations of the corresponding random vectors $\mathbf {X}=\left[ \mathbf {X}_{1},\mathbf {X}_{2},\ldots ,\mathbf {X}_{d}\right]$, $\mathbf {Y}=\left[ \mathbf {Y}_{1},\mathbf {Y}_{2},\ldots ,\mathbf {Y}_{L}\right]$ and that the joint probability distribution $P(\mathbf {X},\mathbf {Y})$ on ${\mathcal {X}}\times {\mathcal {Y}}$ is known.

Relevant labels are assigned to instances by an unknown mapping $f:{\mathcal {X}} \mapsto {\mathcal {Y}}$. A classifier function ${\psi : {\mathcal {X}}\mapsto {\mathcal {Y}}}$ is an approximation of the unknown mapping. Finding the classifier function is usually stated as a problem of optimal decision making given a loss function. The loss function ${\mathcal {L}}: {\mathcal {Y}}\times {\mathcal {Y}}\mapsto {\mathcal {R}}_{+}$ assesses the similarity between vectors from the output space. Without loss of generality, it is assumed that only normalized loss functions ${\mathcal {L}}:{\mathcal {Y}}\times {\mathcal {Y}}\mapsto \left[ 0,1 \right]$ are considered. In general, optimal decision making aims to find a classifier $\psi ^{*}$ that minimizes the expected loss over the joint probability distribution $P(\mathbf {X},\mathbf {Y})$:

$$\begin{aligned} \psi ^{*} = \arg \!\min _{\psi }{\mathbb {E}}_{\mathbf {X}\mathbf {Y}}\left[ {\mathcal {L}}(\psi (\mathbf {X}),\mathbf {Y}) \right] , \end{aligned}$$

(1)

where ${\mathbb {E}}$ is the expected value operator. The above-mentioned classifier can be found in a pointwise way by the Bayes optimal decisions

$$\begin{aligned} h^{*}(x)= & {} \arg \!\min _{h\in {\mathcal {Y}}}{\mathbb {E}}_{\mathbf {Y}|\mathbf {X}=x}\left[ {\mathcal {L}}(h,\mathbf {Y}) \right] \nonumber \\= & {} \arg \!\min _{h\in {\mathcal {Y}}}\sum _{y\in {\mathcal {Y}}}{\mathcal {L}}(h,y)P(y|x) \end{aligned}$$

(2)

where $h^{*}(x)=\left[ {h}_{1}^{*}(x),{h}_{2}^{*}(x),\ldots ,{h}_{L}^{*}(x) \right] \in {\mathcal {Y}}$ denotes an optimal prediction for instance $x$ and $P(y|x)=P(\mathbf {Y}=y|\mathbf {X}=x)$ is the conditional probability distribution of vector $y$ given an object $x$. It is clear that the optimal solution cannot be efficiently found via exhaustive search because the size of the output space is $|{\mathcal {Y}}|= 2^{L}$.
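To make the cost of Eq. (2) concrete, the following is a minimal sketch of the exhaustive search, not the method proposed in this paper. It assumes that $P(y|x)$ is supplied as a dictionary over label vectors and uses the normalized Hamming loss purely as an example loss; the names `bayes_optimal_decision`, `hamming_loss` and `p_y_given_x` are hypothetical.

```python
from itertools import product


def hamming_loss(h, y):
    """Normalized Hamming loss, used here only as an example loss in [0, 1]."""
    return sum(hi != yi for hi, yi in zip(h, y)) / len(h)


def bayes_optimal_decision(cond_prob, loss, num_labels):
    """Exhaustive form of Eq. (2): argmin over h of sum_y loss(h, y) * P(y|x).

    cond_prob maps label vectors (tuples of 0/1) to P(y|x); vectors that are
    absent from the dictionary are treated as having zero probability.
    """
    best_h, best_risk = None, float("inf")
    for h in product((0, 1), repeat=num_labels):  # enumerates all 2**L candidates
        risk = sum(loss(h, y) * p for y, p in cond_prob.items())
        if risk < best_risk:
            best_h, best_risk = h, risk
    return best_h, best_risk


# Hypothetical conditional distribution P(y|x) for L = 3 labels.
p_y_given_x = {(1, 1, 0): 0.6, (1, 0, 0): 0.2, (0, 0, 1): 0.2}
h_star, risk = bayes_optimal_decision(p_y_given_x, hamming_loss, 3)
```

The loop visits every element of ${\mathcal {Y}}$, so its running time doubles with each additional label, which is why the exhaustive search is impractical for larger $L$.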

Although many loss functions can be adopted under the multi-label classification methodology, in this paper we focus on learning algorithms that optimize the Tversky loss $T_{\gamma ,\delta }$, which is defined as follows:

$$\begin{aligned} T_{{\gamma ,\delta }}(h(x),y(x))= & {} 1-\frac{{h(x)}\cdot {y(x)} }{{\gamma \left\|h(x)\right\|_1+\delta \left\| y(x)\right\|_1 }{+\eta\, {h(x)}\cdot {y(x)} }}, \end{aligned}$$

(3)

where $h(x)\in {\mathcal {Y}}$ is the prediction of a classifier, $y(x)\in {\mathcal {Y}}$ indicates the ground truth labels for the instance $x$, and $\eta =1-\gamma -\delta$. The parameters $\gamma >0$ and $\delta >0$ can be interpreted as weights related to the false positive rate and the false negative rate, respectively. A growth in one of those weights increases the penalty related to the type of error associated with the weight. Additionally, $\left\|\cdot \right\|_1$ is the $L_1$ norm, and ${h(x)}\cdot {y(x)}$ is the dot product of a given pair of vectors. For the special case when $\left\|h(x)\right\|_1=\left\|y(x)\right\|_1=0$, it is assumed that $T_{\gamma ,\delta }(h(x),y(x))=0$. The above-mentioned loss function is worth considering because it is more general than the loss functions that are usually applied to build a multi-label classifier. Namely, using this loss function, it is possible to express such loss functions as $F_{\beta }(h(x),y(x))$, the Jaccard loss $J(h(x),y(x))$ or the zero-one loss $Z\,(h(x),y(x))$:

$$\begin{aligned} F_{1}(h(x),y(x))= & {} \,T_{\gamma =0.5,\delta =0.5}(h(x),y(x))\\ F_{\beta }(h(x),y(x))= & {} \,T_{\gamma =\frac{1}{1+\beta ^{2}},\delta =\frac{\beta ^{2}}{1+\beta ^{2}}}(h(x),y(x))\\ J(h(x),y(x))= & {} \,T_{\gamma =1,\delta =1}(h(x),y(x))\\ Z(h(x),y(x))= & {}\, T_{\gamma \rightarrow \infty ,\delta \rightarrow \infty }(h(x),y(x)) \end{aligned}$$

It is also possible to express such measures as false discovery rate $\mathrm {FDR}(h(x),y(x))$ and false negative rate $\mathrm {FNR}(h(x),y(x))$:

$$\begin{aligned} \mathrm {FDR}(h(x),y(x))=\, & {} T_{\gamma =1,\delta =0}(h(x),y(x))\\ \mathrm {FNR}(h(x),y(x))=\, & {} T_{\gamma =0,\delta =1}(h(x),y(x))\\ \end{aligned}$$
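The definition in Eq. (3) and the special cases listed above can be checked numerically. The sketch below is an illustration only; the function names `tversky_loss` and `f_beta_loss` are hypothetical, and raising an error when the denominator vanishes is an assumption, since the text does not define that boundary case (it can occur only when $\gamma =0$ or $\delta =0$).

```python
def tversky_loss(h, y, gamma, delta):
    """Tversky loss of Eq. (3) for binary vectors h (prediction) and y (ground truth)."""
    tp = sum(hi * yi for hi, yi in zip(h, y))  # dot product h . y
    norm_h, norm_y = sum(h), sum(y)            # L1 norms of 0/1 vectors
    if norm_h == 0 and norm_y == 0:
        return 0.0                             # special case stated in the text
    eta = 1.0 - gamma - delta
    denom = gamma * norm_h + delta * norm_y + eta * tp
    if denom == 0:
        # Assumption: the text does not define this boundary case
        # (possible when gamma = 0 or delta = 0), so we refuse to guess.
        raise ValueError("Tversky loss is undefined for a zero denominator")
    return 1.0 - tp / denom


def f_beta_loss(h, y, beta):
    """F_beta loss expressed through the Tversky loss, following the identities above."""
    b2 = beta ** 2
    return tversky_loss(h, y, gamma=1.0 / (1.0 + b2), delta=b2 / (1.0 + b2))


h = (1, 1, 1, 0)  # prediction: one true positive, two false positives
y = (1, 0, 0, 1)  # ground truth: one label is missed (one false negative)

f1 = tversky_loss(h, y, 0.5, 0.5)       # F_1 loss
jaccard = tversky_loss(h, y, 1.0, 1.0)  # Jaccard loss
fdr = tversky_loss(h, y, 1.0, 0.0)      # false discovery rate
fnr = tversky_loss(h, y, 0.0, 1.0)      # false negative rate
f2 = f_beta_loss(h, y, 2.0)             # F_2 loss
```

For this pair of vectors the false discovery rate exceeds the false negative rate because the prediction contains two false positives but only one false negative, which illustrates how $\gamma$ and $\delta$ shift the penalty between the two error types.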

The differences between the aforementioned loss functions become clearer when we express them using the binary confusion matrix resulting from the comparison of two binary strings $h(x)$ and $y(x)$ (Table 1). Entries of the matrix are defined as follows:

$$\begin{aligned} \mathrm {TP}= & {} \sum _{i=1}^{L}h_{i}(x)y_{i}(x)\end{aligned}$$

(4)

$$\begin{aligned} \mathrm {TN}= & {} \sum _{i=1}^{L}\left[ 1-h_{i}(x)\right] \left[ 1-y_{i}(x)\right] \end{aligned}$$

(5)

$$\begin{aligned} \mathrm {FP}= & \sum _{i=1}^{L}h_{i}(x)\left[ 1-y_{i}(x)\right] \end{aligned}$$

(6)

$$\begin{aligned} \mathrm {FN}= &\sum _{i=1}^{L}\left[ 1-h_{i}(x)\right] y_{i}(x). \end{aligned}$$

(7)

Then, the measures can be rewritten:

$$Z\left( {h\left( x \right),y\left( x \right)} \right) = \left[\kern-0.15em\left[ {FN + FP \ge 1} \right]\kern-0.15em\right]$$

(8)

$$\begin{aligned} F_{\beta }(h(x),y(x))= \frac{\beta ^2FN + FP}{(1+\beta ^{2})TP + \beta ^{2}FN + FP} \end{aligned}$$

(9)

$$\begin{aligned} J(h(x),y(x))= \frac{FN + FP}{TP + FN + FP}\end{aligned}$$

(10)

$$\begin{aligned} T_{\gamma ,\delta }(h(x),y(x))= \frac{\delta FN + \gamma FP}{TP +\delta FN + \gamma FP}, \end{aligned}$$

(11)

where $\left[\kern-0.15em\left[ \cdot \right]\kern-0.15em\right]$ is the Iverson bracket.
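The step from Eq. (3) to Eq. (11) is not spelled out in the text. A short derivation, sketched here using only the definitions in Eqs. (4)–(7), the fact that $h(x)$ and $y(x)$ are binary, and $\gamma +\delta +\eta =1$, reads:

$$\begin{aligned} \left\| h(x)\right\| _1= & {} \mathrm {TP}+\mathrm {FP},\quad \left\| y(x)\right\| _1=\mathrm {TP}+\mathrm {FN},\quad h(x)\cdot y(x)=\mathrm {TP},\\ \gamma \left\| h(x)\right\| _1+\delta \left\| y(x)\right\| _1+\eta \,h(x)\cdot y(x)= & {} (\gamma +\delta +\eta )\mathrm {TP}+\gamma \mathrm {FP}+\delta \mathrm {FN}=\mathrm {TP}+\delta \mathrm {FN}+\gamma \mathrm {FP},\\ T_{\gamma ,\delta }(h(x),y(x))= & {} 1-\frac{\mathrm {TP}}{\mathrm {TP}+\delta \mathrm {FN}+\gamma \mathrm {FP}}=\frac{\delta \mathrm {FN}+\gamma \mathrm {FP}}{\mathrm {TP}+\delta \mathrm {FN}+\gamma \mathrm {FP}}. \end{aligned}$$

Substituting $\gamma =\frac{1}{1+\beta ^{2}}$, $\delta =\frac{\beta ^{2}}{1+\beta ^{2}}$ and multiplying the numerator and denominator by $1+\beta ^{2}$ gives Eq. (9), while $\gamma =\delta =1$ gives Eq. (10).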

Table 1 Confusion matrix resulting from comparison of two binary strings $h(x)$ and $y(x)$
