1 Introduction

Traditional single-label classification is concerned with learning from a set of examples, each of which is assigned a single label (class) from a disjoint set of labels. In other words, only one label is relevant to a given object. Nevertheless, the assumption that labels are disjoint does not always hold. This issue emerges in many real-life recognition tasks: for example, a photograph may be tagged with labels such as lake, forest, sky and mountains. This label set constitutes a complete description of the object, and omitting any of the labels from the classifier outcome must be considered a classification error. Consequently, traditional single-label classification methods cannot be directly employed to solve a problem that violates the assumption that classes are disjoint. Strictly speaking, they are capable of predicting only a single category per object, which is insufficient for multi-label data. A solution to this problem is to employ the multi-label (ML) classification framework, which can be seen as a generalization of the classical recognition task [20, 49]. In multi-label recognition, it is assumed that an object may be simultaneously assigned to more than one class. Moreover, multi-label learning also covers two extreme cases: an object is assigned all possible labels or none of them (the interpretation of these cases is specific to the domain of the considered task).

Multi-label learning is employed in a variety of practical applications, the most widespread being text classification [29, 30] and multimedia classification, including classification of video objects [12], images [4, 57] and music [43]. Another important field of application is bioinformatics, where multi-label classification is a powerful tool for predicting gene functions [44], protein functions [55, 56] or drug resistance [24], to name only a few. Multi-label classification is becoming increasingly common in the machine learning community. This growth is mainly caused by the increasing amount of data that can be effectively modelled using the multi-label framework [19]. A good example of this phenomenon is the growth in the number of protein and nucleotide sequences stored in EMBL databases [58].

This paper is aimed at providing a flexible, effective and efficient classification procedure that is tailored to optimize the Tversky measure. We are focused on the Tversky measure because it is a far more general quality indicator than commonly used loss functions such as \(F_{\beta }\) measure, Jaccard measure, the zero-one loss, false discovery rate or false negative rate. Namely, all aforesaid quality indicators can be expressed in terms of the Tversky loss by setting proper values of its parameters. Consequently, building an effective classifier aimed at optimization of this measure can provide a general tool that can cover many user-specific loss criteria. To achieve our goal, we propose a generalization of the method described by Dembczyński in [8]. The introduced technique approximates the Tversky measure using a set of discrete linear functions. The task described by the linear approximation is solved using the inner–outer minimization approach in a way analogous to the solution proposed in the original approach.

The proposed method was also experimentally compared to the reference methods. The experimental procedure employs 24 benchmark datasets and 11 quality criteria. We considered quality criteria belonging to three main groups, i.e., example-based, micro-averaged and macro-averaged. During the experimental study, we considered four testing scenarios. Two of them deal with a symmetric variant of the Tversky loss. The remaining two examine the asymmetric Tversky loss. The conducted experimental study provides empirical evidence that the proposed method can outperform the reference algorithms in terms of the example-based Tversky measure and the zero-one loss.

The paper is organized as follows. Section 2 describes work related to the topic of this paper. Section 3 introduces a formal description of the proposed method. Section 4 describes the experimental setup. The obtained results are presented and discussed in Sect. 5. The paper is concluded in Sect. 6.

2 Related work

Multi-label classifiers predict a vector response. Due to the complexity of the output structure of multi-label models, the quality of multi-label classification can be evaluated using many different criteria, such as the Hamming distance, the zero-one subset loss [10] or the \(F_{\beta }\) measure [41]. The criteria also differ in the method of combining well-known single-label quality measures to produce a multi-label quality criterion. Namely, we can distinguish three possible ways: example-based, macro-averaged and micro-averaged [33]. Algorithms are usually designed to optimize a chosen quality measure, and a classifier designed to optimize one quality criterion is usually suboptimal under another [10]. In this study, we focus on algorithms that are tailored to maximize classification quality expressed in terms of an asymmetric information retrieval measure known as the Tversky measure [53]. The Tversky measure is more general than the \(F_{\beta }\) measure, so building an effective algorithm dedicated to this function allows us to express and optimize a wider range of quality criteria, such as the \(F_{\beta }\) measure, the Jaccard measure [26] or the zero-one subset loss [10] (relations between these measures are discussed in Sect. 3.1). Multi-label approaches aimed at optimizing the aforesaid quality criteria (including the Tversky measure) can be broadly divided into empirical utility maximization methods and decision-theoretic methods [35].

The empirical utility maximization approaches build classifiers that are designed to obtain the optimal value of the quality measure on the learning set. Learning these models is usually done by determining the values of the model parameters that optimize the quality criterion. After that, the model with the determined parameters is used to calculate the classifier output for a test instance. Algorithms from this group are commonly based on structured SVMs [14, 37], thresholding strategies [38, 39, 40] or regression [27]. The structured output SVM is a generalization of the classical SVM algorithm [48]. The procedure is tailored to deal with classification problems whose output is more complex than a single class. The approach can be adapted to the task of multi-label classification in a straightforward way [14]. As in the classical SVM approach, it is also possible to utilize different kernel functions under the considered approach [60]. The thresholding procedures mainly employ a state-of-the-art multi-label classifier that returns a set of label supports. The outcome of the classifier is then converted to a binary prediction using a set of dynamically determined thresholds [39, 40]. The aforementioned approaches were originally used to optimize the \(F_{\beta }\) measure, but they can also be employed to optimize the Tversky measure.

On the other hand, the decision-theoretic methods use the learning set to estimate the parameters of the underlying probability model. In the inference phase, the values of the probability distributions are calculated according to the estimated model. A few methods based on this framework have been proposed [6, 8, 28, 35]. Chai [6] tackled the posed problem by expressing the expected loss as a recursive function and then solved the resulting optimization task using dynamic programming. Another approach to the above-mentioned issue was provided by Jansche [28], who proved that the posed problem can be effectively solved via inner and outer optimization. To perform the optimization task, the space of all possible solutions is divided into non-overlapping equivalence classes. Then, the optimal solution is found for each of the equivalence classes separately. Finally, the outer optimization is performed in order to determine the globally optimal solution. The author designed a method based on the Lebesgue integral and a two-tape automaton. Inspired by this methodology, Dembczyński et al. [8] proposed an alternative inner–outer optimization scheme which, in contrast to the formerly mentioned methods, does not make any assumptions about the underlying probability distribution. Unfortunately, the Dembczyński method, contrary to the remaining algorithms, cannot be directly employed to minimize the loss function based on the Tversky measure. An alternative set of equivalence classes was described by Nan et al. [35]. Additionally, the authors presented a heuristic procedure that allows them to reduce the computational burden.

Cheng et al. analysed the Classifier Chain approach [40], which extends the basic binary relevance approach [1], under a probabilistic formalism. Their work showed that the original method is a simplified strategy of a more general framework [7] of conditional joint mode estimation. During the inference phase, the simplified approach performs a greedy search procedure that follows only a single path in a tree of all possible solutions. The authors proposed to employ an inference algorithm that performs an exhaustive search in order to determine the optimal solution. Although the routine enables us to find the optimal solution in terms of any loss function (including \(F_{\beta }\) and Tversky), its computational complexity grows exponentially with the number of labels (\(2^{L}\), where L is the number of labels). As a consequence, the computational burden of the approach is extremely high, and it can be directly employed only when the number of labels is relatively low. However, this drawback has been addressed by applying heuristic methods for finding the optimal path in the tree of possible solutions [10, 32].

3 Proposed method

3.1 Preliminaries

In the introductory section, we outlined the basic description of the multi-label classification task. Now, let us give a more formal description of the investigated issue. An object \(x\) is now interpreted as a vector \(x=\left[ {x}_{1},{x}_{2},\ldots ,{x}_{d} \right]\) that comes from the d-dimensional input space \({\mathcal {X}}\). The set of labels related to the object is indicated by a binary vector of length L: \(y=\left[ {y}_{1},{y}_{2},\ldots ,{y}_{L} \right]\), where \(y_{i}=1\) (\(y_{i}=0\)) denotes that the \(i\)-th label is relevant (irrelevant) to the object \(x\). As a consequence, the output space is defined as \({\mathcal {Y}}=\{0,1\}^{L}\), which denotes the set of all possible binary vectors of length L. Additionally, it is assumed that the object \(x\) and its set of labels \(y\) are realizations of the corresponding random vectors \(\mathbf{X} =\left[ \mathbf{X}_{1},\mathbf{X}_{2},\ldots ,\mathbf{X}_{d}\right]\), \(\mathbf{Y} =\left[ \mathbf{Y}_{1},\mathbf{Y}_{2},\ldots ,\mathbf{Y}_{L}\right]\) and the joint probability distribution \(P(\mathbf X ,\mathbf Y )\) on \({\mathcal {X}}\times {\mathcal {Y}}\) is known.

Relevant labels are assigned to instances by an unknown mapping \(f:{\mathcal {X}} \mapsto {\mathcal {Y}}\). A classifier function \({\psi : {\mathcal {X}}\mapsto {\mathcal {Y}}}\) is an approximation of the unknown mapping. Finding the classifier function is usually stated as a problem of optimal decision making given a loss function. The loss function \({\mathcal {L}}: {\mathcal {Y}}\times {\mathcal {Y}}\mapsto {\mathcal {R}}_{+}\) assesses the dissimilarity between vectors from the output space. Without loss of generality, it is assumed that only normalized loss functions \({\mathcal {L}}:{\mathcal {Y}}\times {\mathcal {Y}}\mapsto \left[ 0,1 \right]\) are considered. In general, optimal decision making aims to find a classifier \(\psi ^{*}\) that minimizes the expected loss over the joint probability distribution \(P(\mathbf X ,\mathbf Y )\):

$$\begin{aligned} \psi ^{*} = \arg \!\min _{\psi }{\mathbb {E}}_\mathbf{X {} \mathbf Y }\left[ {\mathcal {L}}(\psi (\mathbf X ),\mathbf {Y}) \right] , \end{aligned}$$
(1)

where \({\mathbb {E}}\) is the expected value operator. The above-mentioned classifier can be found in a pointwise way by the Bayes optimal decisions

$$\begin{aligned} h^{*}(x)= & {} \arg \!\min _{h\in {\mathcal {Y}}}{\mathbb {E}}_\mathbf{Y |\mathbf X =x}\left[ {\mathcal {L}}(h,\mathbf {Y}) \right] \nonumber \\= & {} \arg \!\min _{h\in {\mathcal {Y}}}\sum _{y\in {\mathcal {Y}}}{\mathcal {L}}(h,y)P(y|x) \end{aligned}$$
(2)

where \(h^{*}(x)=\left[ {h}_{1}^{*}(x),{h}_{2}^{*}(x),\ldots ,{h}_{L}^{*}(x) \right] \in {\mathcal {Y}}\) denotes an optimal prediction for instance \(x\) and \(P(y|x)=P(Y=y|X=x)\) is the conditional probability distribution of vector \(y\) given an object \(x\). It is clear that the optimal solution cannot be efficiently found via exhaustive search because the size of the output space is \(|{\mathcal {Y}}|= 2^{L}\).
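To make the cost of exhaustive search concrete, the following is a minimal sketch of the pointwise Bayes decision (2), assuming that the posterior \(P(y|x)\) is available as an explicit table. The function names and the toy posterior are hypothetical and serve only as an illustration.

```python
from itertools import product

def bayes_optimal(loss, posterior, L):
    """Exhaustive search for the minimizer of the expected loss in Eq. (2).

    posterior maps each label vector y (a tuple of 0/1) to P(y|x);
    vectors missing from the dict are treated as having probability 0.
    """
    best_h, best_risk = None, float("inf")
    for h in product((0, 1), repeat=L):      # all 2**L candidate predictions
        risk = sum(loss(h, y) * p for y, p in posterior.items())
        if risk < best_risk:
            best_h, best_risk = h, risk
    return best_h, best_risk

def hamming_loss(h, y):
    # Fraction of label positions on which h and y disagree.
    return sum(a != b for a, b in zip(h, y)) / len(h)

def zero_one_loss(h, y):
    # 1 unless the whole label vector is predicted correctly.
    return float(h != y)

# Hypothetical posterior P(y|x) for one instance with L = 2 labels.
posterior = {(0, 0): 0.40, (1, 1): 0.35, (1, 0): 0.25}
h_hamming, _ = bayes_optimal(hamming_loss, posterior, 2)
h_zero_one, _ = bayes_optimal(zero_one_loss, posterior, 2)
print(h_hamming, h_zero_one)
```

On this toy posterior the two losses lead to different optimal predictions, and the loop over `product((0, 1), repeat=L)` visits all \(2^{L}\) candidates, which is what makes the approach impractical for larger \(L\).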

Although many loss functions can be adopted under the multi-label classification methodology, in this paper we focus on learning algorithms that optimize the Tversky loss \(T_{\gamma ,\delta }\), which is defined as follows:

$$\begin{aligned} T_{{\gamma ,\delta }}(h(x),y(x))= & {} 1-\frac{{h(x)}\cdot {y(x)} }{{\gamma \left\|h(x)\right\|_1+\delta \left\| y(x)\right\|_1 }{+\eta\, {h(x)}\cdot {y(x)} }}, \end{aligned}$$
(3)

where \(h(x)\in {\mathcal {Y}}\) is the prediction of a classifier, \(y(x)\in {\mathcal {Y}}\) indicates the ground truth labels for the instance \(x\), \(\eta =1-\gamma -\delta\), \(\gamma >0\) and \(\delta >0\) can be interpreted as weights related to False Positive rate and False Negative rate, respectively. A growth in one of those weights increases the penalty related to the type of error associated with the weight. Additionally \(\left\|\cdot \right\|_1\) is the \(L_1\) norm, and \({h(x)}\cdot {y(x)}\) is a dot product of a given pair of vectors. For a special case when \(\left\|h(x)\right\|_1=\left\|y(x)\right\|_1=0\), it is assumed that \(T_{\gamma ,\delta }(h(x),y(x))=0\). The above-mentioned loss function is worth considering because it is more general than the loss functions that are usually applied to build a multi-label classifier. Namely, using this loss function, it is possible to express such loss functions as \(F_{\beta }(h(x),y(x))\), the Jaccard loss \(J(h(x),y(x))\) or zero-one loss \(Z\,(h(x),y(x))\):

$$\begin{aligned} F_{1}(h(x),y(x))= & {} \,T_{\gamma =0.5,\delta =0.5}(h(x),y(x))\\ F_{\beta }(h(x),y(x))= & {} \,T_{\gamma =\frac{1}{1+\beta ^{2}},\delta =\frac{\beta ^{2}}{1+\beta ^{2}}}(h(x),y(x))\\ J(h(x),y(x))= & {} \,T_{\gamma =1,\delta =1}(h(x),y(x))\\ Z(h(x),y(x))= & {}\, T_{\gamma \rightarrow \infty ,\delta \rightarrow \infty }(h(x),y(x)) \end{aligned}$$

It is also possible to express such measures as false discovery rate \(\mathrm {FDR}(h(x),y(x))\) and false negative rate \(\mathrm {FNR}(h(x),y(x))\):

$$\begin{aligned} \mathrm {FDR}(h(x),y(x))=\, & {} T_{\gamma =1,\delta =0}(h(x),y(x))\\ \mathrm {FNR}(h(x),y(x))=\, & {} T_{\gamma =0,\delta =1}(h(x),y(x))\\ \end{aligned}$$
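The relations between the Tversky loss and its special cases can be checked numerically. The sketch below implements (3) directly, assuming \(\gamma >0\) and \(\delta >0\) so that the denominator is positive whenever at least one of the vectors is non-zero; the helper `max_gap` is a hypothetical utility that compares two losses over all pairs of binary vectors of a given length.

```python
from itertools import product

def tversky_loss(h, y, gamma, delta):
    """Tversky loss of Eq. (3) for binary sequences h (prediction) and y (truth)."""
    tp = sum(a * b for a, b in zip(h, y))    # dot product h . y
    nh, ny = sum(h), sum(y)                  # L1 norms of binary vectors
    if nh == 0 and ny == 0:
        return 0.0                           # convention for two empty label sets
    eta = 1.0 - gamma - delta
    return 1.0 - tp / (gamma * nh + delta * ny + eta * tp)

def f_beta_loss(h, y, beta):
    """F_beta loss written with the same dot product and norms."""
    tp = sum(a * b for a, b in zip(h, y))
    nh, ny = sum(h), sum(y)
    if nh == 0 and ny == 0:
        return 0.0
    return 1.0 - (1 + beta**2) * tp / (beta**2 * ny + nh)

def jaccard_loss(h, y):
    tp = sum(a * b for a, b in zip(h, y))
    union = sum(h) + sum(y) - tp
    return 0.0 if union == 0 else 1.0 - tp / union

def max_gap(loss_a, loss_b, L=4):
    """Largest absolute difference between two losses over all (h, y) pairs."""
    vectors = list(product((0, 1), repeat=L))
    return max(abs(loss_a(h, y) - loss_b(h, y)) for h in vectors for y in vectors)

beta = 2.0
gap_f = max_gap(lambda h, y: tversky_loss(h, y, 1 / (1 + beta**2), beta**2 / (1 + beta**2)),
                lambda h, y: f_beta_loss(h, y, beta))
gap_j = max_gap(lambda h, y: tversky_loss(h, y, 1.0, 1.0), jaccard_loss)
print(gap_f, gap_j)
```

Both gaps should vanish up to floating-point rounding, which is consistent with the parameter settings listed above.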

The differences between the aforementioned loss functions become clearer when we express them using the binary confusion matrix resulting from the comparison of two binary strings \(h(x)\) and \(y(x)\) (Table 1). Entries of the matrix are defined as follows:

$$\begin{aligned} \mathrm {TP}= & {} \sum _{i=1}^{L}h_{i}(x)y_{i}(x)\end{aligned}$$
(4)
$$\begin{aligned} \mathrm {TN}= & {} \sum _{i=1}^{L}\left[ 1-h_{i}(x)\right] \left[ 1-y_{i}(x)\right] \end{aligned}$$
(5)
$$\begin{aligned} \mathrm {FP}= & \sum _{i=1}^{L}h_{i}(x)\left[ 1-y_{i}(x)\right] \end{aligned}$$
(6)
$$\begin{aligned} \mathrm {FN}= &\sum _{i=1}^{L}\left[ 1-h_{i}(x)\right] y_{i}(x). \end{aligned}$$
(7)

Then, the measures can be rewritten:

$$Z\left( {h\left( x \right),y\left( x \right)} \right) = \left[\kern-0.15em\left[ {FN + FP \ge 1} \right]\kern-0.15em\right]$$
(8)
$$\begin{aligned} F_{\beta }(h(x),y(x))= \frac{\beta ^2FN + FP}{(1+\beta ^{2})TP + \beta ^{2}FN + FP} \end{aligned}$$
(9)
$$\begin{aligned} J(h(x),y(x))= \frac{FN + FP}{TP + FN + FP}\end{aligned}$$
(10)
$$\begin{aligned} T_{\gamma ,\delta }(h(x),y(x))= \frac{\delta FN + \gamma FP}{TP +\delta FN + \gamma FP}, \end{aligned}$$
(11)

where \(\left[\kern-0.15em\left[ \cdot \right]\kern-0.15em\right]\) is the Iverson bracket.
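The confusion-matrix formulation can be illustrated with a short sketch. The function names and the example vectors are hypothetical; the zero-one loss is taken to be 1 whenever at least one label is predicted incorrectly, and \(\gamma ,\delta >0\) is assumed.

```python
def confusion_counts(h, y):
    """Entries of the confusion matrix, Eqs. (4)-(7), for binary sequences."""
    tp = sum(a * b for a, b in zip(h, y))
    tn = sum((1 - a) * (1 - b) for a, b in zip(h, y))
    fp = sum(a * (1 - b) for a, b in zip(h, y))
    fn = sum((1 - a) * b for a, b in zip(h, y))
    return tp, tn, fp, fn

def tversky_from_counts(tp, fp, fn, gamma, delta):
    """Tversky loss in the form of Eq. (11); 0 when both label sets are empty."""
    num = delta * fn + gamma * fp
    den = tp + num
    return 0.0 if den == 0 else num / den

def zero_one_from_counts(fp, fn):
    """Zero-one loss: 1 as soon as any label is wrong."""
    return float(fn + fp >= 1)

h = (1, 1, 0, 0, 1)   # hypothetical prediction
y = (1, 0, 0, 1, 1)   # hypothetical ground truth
tp, tn, fp, fn = confusion_counts(h, y)
jaccard = tversky_from_counts(tp, fp, fn, 1.0, 1.0)
asymmetric = tversky_from_counts(tp, fp, fn, 10.0, 1.0)
print(tp, tn, fp, fn, jaccard, asymmetric)
```

With \(\gamma =10\) and \(\delta =1\), the single false positive in this example contributes ten times as much to the numerator as the single false negative, which illustrates how the weights penalize the two error types asymmetrically.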

Table 1 Confusion matrix resulting from comparison of two binary strings \(h(x)\) and \(y(x)\)

3.2 Bayes classifier for the \(F_{\beta }\) loss

In this section, we describe the original method proposed by Dembczyński [8]. The method is based upon the inner–outer framework which was proven to be an efficient way to find the optimal classifier tailored for the \(F_{\beta }\) loss [28]:

$$\begin{aligned} F_{\beta }(h(x),y(x))= & {} 1 - \dfrac{(1+\beta ^2){h(x)}\cdot {y(x)} }{\beta ^2\left\|y(x)\right\|_1 + \left\| h(x)\right\|_1 }. \end{aligned}$$
(12)

In the case of the \(F_{\beta }\) loss, the Bayes classifier (2) is given by

$$\begin{aligned} h^{*}(x)= & {} \arg \!\min _{h\in {\mathcal {Y}}}\sum _{y\in {\mathcal {Y}}}\left( 1 - \dfrac{(1+\beta ^2){h}\cdot {y}}{\beta ^2 \left\|y\right\|_1 + \left\|h\right\|_1}\right) P(y|x) \end{aligned}$$
(13)

Next, the posed problem (13) is solved via inner and outer optimization. In order to perform the inner optimization, the space of all possible solutions \({\mathcal {Y}}\) is partitioned into \(L+1\) non-overlapping equivalence classes. Each equivalence class contains binary vectors \(h\in {\mathcal {Y}}\) with the number of ones equal to \(\mathrm {K}=\left\|h\right\|_1\). We denote the equivalence class by \({\mathcal {H}}_{\mathrm {K}} = \left\{ h:h\in {\mathcal {Y}},\left\|h\right\|_1=\mathrm {K}\right\}\), where \(\mathrm {K}\in \left\{ 0, 1,\dots , L \right\}\). Analogously, the partitioning into equivalence classes can also be applied to the vector \(y\) from Eq. (13). The number of ones in this vector is denoted by \(\mathrm {S}=\left\|y\right\|_1\). Then, the optimization problem can be solved for each of the equivalence classes separately. Before we define the inner optimization problem, let us introduce the subset of \({\mathcal {Y}}\) containing the vectors whose number of ones equals \(s\):

$$\begin{aligned} {\mathcal {S}}_{s} = \left\{ y\in {\mathcal {Y}}, y\,:\,\left\|y\right\|_1=s \right\} . \end{aligned}$$
(14)

Then, for each equivalence class \({\mathcal {H}}_{\mathrm {K}}\) an optimal prediction is described by:

$$\begin{aligned} h^{*}(x)_{\mathrm {K}}= & {} \arg \!\min _{h\in {\mathcal {H}}_{\mathrm {K}}} \sum _{ {\mathrm {S}=0} }^{L}\sum _{ { y\in {\mathcal {S}}_{\mathrm {S}}} }\,\left( P(y|x) -\dfrac{(1+\beta ^{2})P(y|x)}{\beta ^2\mathrm {S}+ \mathrm {K}}\sum _{i=1}^{L}h_{i}y_{i}\right) \end{aligned}$$
(15)
$$\begin{aligned}= & {} \arg \!\max _{h\in {\mathcal {H}}_{\mathrm {K}}} \sum _{\mathrm {S}=1}^{L}\dfrac{(1+\beta ^{2})}{\beta ^2\mathrm {S}+ \mathrm {K}}\sum _{ y\in {\mathcal {S}}_{\mathrm {S}} }P(y|x)\sum _{i=1}^{L}h_{i}y_{i} \end{aligned}$$
(16)

After that, by swapping the sums in (16) and dropping the constant factor \((1+\beta ^{2})\), the authors obtain

$$\begin{aligned} h^{*}(x)_{\mathrm {K}}= & {} \arg \!\max _{ h\in {\mathcal {H}}_{\mathrm {K}} }\sum _{i=1}^{L}h_{i}\sum _{\mathrm {S}=1}^{L}\quad\dfrac{P(y_{i}=1,\left\|y\right\|_1=\mathrm {S}|x)}{\beta ^2\mathrm {S}+ \mathrm {K}}\nonumber \\= & {} \arg \!\max _{ h\in {\mathcal {H}}_{\mathrm {K}} }\sum _{i=1}^{L}h_{i}\sum _{\mathrm {S}=1}^{L}\dfrac{P(y_{i}=1|x)P(\left\|y\right\|_1=\mathrm {S}|y_{i}=1,x)}{\beta ^2\mathrm {S}+ \mathrm {K}}\nonumber \\= & {} \arg \!\max _{h\in {\mathcal {H}}_{\mathrm {K}}}\sum _{i=1}^{L}h_{i}\Delta (i,\mathrm {K}). \end{aligned}$$
(17)

where

$$\begin{aligned} P(y_{i}=1|x)=\sum _{y\in {\mathcal {Y}}:y_{i}=1}P(y|x) \end{aligned}$$
(18)

is the probability that \(i\mathrm {-th}\) bit is set given \(x\) and

$$\begin{aligned} P(\left\|y\right\|_1=\mathrm {S}|y_{i}=1,x) = \sum _{y\in {\mathcal {S}}_{\mathrm {S}}}P(y|x,y_i=1) \end{aligned}$$
(19)

is the probability that the number of ones in vector \(y\) is \(\mathrm {S}\) given \(y_{i}=1\) and \(x\). Here \(\Delta (i,\mathrm {K})\) denotes the inner sum over \(\mathrm {S}\) in (17). Since we must set \(h_{i}=1\) for exactly \(\mathrm {K}\) positions, the optimization problem can be solved optimally by setting \(h_{i}=1\) for the \(\mathrm {K}\) positions with the largest values of \(\Delta (i,\mathrm {K})\).

To perform the outer optimization, it is also necessary to calculate the expected loss related to the previously determined solution \({\mathbb {E}}_{Y|X=x}\left[ F_{\beta }(h^{*}(x)_{\mathrm {K}},Y)\right]\):

$$\begin{aligned} {\mathbb {E}}_{Y|X=x}\left[ F_{\beta }(h^{*}(x)_{\mathrm {K}},Y)\right]= & {} 1 - (1+\beta ^{2})\sum _{i=1}^{L}h^{*}_{i}(x)_{\mathrm {K}}\Delta (i,\mathrm {K}). \end{aligned}$$
(20)

Additionally, for \(\mathrm {K}=0\) the prediction is the zero vector, whose loss equals 0 when \(\left\|y\right\|_1=0\) and 1 otherwise, so the expected loss associated with the inner optimization task is:

$$\begin{aligned} {\mathbb {E}}_{Y|X=x}\left[ F_{\beta }(h^{*}(x)_{0},Y)\right]= & {} 1-\sum _{ \begin{array}{c} y\in {\mathcal {Y}}\\ y:\left\|y\right\|_1 = 0 \end{array} }P(y|x)\nonumber \\= & {}\, 1-P(\left\|y\right\|_1=0 | x) \end{aligned}$$
(21)

Finally, the outer minimization finds the best prediction among the predictions produced for each equivalence class \({\mathcal {H}}_{\mathrm {K}}\):

$$\begin{aligned} {h}^{*}({x}) = \quad {\arg \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\min _{{ {h}\in \left\{ h^{*}({x})_{0}, h^{*}({x})_{1},\dots ,h^{*}({x})_{L} \right\} } }\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! {\mathbb {E}}_{Y|X= {x}} \, \left[ F_{\beta }({h},Y)\right] } \end{aligned}$$
(22)

and its solution is found by checking all possible \(L+1\) (\(\mathrm {K}\in \{0,1,2,\ldots ,L \}\)) expected values.
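The inner and outer steps can be sketched as follows, under the simplifying assumption that the posterior \(P(y|x)\) is given as an explicit table rather than estimated. All function names are hypothetical. The candidate for \(\mathrm {K}=0\) is scored with \(1-P(\left\|y\right\|_1=0|x)\), which follows from the convention that the loss of two empty label sets is 0. The exhaustive search is included only to check the result on small \(L\).

```python
import random
from itertools import product

def f_beta_loss(h, y, beta):
    tp = sum(a * b for a, b in zip(h, y))
    nh, ny = sum(h), sum(y)
    if nh == 0 and ny == 0:
        return 0.0
    return 1.0 - (1 + beta**2) * tp / (beta**2 * ny + nh)

def inner_outer_f_beta(posterior, L, beta):
    """Inner-outer minimization of the expected F_beta loss."""
    # joint[i][s] = P(y_i = 1, ||y||_1 = s | x)
    joint = [[0.0] * (L + 1) for _ in range(L)]
    p_zero = 0.0
    for y, p in posterior.items():
        s = sum(y)
        if s == 0:
            p_zero += p
        for i, y_i in enumerate(y):
            if y_i:
                joint[i][s] += p
    # Candidate for K = 0: the zero vector is wrong unless y is also empty.
    best_h, best_risk = (0,) * L, 1.0 - p_zero
    for K in range(1, L + 1):
        # Delta(i, K): inner sum over S in Eq. (17)
        Delta = [sum(joint[i][s] / (beta**2 * s + K) for s in range(1, L + 1))
                 for i in range(L)]
        # Inner step: pick the K labels with the largest Delta(i, K)
        top = sorted(range(L), key=lambda i: Delta[i], reverse=True)[:K]
        risk = 1.0 - (1 + beta**2) * sum(Delta[i] for i in top)   # Eq. (20)
        if risk < best_risk:                                      # outer step, Eq. (22)
            best_h = tuple(int(i in top) for i in range(L))
            best_risk = risk
    return best_h, best_risk

def brute_force_risk(posterior, L, beta):
    """Minimal expected loss found by enumerating all 2**L predictions."""
    return min(sum(f_beta_loss(h, y, beta) * p for y, p in posterior.items())
               for h in product((0, 1), repeat=L))

def random_posterior(L, seed):
    """Hypothetical test distribution over {0,1}^L with a fixed seed."""
    rng = random.Random(seed)
    vectors = list(product((0, 1), repeat=L))
    weights = [rng.random() for _ in vectors]
    total = sum(weights)
    return {y: w / total for y, w in zip(vectors, weights)}

post = random_posterior(4, seed=0)
h_star, risk_star = inner_outer_f_beta(post, 4, beta=1.0)
print(h_star, abs(risk_star - brute_force_risk(post, 4, 1.0)) < 1e-9)
```

Apart from building the table `joint`, which here requires a pass over the explicit posterior, the procedure evaluates only \(L+1\) candidates, each obtained by sorting \(L\) scores, instead of \(2^{L}\) predictions.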

3.3 Proposed method

In this subsection, an approximated method of building decision-theoretic classifier for the Tversky measure is introduced. We begin with considering the Bayes classifier for the \(T_{\gamma ,\delta }\):

$$\begin{aligned} \bar{h}^{*}(x)= & {} \arg \!\min _{ h\in {\mathcal {Y}} }\sum _{y\in {\mathcal {Y}}}\left( 1 - \dfrac{{h}\cdot { y} }{{\delta \left\|y\right\|_1 + \gamma \left\|h\right\|_1}{ + \eta {h}\cdot {y}}}\right) P(y|x) \end{aligned}$$
(23)

After transformation analogous to (15), we obtain:

$$\begin{aligned} \bar{h}^{*}(x)_{\mathrm {K}}= & {} \arg \!\min _{ h\in {\mathcal {H}}_{\mathrm {K}} } \sum _{\mathrm {S}=0}^{L}\sum _{y\in {\mathcal {S}}_{\mathrm {S}}}\left( P(y|x) -\dfrac{P(y|x)\sum _{i=1}^{L}h_{i}y_{i}}{\delta \mathrm {S}+ \gamma \mathrm {K}+ \eta {h}\cdot {y} } \right) \nonumber \\= & {} \arg \!\max _{ h\in {\mathcal {H}}_{\mathrm {K}} }\sum _{\mathrm {S}=1}^{L}\sum _{y\in {\mathcal {S}}_{\mathrm {S}}}\dfrac{P(y|x)\sum _{i=1}^{L}h_{i}y_{i}}{\delta \mathrm {S}+ \gamma \mathrm {K}+ \eta {h}\cdot {y}}. \end{aligned}$$
(24)

Since a transformation analogous to (17) cannot eliminate the term \(\eta {h}\cdot {y}\) from the denominator of (24), it is impossible to directly perform the optimization procedure described in the previous subsection. In order to apply the aforementioned procedure, one can simply remove the term, but this simplification can lead to a coarse approximation. In this paper, we propose a more accurate approximation that does not induce a significant increase in computational complexity.

We start with a simple observation that, for given \(\mathrm {K}\), \(\mathrm {S}\) and \(x\), each term of the expected loss function (15) is a discrete linear function of \({h}\cdot {y}\). This suggests that the Tversky expected loss can be efficiently approximated using linear functions of the same form:

$$\begin{aligned} g(\mathrm {K},\mathrm {S},x,{h}\cdot {y}=z)= & \, d(\mathrm {K},\mathrm {S})\sum _{ y\in {\mathcal {S}}_{\mathrm {S}} }P(y|x)-c(\mathrm {K},\mathrm {S})\sum _{ y\in {\mathcal {S}}_{\mathrm {S}} }P(y|x)z. \end{aligned}$$
(25)

For the sake of simplicity, let us introduce the following notation:

$$\begin{aligned} l\,(\mathrm {K},\mathrm {S},x,{h}\cdot {y}=z)= & {} \sum _{ y\in {\mathcal {S}}_{\mathrm {S}} }P(y|x) -\sum _{ y\in {\mathcal {S}}_{\mathrm {S}} }\dfrac{P(y|x)z}{\delta \mathrm {S}+ \gamma \mathrm {K}+ \eta z},\end{aligned}$$
(26)
$$\begin{aligned} \delta \mathrm {S}+ \gamma \mathrm {K}= & {} \,R. \end{aligned}$$
(27)

Now, please note that for fixed L, \(\mathrm {K}\) and \(\mathrm {S}\), the term \({h}\cdot {y}\) is bounded as follows:

$$\begin{aligned} {h}\cdot {y}\le & {} \min (\mathrm {K},\mathrm {S})=b_{u}(\mathrm {K},\mathrm {S}) \end{aligned}$$
(28)
$$\begin{aligned} {h}\cdot {y}\ge & {} \max (0,\mathrm {K}+\mathrm {S}-L)=b_{l}(\mathrm {K},\mathrm {S}). \end{aligned}$$
(29)

In the interval defined by \(b_{l}(\mathrm {K},\mathrm {S})\) and \(b_{u}(\mathrm {K},\mathrm {S})\), the original loss function can be approximated using a linear function (25) whose parameters \(c(\mathrm {K},\mathrm {S})\) and \(d(\mathrm {K},\mathrm {S})\) are calculated from the following system of linear equations

$$\begin{aligned} \left\{ \begin{array}{ccc} g(\mathrm {K},\mathrm {S},x,b_{u}(\mathrm {K},\mathrm {S}))&{}=&{}l(\mathrm {K},\mathrm {S},x,b_{u}(\mathrm {K},\mathrm {S})),\\ g(\mathrm {K},\mathrm {S},x,b_{l}(\mathrm {K},\mathrm {S}))&{}=&{}l(\mathrm {K},\mathrm {S},x,b_{l}(\mathrm {K},\mathrm {S})) \end{array} \right. \end{aligned}$$
(30)

which gives (details are shown in "Appendix 2.1")

$$\begin{aligned} \left\{ \begin{array}{l} c(\mathrm {K},\mathrm {S})=\dfrac{R}{(R+\eta b_{u}(\mathrm {K},\mathrm {S}))(R+\eta b_{l}(\mathrm {K},\mathrm {S}))},\\ d(\mathrm {K},\mathrm {S})=1-\dfrac{\eta b_{u}(\mathrm {K},\mathrm {S})b_{l}(\mathrm {K},\mathrm {S})}{(R+\eta b_{u}(\mathrm {K},\mathrm {S}))(R+\eta b_{l}(\mathrm {K},\mathrm {S}))}. \end{array} \right. \end{aligned}$$
(31)
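The bounds (28), (29) and the coefficients of the linear function can be computed as in the sketch below. It assumes \(\gamma ,\delta >0\) and \(\mathrm {K}+\mathrm {S}\ge 1\), so that the denominators are non-zero, and it uses \(R\) in both factors of the denominator of \(c(\mathrm {K},\mathrm {S})\), which is what solving system (30) yields. The function names and the parameter values \(L=20\), \(\mathrm {K}=\mathrm {S}=15\), \(\gamma =10\), \(\delta =1\) are chosen for illustration only.

```python
def bounds(K, S, L):
    """Lower and upper bound on h . y, Eqs. (29) and (28)."""
    return max(0, K + S - L), min(K, S)

def linear_coefficients(K, S, L, gamma, delta):
    """Slope c(K, S) and intercept d(K, S) of the linear approximation."""
    eta = 1.0 - gamma - delta
    R = delta * S + gamma * K
    b_l, b_u = bounds(K, S, L)
    denom = (R + eta * b_u) * (R + eta * b_l)
    return R / denom, 1.0 - eta * b_u * b_l / denom

def exact_term(z, K, S, gamma, delta):
    """Per-vector factor 1 - z / (R + eta z) of the exact loss in Eq. (26)."""
    eta = 1.0 - gamma - delta
    return 1.0 - z / (delta * S + gamma * K + eta * z)

L, K, S, gamma, delta = 20, 15, 15, 10.0, 1.0
b_l, b_u = bounds(K, S, L)
c, d = linear_coefficients(K, S, L, gamma, delta)
for z in range(b_l, b_u + 1):
    # exact loss versus its linear approximation d - c z
    print(z, round(exact_term(z, K, S, gamma, delta), 4), round(d - c * z, 4))
```

By construction, the two columns coincide at \(z=b_{l}\) and \(z=b_{u}\); the values in between show how far the chord deviates from the exact loss inside the interval.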

Example 1

Let us show an example of approximating the \(T_{\gamma =10,\delta =1}\) loss using linear functions. Throughout this example, we assume that \(L=20\), \(\mathrm {K}=15\) and \(\mathrm {S}=15\). For this case, the bounds of \({h}\cdot {y}\) are \(b_{u}(\mathrm {15},\mathrm {15})=15\) and \(b_{l}(\mathrm {15},\mathrm {15})=10\). We therefore construct a linear approximation of \(T_{\gamma =10,\delta =1}\) in the interval \(\left[ 10;15\right]\). This approximation is presented in Fig. 1. As we can see, under such circumstances, the difference between our approximation and the approximation obtained using the Dembczyński approach is substantial (red and green lines).

Fig. 1 An example of the Tversky (\(T_{\gamma =10,\delta =1}\)) measure approximation. The measure is expressed as a function of the number of true positives (\({h}\cdot {y}\)), while the number of labels is fixed to \(L\) and the numbers of relevant labels in \(h\) and \(y\) are fixed to \(\mathrm {K}\) and \(\mathrm {S}\), respectively. The approximation interval computed according to (28) and (29) is \([10;15]\)

Finally, we define an approximated inner classifier:

$$\begin{aligned} \tilde{h}^{*}(x)_{\mathrm {K}}= & {} \arg \!\min _{ h\in {\mathcal {H}}_{\mathrm {K}} } \sum _{\mathrm {S}=0}^{L}\sum _{y\in {\mathcal {S}}_{\mathrm {S}}}\left( d(\mathrm {K},\mathrm {S}) - c(\mathrm {K},\mathrm {S})\sum _{i=1}^{L}h_{i}y_{i}\right) P(y|x) \end{aligned}$$
(32)
$$\begin{aligned}= & {} \arg \!\min _{ h\in {\mathcal {H}}_{\mathrm {K}} }\left\{ \sum _{\mathrm {S}=0}^{L}\sum _{y\in {\mathcal {S}}_{\mathrm {S}}}P(y|x)d(\mathrm {K},\mathrm {S}) - \sum _{\mathrm {S}=1}^{L}\sum _{ y\in {\mathcal {S}}_{\mathrm {S}} }P(y|x)c(\mathrm {K},\mathrm {S})\sum _{i=1}^{L}h_{i}y_{i} \right\} . \end{aligned}$$
(33)

Since the term \(d(\mathrm {K},\mathrm {S})\) does not depend on \(h\) but only on \(\mathrm {S}\) and on \(\mathrm {K}\), which is constant in each inner optimization problem, the first sum in (33) can be dropped:

$$\begin{aligned} \tilde{h}^{*}(x)_{\mathrm {K}}= & {} \arg \!\max _{ h\in {\mathcal {H}}_{\mathrm {K}} }\sum _{\mathrm {S}=0}^{L}\sum _{ y\in {\mathcal {S}}_{\mathrm {S}} }P(y|x)c(\mathrm {K},\mathrm {S})\sum _{i=1}^{L}h_{i}y_{i}, \end{aligned}$$
(34)

which can be efficiently solved via the inner optimization approach proposed by Dembczyński:

$$\begin{aligned} \tilde{h}^{*}(x)_{\mathrm {K}}= & {} \arg \!\max _{ h\in {\mathcal {H}}_{\mathrm {K}} }\sum _{i=1}^{L}h_{i}\sum _{\mathrm {S}=1}^{L}\nonumber \\& c(\mathrm {K},\mathrm {S}){P(y_{i}=1|x)P(\left\|y\right\|_1=\mathrm {S}|y_{i}=1,x)}\nonumber \\= & {} \arg \!\max _{h\in {\mathcal {H}}_{\mathrm {K}}}\sum _{i=1}^{L}h_{i}\tilde{\Delta }(i,\mathrm {K}). \end{aligned}$$
(35)

The expected loss related to the determined classifier is:

$$\begin{aligned} {\mathbb {E}}_{Y|X=x}\left[ T_{\gamma ,\delta }(\tilde{h}^{*}(x)_{\mathrm {K}},Y)\right]= & {} \sum _{\mathrm {S}=0}^{L}P(\left\|y\right\|_1=\mathrm {S}|x)d(\mathrm {K},\mathrm {S})- \sum _{i=1}^{L}\tilde{h}^{*}_{i}(x)_{\mathrm {K}}\tilde{\Delta }(i,\mathrm {K}), \end{aligned}$$
(36)

where \(\tilde{\Delta }(i,\mathrm {K})\) is defined in a way analogous to the transformation applied in (17). As a consequence, the outer optimization algorithm does not differ from the procedure described in the previous subsection. That is, since the Tversky loss of the zero vector is 0 when \(\left\|y\right\|_1=0\) and 1 otherwise, the expected loss for \(\mathrm {K}=0\) is:

$$\begin{aligned} {\mathbb {E}}_{Y|X=x}\left[ T_{\gamma ,\delta }(h^{*}(x)_{0},Y)\right]= & {} 1-P(\left\|y\right\|_1=0 | x), \end{aligned}$$
(37)

and outer optimization is performed according to:

$$\begin{aligned} \tilde{{h}}^{*}({x}) = \quad {\arg \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\min _{{ {h}\in \left\{ \tilde{h}^{*}({x})_{0}, \tilde{h}^{*}({x})_{1},\dots ,\tilde{h}^{*}({x})_{L} \right\} } }\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!{\mathbb {E}}_{Y|X={x}} \left[ T_{\gamma ,\delta }(h,Y)\right] } \end{aligned}$$
(38)
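The complete approximate procedure can be sketched as follows, again assuming that the posterior \(P(y|x)\) is given as an explicit table and that \(\gamma ,\delta >0\). The function names and the toy posterior are hypothetical. The candidate for \(\mathrm {K}=0\) is scored with \(1-P(\left\|y\right\|_1=0|x)\), in line with the convention \(T_{\gamma ,\delta }=0\) for two empty label sets.

```python
from itertools import product

def tversky_loss(h, y, gamma, delta):
    tp = sum(a * b for a, b in zip(h, y))
    nh, ny = sum(h), sum(y)
    if nh == 0 and ny == 0:
        return 0.0
    return 1.0 - tp / (gamma * nh + delta * ny + (1.0 - gamma - delta) * tp)

def coefficients(K, S, L, gamma, delta):
    """Slope c(K, S) and intercept d(K, S) of the linear approximation (K >= 1)."""
    eta = 1.0 - gamma - delta
    R = delta * S + gamma * K
    b_l, b_u = max(0, K + S - L), min(K, S)
    denom = (R + eta * b_u) * (R + eta * b_l)
    return R / denom, 1.0 - eta * b_u * b_l / denom

def approx_tversky_classifier(posterior, L, gamma, delta):
    """Inner-outer minimization of the linearly approximated Tversky loss."""
    joint = [[0.0] * (L + 1) for _ in range(L)]   # P(y_i = 1, ||y||_1 = S | x)
    p_size = [0.0] * (L + 1)                      # P(||y||_1 = S | x)
    for y, p in posterior.items():
        s = sum(y)
        p_size[s] += p
        for i, y_i in enumerate(y):
            if y_i:
                joint[i][s] += p
    best_h, best_risk = (0,) * L, 1.0 - p_size[0]   # candidate for K = 0
    for K in range(1, L + 1):
        cd = [coefficients(K, S, L, gamma, delta) for S in range(L + 1)]
        # Approximate score of label i: sum over S of c(K, S) P(y_i = 1, ||y||_1 = S | x)
        Delta = [sum(cd[S][0] * joint[i][S] for S in range(1, L + 1)) for i in range(L)]
        top = sorted(range(L), key=lambda i: Delta[i], reverse=True)[:K]
        # Approximate expected loss, Eq. (36)
        risk = sum(p_size[S] * cd[S][1] for S in range(L + 1)) - sum(Delta[i] for i in top)
        if risk < best_risk:
            best_h, best_risk = tuple(int(i in top) for i in range(L)), risk
    return best_h, best_risk

def expected_loss(h, posterior, gamma, delta):
    return sum(tversky_loss(h, y, gamma, delta) * p for y, p in posterior.items())

def brute_force_risk(posterior, L, gamma, delta):
    return min(expected_loss(h, posterior, gamma, delta)
               for h in product((0, 1), repeat=L))

# Hypothetical posterior over L = 3 labels.
posterior = {(0, 0, 0): 0.10, (1, 0, 0): 0.20, (1, 1, 0): 0.30,
             (0, 1, 1): 0.15, (1, 1, 1): 0.25}
h_tilde, risk_tilde = approx_tversky_classifier(posterior, 3, gamma=10.0, delta=1.0)
print(h_tilde, risk_tilde, brute_force_risk(posterior, 3, 10.0, 1.0))
```

For \(L\le 3\) every interval \([b_{l};b_{u}]\) contains at most two integers, so the linear function interpolates the loss exactly and the result should coincide with exhaustive search. For larger \(L\) the returned risk is only an approximation of the true expected loss of the selected prediction.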

3.4 Plug-in rule classifier for the \(T_{\gamma ,\delta }\) loss

The previous section describes the Bayes classifier tailored for the \(T_{\gamma ,\delta }\) loss function. The description assumes that all considered probability distributions are known, but in real-life classification tasks this assumption does not hold. In such situations, an approach referred to as the plug-in rule classifier [31] can be employed. The plug-in rule approach consists in estimating the unknown probabilities on the basis of a training set

$$\begin{aligned} {\mathcal {D}}= \left\{ \left( {x}^{(1)},{y}^{(1)} \right) ,\left( {x}^{(2)},{y}^{(2)} \right) ,\dots ,\left( {x}^{(N)},{y}^{(N)} \right) \right\} \end{aligned}$$
(39)

and then plugged into the formula of the Bayes classifier.

The above-defined Bayes classifier requires \(L^{2}+2L\) probabilities to be estimated, namely:

  • \(L^{2}\) for \(P(\left\|y\right\|_1=\mathrm {S}|y_{i}=1,x)\);

  • L for \(P(y_{i}=1|x)\);

  • and L for \(P(\left\|y\right\|_1=\mathrm {S}|x)\).

Calculation of these values can be efficiently done by employing a set of \(2L + 1\) multinomial regression models or classifiers (only classifiers that return an estimate of the posterior probability are considered).

Probabilities \(P(\left\|y\right\|_1=\mathrm {S}|y_{i}=1,x)\) are estimated using L models. The first step in building the models is to construct sets \({\mathcal {D}}_{i}\) that contain only the objects for which \(y^{(j)}_{i}=1\). After that, a group of estimators is learned on the sets transformed in the following way

$$\begin{aligned} \left( x^{(j)},y^{(j)}\right) \in {\mathcal {D}}_{i} \mapsto \left( x^{(j)},\left\|y^{(j)}\right\|_1\right) \,\forall \quad j\in \{1,2,\dots ,|{\mathcal {D}}|\}. \end{aligned}$$
(40)

Probabilities \(P(y_{i}=1|x)\) are modelled by applying the binary relevance transformation on the training set:

$$\begin{aligned} \left( x^{(j)},y^{(j)}\right) \in {\mathcal {D}} \mapsto \left( x^{(j)},{y^{(j)}_{i}}\right) . \end{aligned}$$
(41)

The transformation produces L one-vs-rest binary sets corresponding to each label.

Finally, we obtain an estimate of \(P(\left\|y\right\|_1=\mathrm {S}|x)\) by performing the transformation

$$\begin{aligned} \left( x^{(j)},y^{(j)}\right) \in {\mathcal {D}} \mapsto \left( x^{(j)},{{\left\|y^{(j)}\right\|_1}}\right) \end{aligned}$$
(42)

followed by learning of a related model.
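The three transformations (40)-(42) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `plug_in_training_sets` and the toy data are hypothetical, and the step of fitting a probabilistic classifier on each resulting set is left out.

```python
def plug_in_training_sets(X, Y):
    """Build the 2L + 1 single-label training sets described above.

    X: list of feature vectors; Y: list of 0/1 label vectors of length L.
    (Hypothetical helper, not the authors' implementation.)
    """
    L = len(Y[0])
    cardinality = [sum(y) for y in Y]  # ||y||_1 for every training object

    # Eq. (40): for each label i, keep only objects with y_i = 1 and use
    # the label-set size as a multi-class target.
    conditional_sets = [
        [(x, s) for x, y, s in zip(X, Y, cardinality) if y[i] == 1]
        for i in range(L)
    ]
    # Eq. (41): binary relevance, one one-vs-rest binary set per label.
    relevance_sets = [[(x, y[i]) for x, y in zip(X, Y)] for i in range(L)]
    # Eq. (42): a single set whose target is the label-set size.
    cardinality_set = list(zip(X, cardinality))
    return conditional_sets, relevance_sets, cardinality_set


# Toy data: three objects, three labels.
X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
Y = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
cond, rel, card = plug_in_training_sets(X, Y)
print(len(cond) + len(rel) + 1)  # number of models to train: 2L + 1
```

Each of the returned sets is an ordinary single-label problem, so any base learner that outputs posterior probabilities could be trained on it.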

3.5 System architecture

The learning and inference phases are described in Figs. 2 and 3.

Fig. 2 Pseudocode of the learning procedure

Fig. 3 Classification procedure for given \(x\)

4 Experimental setup

During the experimental study, the proposed method was compared to five state-of-the-art approaches. The most natural choice for a reference method is the one proposed by Dembczyński [8], upon which the developed approximation scheme is based. Additionally, we used another decision-theoretic approach introduced by Jansche [28]. We also employed an algorithm that follows the empirical utility maximization framework, namely the Pillai thresholding procedure [39]. We considered four versions of the aforesaid algorithms, which differ in the values of the parameters \(\gamma\) and \(\delta\) of the Tversky loss. The following parameter values were examined:

  • \(\left\{ \gamma =1; \,\delta =1 \right\}\) – corresponding to the well-known Jaccard measure [26] (known also as multi-label accuracy [49]) which is a common measure of multi-label classification quality.

  • \(\left\{ \gamma =10;\, \delta =10 \right\}\) – corresponding to the optimization of the zero-one loss measure. As stated previously, the Tversky measure ideally approximates the zero-one loss when \(\gamma \rightarrow \infty\) and \(\delta \rightarrow \infty\). However, preliminary experiments showed that increasing \(\gamma\) and \(\delta\) above 10 does not induce a significant improvement in classification quality measured by the zero-one loss.

  • \(\left\{ \gamma =1; \,\delta =10 \right\}\) and \(\left\{ \gamma =10;\, \delta =1 \right\}\) which are asymmetric variations of the above-mentioned quality criteria.

We did not consider the instantiation of algorithms with parameters set to \(\left\{ \gamma =0.5;\, \delta =0.5 \right\}\) because, as shown in Sect. 3.3, under such a setup the proposed algorithm is equivalent to the Dembczyński approach, and the performance of this method has been examined extensively [8]. Nevertheless, the Dembczyński algorithm can only be applied to minimize the \(F_{\beta }\) loss, so its parameter must be set in a way that gives the best approximation of the Tversky measure with the given parameters. Strictly speaking, the \(\beta\) parameter was set to \(\beta = \sqrt{\frac{\delta }{\gamma }}\). All of the algorithms were implemented using the Naïve Bayes classifier [23] as a base single-label model. We utilized the Naïve Bayes implementation from the WEKA framework [22]. The classifier parameters were set to their defaults.
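To make the four parameter settings concrete, the following sketch evaluates a Tversky loss and the matching \(\beta\) for one toy prediction. The Tversky loss is defined earlier in the paper and not repeated in this section, so the code assumes the common form \(1-\mathrm{TP}/(\mathrm{TP}+\gamma \,\mathrm{FP}+\delta \,\mathrm{FN})\), which reduces to the Jaccard loss for \(\gamma =\delta =1\) and to the \(F_{1}\) loss for \(\gamma =\delta =0.5\). The function names and the handling of two empty label sets are illustrative assumptions.

```python
import math


def tversky_loss(y_true, y_pred, gamma, delta):
    """Tversky loss between two 0/1 label vectors.

    Assumed form: 1 - TP / (TP + gamma * FP + delta * FN).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = tp + gamma * fp + delta * fn
    if denominator == 0:
        # Both label sets are empty; treated here as a perfect match (assumption).
        return 0.0
    return 1.0 - tp / denominator


def matching_beta(gamma, delta):
    """Beta used to approximate T_{gamma,delta} with an F_beta loss."""
    return math.sqrt(delta / gamma)


y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]  # one true positive, one false positive, one false negative
for gamma, delta in [(1, 1), (10, 10), (1, 10), (10, 1)]:
    loss = tversky_loss(y_true, y_pred, gamma, delta)
    print(gamma, delta, round(loss, 4), round(matching_beta(gamma, delta), 4))
```

Note that both symmetric settings map to \(\beta =1\), so under this rule the \(F_{\beta }\)-based reference method cannot distinguish between them.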

In the next section, the aforementioned approaches are numbered as follows:

  1. the proposed approach (Sects. 3.3 and 3.4),

  2. the Dembczyński algorithm [8],

  3. the Jansche approach [28],

  4. Pillai thresholding procedure [39],

  5. Structured SVM approach [39],

  6. the binary relevance method [49].

The performance of the above-mentioned classifiers was assessed using the following quality criteria: false discovery rate (FDR, \(1-\mathrm{precision}\)), false negative rate (FNR, \(1-\text {recall}\)) and the Tversky measure with parameters relevant to the investigated algorithms [33]. We employed example-based (denoted by Ex), macro-averaged (denoted by Ma) and micro-averaged (denoted by Mi) measures. The algorithms were also compared with respect to the Hamming loss [10] and the zero-one subset loss function.
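The difference between the three averaging schemes is easiest to see on a small example. The sketch below is illustrative only: the helper names are hypothetical, and returning 0 when a denominator is empty is an assumed convention that the text does not specify.

```python
def counts(true_bits, pred_bits):
    """Return (tp, fp, fn) for two equally long 0/1 sequences."""
    tp = sum(1 for t, p in zip(true_bits, pred_bits) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_bits, pred_bits) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_bits, pred_bits) if t == 1 and p == 0)
    return tp, fp, fn


def fdr(tp, fp, fn):
    # 1 - precision; 0 when nothing was predicted (assumed convention).
    return fp / (tp + fp) if tp + fp else 0.0


def fnr(tp, fp, fn):
    # 1 - recall; 0 when there are no relevant labels (assumed convention).
    return fn / (tp + fn) if tp + fn else 0.0


def example_based(measure, Y_true, Y_pred):
    """Average the measure over objects (rows)."""
    values = [measure(*counts(t, p)) for t, p in zip(Y_true, Y_pred)]
    return sum(values) / len(values)


def macro_averaged(measure, Y_true, Y_pred):
    """Average the measure over labels (columns)."""
    values = [measure(*counts(t, p)) for t, p in zip(zip(*Y_true), zip(*Y_pred))]
    return sum(values) / len(values)


def micro_averaged(measure, Y_true, Y_pred):
    """Pool all object-label pairs, then compute the measure once."""
    flat_true = [b for row in Y_true for b in row]
    flat_pred = [b for row in Y_pred for b in row]
    return measure(*counts(flat_true, flat_pred))


def hamming_loss(Y_true, Y_pred):
    pairs = [(t, p) for rt, rp in zip(Y_true, Y_pred) for t, p in zip(rt, rp)]
    return sum(1 for t, p in pairs if t != p) / len(pairs)


def zero_one_loss(Y_true, Y_pred):
    wrong = sum(1 for t, p in zip(Y_true, Y_pred) if list(t) != list(p))
    return wrong / len(Y_true)


Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 0], [0, 1, 0]]
print(example_based(fdr, Y_true, Y_pred))
print(macro_averaged(fdr, Y_true, Y_pred))
print(micro_averaged(fdr, Y_true, Y_pred))
print(hamming_loss(Y_true, Y_pred), zero_one_loss(Y_true, Y_pred))
```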

The experimental evaluation was conducted on 24 benchmark multi-label datasets related to seven different domains: text categorization – 7 datasets, image annotation – 6, bioinformatics – 4, audio samples recognition – 3, video annotation – 1, astronomy – 2 and environmental science – 1. The datasets are summarized in Table 2. The first column (named ‘No.’) shows the number of the dataset. This number is used to denote the set in the experimental section. The second column holds the names of the datasets and references to the related papers. The next three columns contain the number of instances in the dataset, the input space dimensionality and the number of labels, respectively. Another three columns provide more detailed information about dataset properties, that is, label cardinality (LC), label density (LD) and the number of unique label combinations (LU) [50]. To be more formal, the aforementioned properties are defined as follows:

$$\begin{aligned} \mathrm {LC}({\mathcal {D}})= & {}\, \dfrac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^{L}{y}_{j}^{(i)}, \end{aligned}$$
(43)
$$\begin{aligned} \mathrm {LD}({\mathcal {D}})= & {}\, \dfrac{\mathrm {LC}({\mathcal {D}})}{L}, \end{aligned}$$
(44)
$$\begin{aligned} \mathrm {LU}({\mathcal {D}})= & {}\, \left| \left\{ y:\exists \left( x^{(i)},y^{(i)} \right) \in {\mathcal {D}},y^{(i)}=y \right\} \right| . \end{aligned}$$
(45)
Table 2 Summary statistics of benchmark datasets
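Equations (43)-(45) translate directly into code. The following sketch computes the three statistics for a small, made-up label matrix; the function names are illustrative.

```python
def label_cardinality(Y):
    """LC, Eq. (43): average number of relevant labels per object."""
    return sum(sum(y) for y in Y) / len(Y)


def label_density(Y):
    """LD, Eq. (44): label cardinality divided by the number of labels L."""
    return label_cardinality(Y) / len(Y[0])


def unique_label_sets(Y):
    """LU, Eq. (45): number of distinct label vectors in the dataset."""
    return len({tuple(y) for y in Y})


# Toy label matrix: N = 4 objects, L = 3 labels.
Y = [[1, 0, 1], [0, 0, 1], [1, 0, 1], [1, 1, 1]]
print(label_cardinality(Y), label_density(Y), unique_label_sets(Y))
```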

During the dataset-preprocessing stage, we applied a few transformations to the datasets. First and foremost, all nominal attributes, except binary attributes, were converted into sets of binary variables. For example, a nominal feature with three possible values is transformed into a set of three binary variables, and each binary variable indicates the occurrence of the associated nominal value. This approach is one of the simplest methods of replacing nominal variables with binary variables [46].
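A minimal sketch of this conversion for a single nominal column is given below. The function name and the alphabetical ordering of the indicator variables are illustrative choices, not details taken from the experimental software.

```python
def binarize_nominal(column):
    """Replace one nominal column by one 0/1 indicator variable per value."""
    categories = sorted(set(column))  # fixed order of the new binary variables
    rows = [[1 if value == c else 0 for c in categories] for value in column]
    return categories, rows


# A nominal feature with three possible values becomes three binary variables.
categories, rows = binarize_nominal(["red", "green", "blue", "green"])
print(categories)
print(rows)
```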

In our study, we also employed datasets that follow the multi-instance multi-label (MIML) framework [55, 61]. In those datasets, each object consists of a bag of instances tagged with a set of labels. In order to handle this data, we followed the suggestion made in [62] and transformed each set into single-instance multi-label data. Namely, we built a between-bag distance matrix using the Hausdorff distance [42], and then constructed a set of new points in Euclidean space using multidimensional scaling [3].
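The first step of this transformation, the between-bag distance matrix, can be sketched as follows. The code assumes the classical (maximal) Hausdorff distance with Euclidean distance between instances; the text does not state which variant was used, so this choice is an assumption. The multidimensional scaling step, which would take the resulting matrix as input, is omitted.

```python
import math


def directed_hausdorff(bag_a, bag_b):
    """Largest distance from a point of bag_a to its nearest point in bag_b."""
    return max(min(math.dist(a, b) for b in bag_b) for a in bag_a)


def hausdorff(bag_a, bag_b):
    """Symmetric (maximal) Hausdorff distance between two bags of instances."""
    return max(directed_hausdorff(bag_a, bag_b), directed_hausdorff(bag_b, bag_a))


def distance_matrix(bags):
    """Between-bag distance matrix used as input to multidimensional scaling."""
    return [[hausdorff(p, q) for q in bags] for p in bags]


# Three toy bags of two-dimensional instances.
bags = [[(0, 0), (1, 0)], [(0, 0)], [(3, 4)]]
for row in distance_matrix(bags):
    print(row)
```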

We also used multi-target regression sets (Solar_flare1, Solar_flare2 and Water-quality), which were converted into multi-label data using a simple thresholding procedure. To be more precise, when the value of an output variable for a given object is greater than zero, the corresponding label is set to be relevant to this object. In the case of the Solar_flare sets, this transformation produces a multi-label set characterized by low label cardinality and density, and such a set should be considered a particularly troublesome case.
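The thresholding rule can be written in one line. The sketch below uses made-up target values and a hypothetical function name; it also shows how sparse targets lead to a label matrix with low density.

```python
def targets_to_labels(targets, threshold=0.0):
    """Mark a label as relevant when its regression target exceeds the threshold."""
    return [[1 if value > threshold else 0 for value in row] for row in targets]


# Made-up multi-target outputs for three objects.
targets = [[0, 2, 0], [1, 0, 0], [0, 0, 0]]
labels = targets_to_labels(targets)
density = sum(sum(row) for row in labels) / (len(labels) * len(labels[0]))
print(labels, density)
```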

The dimensionality of the input space for the Rcv1subsetX datasets had to be reduced because the full attribute set caused the experimental software to run out of memory. To remove unnecessary attributes, we followed a filtering procedure suggested previously in [51]. Namely, we applied a simple frequency filter with the minimal number of word occurrences set to 50. As a consequence, the number of attributes was reduced from about \(47\mathrm {k}\) to about \(1.8\mathrm {k}\).
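A frequency filter of this kind might look as follows. The sketch counts total word occurrences over the whole corpus; whether the original filter counted occurrences or documents is not stated, so this is an assumption, and the function name is illustrative. A small threshold is used so that the toy corpus produces a visible effect.

```python
from collections import Counter


def frequency_filter(documents, min_occurrences=50):
    """Return the words that occur at least min_occurrences times in the corpus."""
    totals = Counter(token for doc in documents for token in doc)
    return sorted(word for word, count in totals.items() if count >= min_occurrences)


# Toy corpus of tokenized documents.
documents = [["oil", "price", "rise"], ["oil", "market"], ["price", "oil"]]
kept = frequency_filter(documents, min_occurrences=2)
print(kept)
```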

Finally, training and testing sets were extracted using tenfold cross-validation. However, because of its large number of instances (nearly \(44\mathrm {k}\)), the number of cross-validation folds was reduced to three for the Mediamill dataset. Despite the reduced number of folds, the number of instances is large enough to provide a stable estimation of the classification quality criteria.

Statistical significance of obtained results was assessed using the Friedman test [18] and the post hoc Nemenyi test [11]. The corrected critical difference for the Nemenyi post hoc test is \(\mathrm {CD}_{\alpha =0.05}=1.59\). Additionally, we performed the Wilcoxon signed-rank test [11, 54]. For all tests, the significance level was set to \(\alpha =0.05\). To control family-wise error rates of the Wilcoxon testing procedure, the Holm approach of p value correction was employed [25].
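As an illustration of the Holm correction, the sketch below adjusts a family of five p values, one for each comparison of the proposed method against a reference method. The p values are made up for the example and the function name is hypothetical; this is the standard step-down procedure, not code from the experimental software.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment of a family of p values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005, 0.2]  # made-up Wilcoxon p values
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```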

5 Results and discussion

The experimental section is partitioned into two main subsections which contain the results related to the symmetric and asymmetric forms of the Tversky loss. In the first subsection, we consider the symmetric case of the Tversky loss function, i.e., the loss function is instantiated with parameters set to \(\gamma =\delta =1\) (Jaccard loss) and \(\gamma =\delta =10\) (zero-one loss). The summarized results related to this scenario are presented in Tables 3 and 4. The other subsection is aimed at evaluating the introduced approach under two asymmetric variants of the Tversky loss. One is tuned to put a greater penalty on false positives \(\left\{ \gamma =10, \delta =1\right\}\), whereas the other increases the cost of false negatives \(\left\{ \gamma =1, \delta =10\right\}\). The outcome of the asymmetric experiment is shown in Tables 5 and 6. Both subsections share a common format of the result table. Namely, the table header holds the ordinal numbers of the assessed algorithms, which follow the numbering introduced in the description of the experimental setup. The second part of the table consists of 11 subsections, each related to a quality measure which is named in the subsection header. Each subsection contains three rows. The first one, denoted by Rnk, shows the average ranks achieved by the investigated multi-label classifiers over the benchmark sets. The second row presents the p value of the Friedman test (Frd) applied to the criterion-specific results. The last row contains the corrected (Holm’s correction) p values (Wp) produced by the Wilcoxon test, which was applied to compare the introduced learning procedure against the reference methods.
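The average ranks reported in the Rnk rows can be obtained as sketched below: on every dataset the algorithms are ranked by their loss (rank 1 for the lowest loss, tied algorithms sharing the mean rank), and the ranks are then averaged over datasets. The loss values are made up for the example and the function names are illustrative.

```python
def ranks_with_ties(losses):
    """Rank values so that the lowest loss gets rank 1; ties share the mean rank."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    ranks = [0.0] * len(losses)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value equals the current one.
        while end + 1 < len(order) and losses[order[end + 1]] == losses[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def average_ranks(loss_table):
    """loss_table[d][a] is the loss of algorithm a on dataset d."""
    per_dataset = [ranks_with_ties(row) for row in loss_table]
    return [sum(column) / len(per_dataset) for column in zip(*per_dataset)]


# Made-up losses of three algorithms on three datasets.
loss_table = [[0.2, 0.3, 0.3], [0.1, 0.4, 0.2], [0.5, 0.5, 0.1]]
print(average_ranks(loss_table))
```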

For easier interpretation, we provide a visualization of the data presented in the above-mentioned tables using radar plots. The plots are shown in Figs. 4, 5, 6 and 7. Each plot presents the average ranks achieved by the evaluated algorithms under different quality criteria. Additionally, the plots also contain a graphical representation of the performed Nemenyi post hoc procedure. Namely, the critical differences are denoted by black bars parallel to the criterion-specific axes in a radar plot.

In order to improve the readability of the paper, tables containing set-specific results for the Tversky and zero-one measures are presented in the “Appendix”. Results related to the symmetric scenarios are shown in Tables 7, 8, 9, 10, 11 and 12, whereas the results for the asymmetric scenarios are presented in Tables 13, 14, 15, 16, 17 and 18. The detailed description of those tables is provided in the “Appendix”.

5.1 The symmetric loss

Let us begin with the analysis of results related to the first symmetric scenario, which is the optimization of the Jaccard loss (\(\gamma =\delta =1\)). The outcome of the experiments is presented in Tables 3, 7, 8, 9 and Fig. 4.

From the point of view of this paper, we are interested in the lack of significant difference, in terms of Tversky-based loss functions, between the proposed method and the Dembczyński approach. The lack of a statistically significant difference is confirmed by both statistical tests; however, the average ranks may suggest that the proposed method performs slightly better. This observation leads to the conclusion that the divergence between the \(F_{1}\) measure and the Jaccard loss is not substantial enough to produce a significant difference between the investigated approaches. Additionally, the proposed method introduces an approximation error, and the presence of this error has influenced the results.

On the other hand, the proposed method achieves the best classification quality under the zero-one loss. Only the structured SVM approach is statistically comparable to the proposed one; however, the average rank of the proposed method is the lowest. This is an important result because the aforementioned criterion is the most restrictive loss function [33]. What is more, the mentioned measure is also the roughest quality indicator, as it is unable to distinguish a nearly correct outcome from a totally misclassified one. Thus, significant superiority suggests that our approach achieved the greatest number of ‘perfect match’ solutions.

Additionally, the experimental evaluation reveals that the proposed approximation procedure tends to be more conservative than the reference methods. In other words, the method reduces the false positive rate at the cost of increasing the false negative rate, and the imbalance factor is greater in comparison with the other algorithms. This fact is manifested by the low rank earned in terms of FDR and the relatively high rank under the FNR criteria. This slight shift towards the majority class (the label is irrelevant to a given object) should be interpreted as a beneficial property when the label density is low, which is a common situation under the multi-label framework. Moreover, the observed shift is not a symptom of a harmful bias towards the majority class because the method performs well with respect to the Hamming loss, the Jaccard loss and the zero-one loss.

Table 3 Summarized results for the algorithms tailored to optimize the \({\mathcal {T}}_{1,1}\) loss function
Fig. 4 Visualization of average ranks achieved by algorithms and corresponding critical distances for the Nemenyi post hoc test for the algorithms tailored to optimize the \({\mathcal {T}}_{1,1}\) (Jaccard) loss function. Each axis of the radar plot corresponds to given quality criterion. The closer a point is to the centre of the radar plot, the lower average rank is (lower is better). Algorithms are numbered as in the section that describes experimental setup. Investigated algorithms are also distinguished using different colours and point styles. Lines connecting algorithm-specific points are added only for visualization purpose. Black bars parallel to criterion-specific axes denote critical difference for the Nemenyi tests

The results related to the second symmetric scenario (\(\gamma =10, \delta =10\)) are presented in Tables 4, 10, 11, 12 and Fig. 5. The change in the values of the parameters does not induce any major change in the general behaviour of the proposed method. Namely, the algorithm is still the most conservative approach among the investigated procedures.

What is more, it maintains its leading position under the zero-one criterion. An explanation of this phenomenon may lie in the behaviour of \(T_{\gamma ,\delta }\) when the parameters \(\gamma\) and \(\delta\) are relatively high. Namely, under such conditions, the loss function becomes similar to the zero-one loss. The previously analysed example shows that even the algorithm focused on optimization of the Jaccard loss outperforms the remaining methods in terms of the zero-one loss.

The quality assessment in terms of the example-based Tversky loss shows that the proposed method significantly outperforms only the BR system and the Jansche classifier. This result is to be expected if we consider the assumption about the relationship between labels that lies at the root of the investigated algorithms. Strictly speaking, the outperformed methods, in contrast to the remaining procedures, expect the labels to be conditionally independent, which results in the creation of simplified models. As a consequence, the overall predictive performance of those approaches is generally lower. On the other hand, there are no significant differences between the proposed method and the other methods under label-based Tversky loss functions.

Equally important, the classification quality measured by the example-based Tversky loss has increased. This observation is a positive sign since the proposed method is tailored to optimize the aforesaid quality measure. Furthermore, this fact confirms the previously made conclusion that the predictive ability of the proposed approximation scheme will rise if the disagreement between the Tversky loss and the \(F_{1}\) measure increases. The graphical explanation of this phenomenon is provided by Fig. 1. That is, if the parameters of the Tversky loss become closer to \(\left\{ \gamma =0.5, \delta =0.5\right\}\), the curve related to \(T_{\gamma ,\delta }\) becomes flatter. As a consequence, the linear approximation becomes similar to the curve related to the \(F_1\) loss, and the predicted outcome is close to the class assignment provided by the Dembczyński algorithm.

On the other hand, there is still no significant difference between the proposed method and its counterparts in terms of macro-averaged and micro-averaged Tversky measures. Nonetheless, the average ranks show that the classification quality obtained by the Dembczyński algorithm may be a bit better.

Table 4 Summarized results for the algorithms tailored to optimize the \({\mathcal {T}}_{10,10}\) loss function
Fig. 5 Visualization of average ranks achieved by algorithms and corresponding critical distances for the Nemenyi post hoc test for the algorithms tailored to optimize the \({\mathcal {T}}_{10,10}\) loss function

5.2 The asymmetric loss

Under the first asymmetric scenario, we set the parameters of the Tversky loss to \(\gamma =10\) and \(\delta =1\), which should result in building classifiers that prefer more conservative solutions in comparison to the outcomes produced by the procedures that employ symmetric loss measures. Indeed, the obtained results, which are summarized in Tables 5, 13, 14, 15 and Fig. 6, confirm that the assessed procedures exhibit the expected behaviour. To be more precise, the evaluation in terms of the FNR criteria shows a significant decrease in comparison with the binary relevance system, which is a reference method that is not affected by the values of the parameters \(\gamma\) and \(\delta\). The observed cutback concerns all investigated classifiers, even those that were significantly better than BR under the symmetric scenario. Contrary to the results obtained using symmetric quality measures, the significant difference between the proposed method and the Dembczyński approach does not hold, although the corresponding p value is close to the assumed significance level. The disappearance of the significant differences can also be observed under the FDR criterion. As a consequence, the introduced approach is no longer the most conservative classifier. This phenomenon suggests that the considered asymmetry ratio has pushed the algorithms to the point where the precision cannot be further increased without significant deterioration of the recall.

Nevertheless, our approximation procedure still achieves the lowest average rank in terms of the example-based Tversky loss and the zero-one criterion. Generally speaking, the relative differences, which are described by average ranks, follow the pattern described in the section devoted to the analysis of the first symmetric scenario.

Table 5 Results for the algorithms tailored to optimize the \({\mathcal {T}}_{10,1}\) loss function
Fig. 6 Visualization of average ranks achieved by algorithms and corresponding critical distances for the Nemenyi post hoc test for the algorithms tailored to optimize the \({\mathcal {T}}_{10,1}\) loss function

The analysis of the second asymmetric scenario (\(\gamma =1,\delta =10\)) provides us with interesting information (Tables 6, 16, 17, 18 and Fig. 7). Namely, the attempt to reduce the false negative ratio at the cost of increasing the number of false positives leads to a significant decrease in the classification quality under each considered loss measure. The decline is confirmed by the Nemenyi post hoc procedure, which shows that all classifiers perform in a manner similar to the BR system, although under the symmetric experiment the algorithms were able to outperform the BR system. What is more, the significant differences between the proposed method and the reference approaches no longer hold. The experiment reveals that increasing \(\delta\) causes an undesirable rise in the number of labels which are considered relevant. This strategy proves to be inappropriate for multi-label datasets characterized by low label density.

Table 6 Summarized results for the algorithms tailored to optimize the \({\mathcal {T}}_{1,10}\) loss function
Fig. 7 Visualization of average ranks achieved by algorithms and corresponding critical distances for the Nemenyi post hoc test for the algorithms tailored to optimize the \({\mathcal {T}}_{1,10}\) loss function

6 Conclusions

In this paper, we addressed the issue of building a multi-label classifier that utilizes the Tversky loss. Our approach is built using a discrete linear approximation of the analysed loss function. The optimization procedure was performed using the inner–outer optimization technique proposed by Dembczyński [9] in order to find the optimal solution of the considered classification problem in terms of the \(F_{\beta }\) loss. Although our algorithm solves the optimization task in a suboptimal manner, it does not make any assumption about the label distribution. Additionally, the Tversky loss allows us to express other loss functions such as the \(F_{\beta }\) loss, the Jaccard loss or the zero-one loss. As a consequence, the introduced procedure is more flexible than the other approaches proposed in the literature. What is more, the computational cost of employing the modified procedure instead of the original one is relatively low.

During the experimental study, we obtained promising results which show that the posed problem was solved with moderate success. However, there is still room for improvement. Namely, results produced by the proposed approach are comparable (under the Tversky loss) to the outcome of the Dembczyński approach for the majority of experimental scenarios. What is more, under the example-based version of the Tversky measure, the average ranks suggest that the proposed method outperforms the reference method. It should also be noted that the quality of classification achieved by the proposed procedure was substantially higher when the parameters of the loss function were set to emulate the zero-one loss. More importantly, the introduced approximation scheme outperforms the other assessed methods with respect to the zero-one loss.

The presented results lead to cautious optimism, so we intend to continue the development of the presented approach. Our further studies will focus on different approximation techniques that are expected to reduce the approximation error. Another branch of our research is aimed at dealing with the imbalanced distribution problem, which is a common issue under the multi-label classification framework. This is an important problem because our approach is based on a series of binary-relevance-based estimations of posterior probability distributions, which are strongly affected by the uneven label distribution.