1 Introduction

Amongst other utility measures, the F\(_\beta \)-measure is commonly used as a performance metric for multi-label classification (MLC) problems, especially in the case of imbalanced label occurrences. Given a prediction \(\varvec{h}(\varvec{x}) = (h_1(\varvec{x}), \ldots , h_m(\varvec{x}))^{T} \) of an instance \(\varvec{x}\) with m-dimensional binary label vector \(\varvec{y} = (y_1, \ldots , y_m)^{T} \), where both \(\varvec{h}(\varvec{x})\) and \(\varvec{y}\) belong to \(\{0,1\}^m\), the F\(_\beta \)-measure is usually computed in an instance-wise manner:

$$\begin{aligned} F_{\beta }(\varvec{y},\varvec{h}(\varvec{x})) = \frac{(1 + \beta ^2) \sum _{i=1}^m y_i h_i(\varvec{x})}{\beta ^2\sum _{i=1}^m y_i + \sum _{i=1}^m h_i(\varvec{x})} \quad \in [0,1] , \end{aligned}$$
(1)

where \(0/0 = 1\) by definition. Alternative ways of computing the F\(_\beta \)-measure are macro-averaging, in which the F\(_\beta \)-measure is computed per label instead of per instance, and micro-averaging, in which the computation is done over the whole instance-label matrix of a predefined dataset. The instance-wise F\(_\beta \)-measure, which is highly relevant for many practical MLC problems, will be the focus of this work.
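For concreteness, the following is a minimal NumPy sketch of (1), including the \(0/0 = 1\) convention; the function name and array-based signature are our own:

```python
import numpy as np

def f_beta_instance(y, h, beta=1.0):
    """Instance-wise F_beta of Eq. (1); y and h are binary vectors of length m."""
    num = (1 + beta**2) * np.sum(y * h)
    den = beta**2 * np.sum(y) + np.sum(h)
    return 1.0 if den == 0 else num / den   # 0/0 = 1 by definition
```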

In recent years, specialized algorithms have been developed for optimizing the instance-wise F\(_\beta \)-measure. Roughly speaking, existing methods can be subdivided into two categories: utility maximization methods and decision-theoretic approaches. Algorithms in the first category aim to minimize a specific loss during the training phase. Many of them search for thresholds on scoring functions [1,2,3,4], but a few more involved approaches have been proposed as well [5, 6]. For the related problem of binary classification, F\(_\beta \)-measure maximization at training time can be achieved via extensions of logistic regression [7], boosting [8] or support vector machines [9, 10]. However, F\(_\beta \)-measure maximization is simpler in binary classification than in multi-label classification, because predictions for subsequent instances are independent, while predictions for subsequent labels are not.

Decision-theoretic methods depart from a different perspective. These methods usually fit a probabilistic model \(P(\varvec{y} \, | \,\varvec{x})\) to the data during training, followed by an inference procedure at prediction time. This inference procedure consists of solving the following optimization problem:

$$\begin{aligned} \varvec{h}_F(\varvec{x}) = \mathop {\mathrm {arg max}}\limits _{\varvec{h} \in \lbrace 0,1 \rbrace ^m} \mathbb {E}_{\varvec{Y} \, | \,\varvec{x}} \left[ F_{\beta }(\varvec{Y},\varvec{h})\right] = \mathop {\mathrm {arg max}}\limits _{\varvec{h} \in \lbrace 0,1 \rbrace ^m} \sum _{\varvec{y} \in \lbrace 0,1 \rbrace ^m}P(\varvec{y} \, | \,\varvec{x})\,F_{\beta }(\varvec{y},\varvec{h}), \end{aligned}$$
(2)

in which the ground-truth is a vector of random variables \(\varvec{Y}= (Y_1, Y_2, \ldots , Y_m)\), \(\mathbb {E}_{\varvec{Y} \, | \,\varvec{x}}\) denotes the expectation for an underlying probability distribution \(P\) over \(\{0,1\}^m\), and \(\varvec{h}\) denotes a potential prediction. This is a non-trivial optimization problem without closed-form solution. Moreover, a brute-force search requires checking all \(2^m\) combinations of \(\varvec{h}\) and summing over an exponential number of terms in each combination and is hence infeasible for moderate values of m [11].

For solving (2), one can distinguish approximate inference algorithms, such as those of [12,13,14,15,16,17], and Bayes optimal methods [18,19,20]. Approximate algorithms start from the assumption that the \(Y_i\) are independent, i.e.,

$$\begin{aligned} P(\varvec{y} \, | \,\varvec{x}) = \prod _{i=1}^m (p_i(\varvec{x}))^{y_i}(1 - p_i(\varvec{x}))^{1-y_i}, \end{aligned}$$
(3)

with \( p_i(\varvec{x}) = P(y_i = 1 \, | \,\varvec{x})\). In contrast, exact algorithms do not require the independence assumption, which is not realistic for many MLC problems. Optimization problem (2) seems to require information about the entire joint distribution \(P(\varvec{y} \, | \,\varvec{x})\). However, exact algorithms have been proposed that solve the problem in an efficient way, by estimating only a quadratic instead of an exponential (with respect to m) number of parameters of the joint distribution.

The main goal of this article is to provide additional insights on how the instance-wise F\(_\beta \)-measure can be optimized in the context of (convolutional) neural networks. Multi-label classification methods are commonly used in image analysis, for classical tasks such as tagging, segmentation or edge detection. In such studies the F\(_\beta \)-measure is often reported as a performance measure that reflects the practical performance of a classifier in a realistic way. However, the F\(_\beta \)-measure maximization methods discussed above have only been tested on simple MLC problems with shallow base learners that do not involve feature learning. Likewise, deep convolutional neural networks, which dominate the image classification landscape, usually only consider crude solutions when optimizing the F\(_\beta \)-measure. Researchers often stick to simple approaches that are easy to implement, while ignoring the shortcomings of those approximations. In a recent Kaggle competition that involved the multi-label classification of satellite images, one could observe that almost all top-scoring submissions applied simple thresholding strategies, which are known to be suboptimal. Only one author in the top ten reported improvement gains by testing something other than thresholding strategies. It is therefore interesting to investigate in a more systematic way how the instance-wise F\(_\beta \)-measure can be maximized in the context of deep neural networks.

This article is organized as follows. In Sect. 2, we will introduce neural network extensions of different algorithms, including several thresholding strategies, and approximate and exact inference methods. Moreover, we introduce a new model based on proportional odds to estimate the set of parameters of the joint label distribution that is required to perform exact inference with existing methods. All those methods have pros and cons, which we discuss without favoring any particular method from the outset. In Sect. 3, we present the results of a comparative experimental study on four image classification datasets, illustrating the behavior of the methods that we introduce. Our proportional odds model outperforms the alternatives in almost all scenarios. We end with a few clear conclusions.

2 Algorithms for Deep F\(_\beta \)-Measure Maximization

In this section we present six different algorithms that can be applied in tandem with (convolutional) neural networks to optimize the F\(_\beta \)-measure. To this end, we make a major distinction between three utility maximization methods and three decision-theoretic methods.

2.1 Utility Maximization Methods

When optimizing the F\(_\beta \)-measure during training with (deep) neural networks, engineers usually consider thresholding strategies on marginal probabilities via a simple line search. Other existing utility maximization methods usually lead to constrained optimization problems, making them not immediately applicable to neural network training. We present three algorithms that seek to optimize the F\(_\beta \)-measure by applying specific thresholds to the predicted marginal probabilities \(p_1(\varvec{x}),\ldots ,p_m(\varvec{x})\). To this end, we assume that those marginals are modelled with a (convolutional) neural network with one output neuron per label, obtained via a logistic output layer:

$$\begin{aligned} p_i(\varvec{x}) = \frac{\exp (\varvec{w}_i^T \varvec{\phi }(\varvec{x};\varvec{\psi }))}{1+\exp ( \varvec{w}_i^T \varvec{\phi }(\varvec{x};\varvec{\psi }))}, \end{aligned}$$
(4)

in which \(\varvec{w}_i\) denotes a label-specific parameter vector, and \(\varvec{\phi }\) denotes the map from the input layer to the one-but-last layer, parameterized by a parameter set \(\varvec{\psi }\).

This approach is in multi-label classification often referred to as binary relevance (BR). In the results section, the three BR-inspired algorithms will be referred to as threshold averaging (BR\(_{\mathrm {t}}^{\mathrm {avg}}\)), global thresholding (BR\(_{\mathrm {t}}^{\mathrm {glob}}\)) and threshold stacking (BR\(_{\mathrm {t}}^{\mathrm {stack}}\)), respectively.

Threshold Averaging (BR\(_\mathrm {\mathbf {t}}^\mathrm {\mathbf {avg}}\)). The first thresholding approach consists of computing a specific optimal threshold \(\theta _{*}^{(i)}\) for each instance \(\varvec{x}^{(i)}\) during training time. The algorithm passes over the data exactly once and considers the marginal probabilities \(p_1(\varvec{x}^{(i)}),\ldots ,p_m(\varvec{x}^{(i)})\) in decreasing order as candidate thresholds. At test time, the average optimal threshold over the training dataset is applied as a common threshold. Algorithm 1 provides pseudocode for a single instance; the algorithm can be applied on an entire training dataset with \(\mathcal {O}(mn)\) time complexity, by vectorizing the counter variables.

[Algorithm 1: computing the instance-wise optimal threshold \(\theta _{*}^{(i)}\); pseudocode not reproduced]
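A minimal NumPy sketch of this procedure for a single instance follows; returning a threshold of 1.0 for an all-negative label vector is our reading of the \(0/0 = 1\) convention (an empty prediction is then optimal):

```python
def instance_optimal_threshold(p, y, beta=1.0):
    """Optimal instance-wise threshold; candidates are the sorted marginals p."""
    if y.sum() == 0:
        return 1.0                         # empty prediction achieves F = 1 (0/0 = 1)
    order = np.argsort(-p)                 # marginals in decreasing order
    tp = np.cumsum(y[order])               # true positives when predicting the top-k labels
    k = np.arange(1, len(p) + 1)
    f = (1 + beta**2) * tp / (beta**2 * y.sum() + k)
    return p[order][int(np.argmax(f))]     # threshold at the best k-th largest marginal
```

Applied to all n training instances, this matches the \(\mathcal {O}(mn)\) complexity mentioned above, up to the per-instance sorting of the marginals.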

Global Thresholding (BR\(_\mathrm {\mathbf {t}}^\mathrm {\mathbf {glob}}\)). Algorithm 2 directly finds a single globally optimal threshold \(\theta _{*}\) at training time. The method acts on the entire training dataset by concatenating all marginal probabilities \(p_1(\varvec{x}),\ldots ,p_m(\varvec{x})\) for different \(\varvec{x}\), and considering each value as a candidate threshold. This second thresholding method seeks to improve over the previous one by considering many more candidate thresholds. However, this comes at the expense of increased time complexity. Sorting the vector of all marginals takes \(\mathcal {O}(mn \log (mn))\) time, and the computation of the optimal threshold takes \(\mathcal {O}(m^2n)\). Either of these two terms may dominate, depending on m and n. Let us remark that Algorithm 2 could be substantially simplified if the macro or micro F\(_\beta \)-measure were optimized instead of the instance-wise F\(_\beta \)-measure. For the instance-wise F\(_\beta \)-measure one needs to keep track of the score of every instance individually for different thresholds, resulting in a higher time complexity than for the micro and macro F\(_\beta \)-measures. Algorithmically, too, threshold-based optimization of the latter two measures is easier.

[Algorithm 2: computing the globally optimal threshold \(\theta _{*}\); pseudocode not reproduced]
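As a transparent but deliberately naive sketch of the same idea, the version below re-evaluates the mean instance-wise F\(_\beta \)-measure from scratch for every candidate, which costs \(\mathcal {O}(m^2n^2)\); the incremental bookkeeping over sorted marginals in Algorithm 2 is what brings this down to the complexities quoted above:

```python
def global_optimal_threshold(P, Y, beta=1.0):
    """P, Y: (n, m) arrays of predicted marginals and binary labels."""
    best_t, best_f = 0.5, -1.0
    n_pos = Y.sum(axis=1)
    for t in np.unique(P):                       # every predicted marginal is a candidate
        H = (P >= t)
        tp = (H & (Y == 1)).sum(axis=1)
        den = beta**2 * n_pos + H.sum(axis=1)
        f = np.where(den == 0, 1.0,              # 0/0 = 1 convention
                     (1 + beta**2) * tp / np.maximum(den, 1))
        if f.mean() > best_f:
            best_f, best_t = f.mean(), t
    return best_t
```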

Threshold Stacking (BR\(_\mathrm {\mathbf {t}}^\mathrm {\mathbf {stack}}\)). The final thresholding method presented here tries to predict the instance-wise optimal thresholds for each test instance, in an approach similar to stacking, see e.g. [21, 22]. A set of marginal probabilities and optimal thresholds \(\{(p(\varvec{x}^{(1)}), \theta _{*}^{(1)}),\ldots ,(p(\varvec{x}^{(n)}), \theta _{*}^{(n)})\}\) is obtained via Algorithm 1 and serves as training data to learn a mapping from probability vectors to thresholds. As such, one ends up with a stacked model structure:

$$ \varvec{x} \mapsto p_1(\varvec{x}),\ldots ,p_m(\varvec{x}) \mapsto \theta _{*}(\varvec{x}).$$

The first mapping consists of a (convolutional) neural network that predicts marginal probabilities, and the second mapping is a ridge regression model that transforms the distribution over marginals into a distribution-specific threshold. Since the distribution of marginal probabilities depends on \(\varvec{x}\), one can argue that the predicted threshold is instance-specific.
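A compact sketch of the stacked structure, reusing the `instance_optimal_threshold` sketch above; sorting each marginal vector before regression is our own design choice, so that the regressor sees the distribution of marginals irrespective of label identity:

```python
from sklearn.linear_model import Ridge

# P_train, Y_train, P_test: arrays of predicted marginals and binary labels (assumed given)
thetas = np.array([instance_optimal_threshold(p, y, beta=1.0)
                   for p, y in zip(P_train, Y_train)])
reg = Ridge(alpha=1.0).fit(np.sort(P_train, axis=1), thetas)
theta_test = reg.predict(np.sort(P_test, axis=1))   # one threshold per test instance
```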

2.2 Decision-Theoretic Methods

We present in total three algorithms that optimize the F\(_\beta \)-measure from a decision-theoretic perspective, using so-called plug-in classifiers, i.e. classifiers that fit a probabilistic model at training time, followed by an inference phase at test time. We first discuss an approach that starts from marginal probabilities and optimizes (2) in an approximate way by assuming label independence. This approach will be referred to as the label independence F\(_\beta \) plug-in classifier (LFP). Subsequently, we introduce two methods that do not have this restriction and provide exact solutions for (2). These methods do not plug in m estimated marginal probabilities, but rather a set of \(m^2 +1\) parameters of the joint distribution. To this end, we propose a neural network architecture with an output layer that is modified compared to the models typically used for BR estimation of marginal probabilities. The two exact methods differ in the hypothesis class that is considered.

All the methods in this section rely on solving (2) via outer and inner maximization. Let \(H_k\) denote the space of all possible predictions that contain exactly k positive labels: \(H_k = \lbrace \varvec{h} \in \lbrace 0,1 \rbrace ^m \, \vert \, \sum _{i=1}^m h_i = k \rbrace \). The inner maximization then solves

$$\begin{aligned} \varvec{h}_k(\varvec{x}) = \mathop {\mathrm {arg max}}\limits _{\varvec{h} \in H_k}~\mathbb {E}_{\varvec{Y} \, | \,\varvec{x}} \left[ F_{\beta }(\varvec{Y}, \varvec{h}) \right] , \end{aligned}$$
(5)

for each k. Subsequently, the outer maximization seeks to find the F\(_{\beta }\)-maximizer \(\varvec{h}_F\):

$$\begin{aligned} \varvec{h}_F(\varvec{x}) = \mathop {\mathrm {arg max}}\limits _{\varvec{h} \in \lbrace \varvec{h}_0(\varvec{x}), \ldots , \varvec{h}_m(\varvec{x}) \rbrace } \mathbb {E}_{\varvec{Y} \, | \,\varvec{x}} \left[ F_{\beta }(\varvec{Y}, \varvec{h}) \right] . \end{aligned}$$
(6)

The solution to (6) is found by checking all \(m+1\) possibilities. The algorithms discussed below differ in the way they solve the inner maximization (5).

Label Independence F\(_\beta \) Plug-In Classifier (LFP). By assuming independence of the random variables \(Y_1,\ldots ,Y_m\), optimization problem (5) can be substantially simplified. It has been shown independently in [12] and [14] that the optimal solution then always contains the labels with the highest marginal probabilities, or no labels at all.

Theorem 1

[12]. Let \(Y_1, Y_2, \ldots , Y_m\) be independent Bernoulli variables with parameters \(p_1, p_2, \ldots , p_m\) respectively. Then, for all \(j,k \in \{1, \ldots , m\}\), \(h_{F,j} = 1\) and \(h_{F,k} = 0\) implies \(p_j \ge p_k\).

As a consequence, only a few hypotheses \(\varvec{h}\) (\(m\!+\!1\) instead of \(2^m\)) need to be examined, and the computation of the expected F\(_\beta \)-measure can be performed in an efficient way. [13,14,15,16] have proposed exact procedures for computing the F\(_{\beta }\)-maximizer under the assumption of label independence. All those methods take as input predicted marginal probabilities \((p_1, p_2, \ldots , p_m)\) with shorthand notation \(p_i=p_i(\varvec{x})\), and they all obtain the same solution. In what follows we only discuss the method of [16], which is the most efficient among the four implementations. This method only works for rational \(\beta ^2\); in other cases a less efficient algorithm can be used.

As a starting point, let us assume that the labels are sorted according to the marginal probabilities and let \(\varvec{h}_k(\varvec{x})\) be the prediction that returns a one for the labels with the k highest marginal probabilities and zero for the other labels. Furthermore, let \(s^{\varvec{y}}_{i:j} = \sum _{l=i}^j y_l\), then one can observe that

$$\begin{aligned} \mathbb {E} \left[ F_{\beta }(\varvec{Y}, \varvec{h}_k(\varvec{x})) \right] &= \sum _{\varvec{y} \in \{0,1\}^m} F_{\beta }(\varvec{y},\varvec{h}_k(\varvec{x}))\, P(\varvec{y} \, | \,\varvec{x}) \\ &= \sum _{\begin{array}{c} 0 \le k_1 \le k \\ 0 \le k_2 \le m-k \end{array}} \frac{P(s^{\varvec{y}}_{1:k} =k_1)\, P(s^{\varvec{y}}_{k+1:m} = k_2)\, (1+ \beta ^2) k_1}{k +\beta ^2(k_1+k_2)} \\ &= \sum _{k_1 = 0}^k (1+\beta ^{-2})\, k_1\, P(s^{\varvec{y}}_{1:k}=k_1)\, s(k, k\beta ^{-2} + k_1) \,, \end{aligned}$$
(7)

where \(s(k,\alpha ) = \sum _{k_2 = 0}^{m-k} P(s^{\varvec{y}}_{k+1:m} = k_2 )/(\alpha + k_2 )\). Now observe that

$$P(s^{\varvec{y}}_{k:m} = i) = p_k P(s^{\varvec{y}}_{k+1:m} = i-1) + (1-p_k)P(s^{\varvec{y}}_{k+1:m} = i) \,.$$

As a result, the s-values for different values of k in (7) can be computed recursively:

$$\begin{aligned} s(k-1,\alpha ) &= \sum _{k_2 = 0}^{m-k+1} \frac{P(s^{\varvec{y}}_{k:m} = k_2 )}{\alpha +k_2} \\ &= p_k \sum _{k_2 = 0}^{m-k+1} \frac{P(s^{\varvec{y}}_{k+1:m} = k_2-1 )}{\alpha +k_2} + (1-p_k) \sum _{k_2 = 0}^{m-k+1} \frac{P(s^{\varvec{y}}_{k+1:m} = k_2 )}{\alpha +k_2} \\ &= p_k \sum _{k_2 = 0}^{m-k} \frac{P(s^{\varvec{y}}_{k+1:m} = k_2 )}{\alpha +k_2+1} + (1-p_k) \sum _{k_2 = 0}^{m-k} \frac{P(s^{\varvec{y}}_{k+1:m} = k_2 )}{\alpha +k_2} \\ &= p_k s(k,\alpha +1) + (1-p_k)s(k,\alpha ) \,, \end{aligned}$$

with \(s(m,\alpha ) = 1/\alpha \) and \(s(k,\alpha ) = 0\) when \(k < 0\) or \(k >m\). Remark that the transition from the second to the third line follows from an index change.

The recursive formula suggests a dynamic programming implementation with k ranging from \(k=m\) to \(k=1\), as given in Algorithm 3. Here we first introduce a list of lists, using double indexing, such that \(L[k][j] = P(s^{\varvec{y}}_{1:k} =j)\) with \(j \in \{-1, 0, \ldots , k+1\}\). This data structure can also be initialized via dynamic programming:

$$\begin{aligned} L[k][j] &= p_k P(s^{\varvec{y}}_{1:k-1} =j-1) + (1-p_k) P(s^{\varvec{y}}_{1:k-1} =j) \\ &= p_k L[k-1][j-1] + (1-p_k) L[k-1][j] \end{aligned}$$

using \(L[1] = [0, (1-p_1), p_1, 0]\) and \(L[k][-1] = L[k][k+1] = 0\). After initializing those lists, one can proceed with computing \(s(k,\alpha )\) for rational \(\beta ^2\). To this end, we introduce \(S[i] = s(k,i/q)\) with \(\beta ^2 = q/r\), which leads to the implementation given in Algorithm 3. Further speed-ups can be obtained via Taylor series approximations, which might be useful when m becomes very large.

[Algorithm 3: exact computation of the F\(_\beta \)-maximizer under label independence; pseudocode not reproduced]
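To make the inner and outer maximization concrete, the following NumPy sketch evaluates the second line of (7) directly from the two label-sum distributions, which costs \(\mathcal {O}(m^3)\); Algorithm 3 with the s-recursion above is more efficient, but this version is easier to verify:

```python
def lfp_predict(p, beta=1.0):
    """Exact F_beta maximizer under label independence (naive O(m^3) sketch)."""
    m = len(p)
    order = np.argsort(-p)
    ps = p[order]                                     # sorted marginals
    # L[k, j] = P(s_{1:k} = j) and R[k, j] = P(s_{k+1:m} = j), by dynamic programming
    L = np.zeros((m + 1, m + 1)); L[0, 0] = 1.0
    R = np.zeros((m + 1, m + 1)); R[m, 0] = 1.0
    for k in range(1, m + 1):
        L[k, 1:] = ps[k - 1] * L[k - 1, :-1]
        L[k] += (1 - ps[k - 1]) * L[k - 1]
    for k in range(m - 1, -1, -1):
        R[k, 1:] = ps[k] * R[k + 1, :-1]
        R[k] += (1 - ps[k]) * R[k + 1]
    best_k, best_f = 0, np.prod(1 - ps)               # h_0: E[F] = P(y = 0 | x)
    for k in range(1, m + 1):
        k1 = np.arange(k + 1)[:, None]                # positives among the top k
        k2 = np.arange(m - k + 1)[None, :]            # positives among the remaining labels
        ef = np.sum(L[k, :k + 1, None] * R[k, None, :m - k + 1]
                    * (1 + beta**2) * k1 / (beta**2 * (k1 + k2) + k))
        if ef > best_f:
            best_f, best_k = ef, k
    h = np.zeros(m, dtype=int)
    h[order[:best_k]] = 1                             # top-k labels by marginal probability
    return h, best_f
```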

General F\(_\beta \) Maximizer (GFM). The algorithm explained above assumes that the labels are independent, so that only marginals need to be modelled in order to solve the inner problem (5). In what follows we discuss two different extensions of an alternative algorithm that does not assume label independence [19]. The algorithm is Bayes optimal for any probability distribution, but the price one has to pay for this is that more parameters of \(P(\varvec{y} \, | \,\varvec{x})\) must be estimated. As a starting point, we introduce the following shorthand notations:

$$ s^{\varvec{y}} = s^{\varvec{y}}_{1:m} \quad \text{ and } \quad \varDelta _{ik} = \sum _{\varvec{y}: y_i = 1} \frac{P(\varvec{y}\, | \,\varvec{x})}{\beta ^2 s^{\varvec{y}} + k} \,. $$

By plugging (1) into (5), one can write

$$\begin{aligned} \varvec{h}_k = \mathop {\mathrm {arg max}}\limits _{\varvec{h}\in H_k} \sum _{\varvec{y} \in \lbrace 0,1 \rbrace ^m} \frac{(1 + \beta ^2) \sum _{i=1}^m y_i h_iP(\varvec{y}\, | \,\varvec{x})}{\beta ^2 s^{\varvec{y}} + k} \,. \end{aligned}$$
(8)

Swapping the sums in (8) leads to

$$\begin{aligned} \varvec{h}_k &= \mathop {\mathrm {arg max}}\limits _{\varvec{h}\in H_k} (1 + \beta ^2) \sum _{i=1}^m h_i \sum _{\varvec{y} \in \lbrace 0,1 \rbrace ^m} \frac{y_i P(\varvec{y}\, | \,\varvec{x})}{\beta ^2 s^{\varvec{y}} + k} \nonumber \\ &= \mathop {\mathrm {arg max}}\limits _{\varvec{h}\in H_k} (1 + \beta ^2) \sum _{i=1}^m h_i\varDelta _{ik} \,. \end{aligned}$$
(9)

The inner maximization is solved by setting \(h_i = 1\) for the top k values of \(\varDelta _{ik}\). For each \(\varvec{h}_k\), \(\mathbb {E}\left[ F_{\beta }(\varvec{Y}, \varvec{h}_k) \right] \) is stored and used to solve the outer maximization. For the specific case of \(\varvec{h}_0\), \(\mathbb {E}\left[ F_{\beta }(\varvec{Y}, \varvec{h}_0) \right] \) equals \(P(\varvec{y}= \varvec{0} \, | \,\varvec{x})\), which needs to be estimated separately. Algorithm 4 provides pseudocode for the complete procedure. This algorithm requires \(\varDelta _{ik}\) for \(1 \le i,k \le m\) and \(P(\varvec{y} = \varvec{0} \, | \,\varvec{x})\), that is, \(m^2 + 1\) parameters, to obtain \(\varvec{h}_F\). With these parameters, the solution can be obtained in \(\mathcal {O}(m^2)\) time: the dominating part of the procedure is the inner maximization, where for each k a selection of the top k elements must be made, which can be accomplished in linear time. Thus, compared to the approach that assumes label independence, more parameters need to be estimated. The advantage of not imposing any distributional assumptions comes with a more difficult estimation problem as a downside. Depending on the distributional properties of a specific dataset, either algorithm may therefore outperform the other.

[Algorithm 4: the general F\(_\beta \) maximizer (GFM); pseudocode not reproduced]
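A compact NumPy sketch of this procedure, continuing the sketches above; the column convention `Delta[i, k-1]` \(= \varDelta _{ik}\) is ours:

```python
def gfm_predict(Delta, p_zero, beta=1.0):
    """General F_beta maximizer; Delta is (m, m), p_zero estimates P(y = 0 | x)."""
    m = Delta.shape[0]
    best_h, best_f = np.zeros(m, dtype=int), p_zero        # h_0: E[F] = P(y = 0 | x)
    for k in range(1, m + 1):
        top = np.argpartition(-Delta[:, k - 1], k - 1)[:k] # top-k labels, linear time
        ef = (1 + beta**2) * Delta[top, k - 1].sum()       # expected F_beta, cf. (9)
        if ef > best_f:
            best_f = ef
            best_h = np.zeros(m, dtype=int)
            best_h[top] = 1
    return best_h, best_f
```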

Estimating \(\varDelta \) with Multinomial Regression (GFM\(_{\mathrm {\mathbf {MR}}}\)). [19] proposed the following scheme to estimate the probabilities \(\varDelta _{ik}\). Let \(\varvec{P}\) and \(\varvec{W}\) denote two \(m\times m\) matrices with elements

$$ p_{is} = P(y_i = 1, s^{\varvec{y}} = s \, | \,\varvec{x}), \quad w_{rk} = (\beta ^2 r + k)^{-1}, $$

respectively. Then, the \(m \times m\) matrix \(\varvec{\varDelta }\) with elements \(\varDelta _{ik}\) can be obtained by

$$ \varvec{\varDelta } = \varvec{P} \varvec{W}. $$
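In code, given an estimate of \(\varvec{P}\), the matrix \(\varvec{\varDelta }\) is one matrix product away (continuing the NumPy sketches, with `beta` and `m` as before and `P_hat` an assumed estimate of \(\varvec{P}\)):

```python
r = np.arange(1, m + 1)
W = 1.0 / (beta**2 * r[:, None] + r[None, :])   # w_rk = (beta^2 r + k)^(-1)
Delta = P_hat @ W                               # feed to gfm_predict together with p_zero
```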

When using simple base learners, one can proceed to estimate \(\varvec{P}\) by reducing the problem to m independent problems, each with up to \(m+1\) classes. Each subproblem i involves the estimation of

$$\begin{aligned} P(y = \![\![y_i = 1 \!]\!] \cdot s^{\varvec{y}} \, | \,\varvec{x}), \quad \forall \, y \in \{0,\ldots ,m\}, \end{aligned}$$
(10)

which sum to one. The subproblems can hence be solved with multinomial regression. For \(y \in \{1,\ldots ,m\}\), these probabilities make up the elements of the rows of \(\varvec{P}\). Similarly to the deep neural network used for estimating marginal probabilities, we model the i-th row of \(\varvec{P}\) via a softmax layer:

$$p_{is}(\varvec{x}) = \frac{\exp (\varvec{w}_{is}^T\varvec{\phi }(\varvec{x};\varvec{\psi }))}{ \sum _{s'=0}^m\exp ( \varvec{w}_{is'}^T \varvec{\phi }(\varvec{x};\varvec{\psi }))},$$

with \(i=1,\ldots ,m\), \(\varvec{w}_{is}\) parameter vectors, and \(\varvec{\phi }\) the map that originates from the feature learning phase, again parameterized by parameter set \(\varvec{\psi }\).

It should be noted that \(s^{\varvec{y}}\) equals m only in the worst case, where an instance is attributed all possible labels. This is rarely encountered in practice. Let \( s_{m} = \max _{1 \le j \le n} \sum _{i=1}^{m}y_i^{(j)}, \) then the total number of output classes for each subproblem (10) can be reduced to \(s_m + 1\). Nevertheless, fitting each of the m multinomial regression problems independently is undesirable when the cost of training the base learners becomes higher, as with deep (convolutional) neural networks, especially for large m. We propose the natural solution of estimating \(\varvec{P}\) in its entirety as the output of a single neural network. The two-dimensional final layer of the network should contain m rows of \( s_{m} + 1\) output neurons, where a row-wise softmax transformation is applied. The loss to be minimized during training is then composed of m cross-entropy losses, which can be minimized using stochastic gradient descent. The \(m^2\) entries required for \(\varvec{P}\) can be obtained from the output of the network by discarding the first column and adding \(m - s_m\) columns of zeros.
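A sketch of such a head in Keras, with `d`, `m` and `s_m` assumed given; the target tensor for an instance \((\varvec{x},\varvec{y})\) would have row i one-hot at position \([\![y_i = 1 ]\!] \cdot s^{\varvec{y}}\), so that the categorical cross-entropy decomposes into the m per-row losses described above:

```python
from tensorflow.keras import layers, models

feat = layers.Input(shape=(d,))                 # shared feature map phi(x; psi)
logits = layers.Dense(m * (s_m + 1))(feat)
probs = layers.Softmax(axis=-1)(
    layers.Reshape((m, s_m + 1))(logits))       # row-wise softmax: one multinomial per label
head = models.Model(feat, probs)
head.compile(optimizer="adam",
             loss="categorical_crossentropy")   # averages the m per-row cross-entropies
```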

Estimating \(\varDelta \) with Ordinal Regression (GFM\(_{\mathrm {\mathbf {OR}}}\)). Additionally, we propose to reformulate the problem of estimating the elements of \(\varDelta \) as an ordinal regression problem. The key insight is to factorize the probabilities \(p_{is}\) as follows:

$$ p_{is} = P(y_i = 1, s^{\varvec{y}} = s \, | \,\varvec{x}) = P(s^{\varvec{y}} = s \, | \,y_i = 1, \varvec{x}) \, P(y_i = 1 \, | \,\varvec{x}). $$

As before, \(P(y_i = 1 \, | \,\varvec{x})\) can be estimated by means of BR. In the conditional probability \(P(s^{\varvec{y}} = s \, | \,y_i = 1, \varvec{x})\), \(s^{\varvec{y}}\) can take on values from 1 to \(s_m\). By exploiting the ordinal nature of the variable \(s^{\varvec{y}}\), one can estimate the conditional probability with proportional odds models, while reducing the number of parameters, compared to GFM\(_{\mathrm {MR}}\) [23]. After estimating these conditional probabilities, they can be multiplied with the marginals to obtain the probabilities \(p_{is}\) required for GFM.

Taking into account the conditioning on \(y_{i}=1\), one could choose to estimate m independent proportional odds models. However, we will consider a global proportional odds model, consisting of m proportional odds submodels which are optimized jointly in a multi-task learning fashion. As such, the i-th submodel is characterized by \(s_{m}\) classes, a parameter vector \(\varvec{w}_{i}\) and a vector of bias terms \( \varvec{b}^{(i)} = (b_0^{(i)}, b_1^{(i)},\ldots , b_{s_m}^{(i)})\), subject to

$$\begin{aligned} b_0^{(i)}< b_1^{(i)}< \dots < b_{s_m}^{(i)}, \end{aligned}$$
(11)

with \(b_0^{(i)} = -\infty \) and \(b_{s_m}^{(i)} = \infty \).

Formally speaking, the i-th proportional odds model will estimate the cumulative probabilities

$$\begin{aligned} P(s^{\varvec{y}} \le s \, | \,y_{i}=1, \varvec{x}) = \frac{\exp (b_s^{(i)} - \varvec{w}_{i}^{T} \varvec{\phi }(\varvec{x};\varvec{\psi }))}{1+\exp (b_s^{(i)} - \varvec{w}_{i}^{T} \varvec{\phi }(\varvec{x};\varvec{\psi }))},\ \text {for}\ i\in \{1,\ldots ,m\}\,, \end{aligned}$$
(12)

where we depart from some learnable feature representation \(\varvec{\phi }(\varvec{x};\varvec{\psi })\), as in the other methods. Consequently, the conditional distribution of \(s^{\varvec{y}}\) can then be retrieved as follows:

$$\begin{aligned} P(s^{\varvec{y}} = s \, | \,y_{i}=1, \varvec{x}) = P(s^{\varvec{y}} \le s \, | \,y_{i}=1, \varvec{x}) - P(s^{\varvec{y}} \le s-1 \, | \,y_{i}=1, \varvec{x}). \end{aligned}$$

Furthermore, we estimate the model parameters in (12) jointly for \(i \in \{1,\ldots ,m\}\), by minimizing the following truncated negative log-likelihood:

$$\begin{aligned} \mathop {\mathrm {arg min}}\limits _{\varvec{W}, \varvec{B}, \varvec{\psi }} \Bigg ( -\! \sum _{n}\sum _{i=1}^{m}\sum _{s=1}^{s_{m}}\!I_{nis}\! \ T\Big (P(s^{\varvec{y}} = s \, | \,y_{i}=1, \varvec{x} )\Big )\Bigg ), \end{aligned}$$
(13)

with \(\varvec{W}=(\varvec{w}_{1},\ldots ,\varvec{w}_{m})\), \(\varvec{B}=(\varvec{b}^{(1)},\ldots ,\varvec{b}^{(m)})\) and \(I_{nis}\) a binary indicator, which is one when the n-th training instance \((\varvec{x},\varvec{y})\) has \(y_i = 1\) and \(s^{\varvec{y}} = s\). T is a transformation function

$$ T(z;\epsilon ) = {\left\{ \begin{array}{ll} \log \epsilon &{} \text {if}\ z\le 0, \\ \log z &{} \text {if}\ z>0, \end{array}\right. } $$

for \(\epsilon > 0\), that defines a truncated log-likelihood.

With this transformation, (13) can be seen as a modified negative log-likelihood of the proportional odds model. The truncation is needed to guarantee numerical stability of the optimization algorithm, in case \(P(s^{\varvec{y}} = s \, | \,y_{i}=1, \varvec{x})\) becomes negative. This might happen in the early optimization steps, as (11) is not necessarily obeyed. Moreover, when the ordering constraint on the thresholds is not fulfilled, this is directly penalized by the truncated log-likelihood, provided that \(\epsilon \) is chosen sufficiently small, e.g. \(\epsilon =10^{-10}\). The truncated log-likelihood hence has a similar effect as the logarithmic barrier penalty terms that are sometimes used to enforce monotonicity constraints such as (11).
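A sketch of this head and loss in Keras, with `d`, `m` and `s_m` assumed given as before; the thresholds are left unconstrained, precisely because violations of (11) are handled by the truncated likelihood:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

class OrdinalCDF(layers.Layer):
    """Cumulative probabilities P(s <= s' | y_i = 1, x) with free thresholds b."""
    def build(self, input_shape):
        self.b = self.add_weight(shape=(m, s_m - 1), initializer="zeros", name="b")
    def call(self, score):                           # score: (batch, m, 1)
        cdf = tf.sigmoid(self.b - score)             # s' = 1, ..., s_m - 1
        zeros = tf.zeros_like(cdf[..., :1])          # b_0 = -inf  => P(s <= 0) = 0
        ones = tf.ones_like(cdf[..., :1])            # b_{s_m} = inf => P(s <= s_m) = 1
        cdf = tf.concat([zeros, cdf, ones], axis=-1)
        return cdf[..., 1:] - cdf[..., :-1]          # P(s = s'); may dip below 0 early on

feat = layers.Input(shape=(d,))
score = layers.Reshape((m, 1))(layers.Dense(m)(feat))   # w_i^T phi(x), one per label
probs = OrdinalCDF()(score)
head = models.Model(feat, probs)

def truncated_nll(y_true, y_pred, eps=1e-10):
    """Eq. (13): y_true holds the indicators I_{nis}; rows with y_i = 0 are all-zero."""
    logp = tf.where(y_pred > 0.0,
                    tf.math.log(tf.maximum(y_pred, eps)),
                    tf.fill(tf.shape(y_pred), tf.math.log(eps)))
    return -tf.reduce_sum(y_true * logp, axis=[1, 2])

head.compile(optimizer="adam", loss=truncated_nll)
```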

Although GFM\(_{\mathrm {OR}}\) needs fewer parameters to estimate \(p_{is}\) than GFM\(_{\mathrm {MR}}\), it requires m values for the marginals as additional input. In case a separate model is used to estimate the marginals (starting from the same feature representation of size d), the parameter requirements for BR + GFM\(_{\mathrm {OR}}\) are \(dm + m + dm + m(s_m - 1)\), which boils down to \(m \times (2d + s_m)\). This number is still lower than the number of parameters required for GFM\(_{\mathrm {MR}}\), which can be rewritten as \(m \times ((s_m+1)d + s_m +1)\).

3 Empirical Analysis

3.1 Experimental Setup

We compare the discussed methods by means of empirical evaluation on real-world datasets. Estimates of the marginal probability vectors are made by means of BR in the form of a convolutional neural network with m output nodes subject to a sigmoid non-linearity in the output layer, as given in Eq. 4. Our results include the F\(_{\beta }\)-measure scores obtained with BR, without any form of F\(_{\beta }\)-measure maximization. The marginal probabilities for the training data obtained by BR serve as input for the thresholding methods BR\(_{\mathrm {t}}^{\mathrm {avg}}\), BR\(_{\mathrm {t}}^{\mathrm {glob}}\) and BR\(_{\mathrm {t}}^{\mathrm {stack}}\). GFM\(_\mathrm {MR}\) and GFM\(_\mathrm {OR}\) estimate the \(m^2 + 1\) parameters of \(P(\varvec{y} \, | \,\varvec{x})\), required for GFM, with multinomial regression and proportional odds, respectively. The GFM algorithm is then used in tandem with these methods to obtain optimal predictions. Finally, the LFP method starts from the marginal probabilities obtained with BR for the test data.

We report both the F\(_1\) and F\(_2\)-measure scores obtained on four publicly available multi-label classification image datasets: PASCAL VOC 2007 [24], PASCAL VOC 2012 [25], Microsoft COCO [26] and the Kaggle Planet dataset [27]. We use the recommended train-val-test split for VOC 2007 and perform custom training/validation splits for the other datasets. Table 1 provides some summarizing statistics. All experiments were carried out on a single NVIDIA GTX 1080Ti GPU. All algorithms were implemented in Python using TensorFlow [28], Keras [29] and Pytorch [30].

For each dataset, the features are vectors of size 512, obtained by resizing the images to 224\(\,\times \,\)224 pixels and passing them through the convolutional part of a VGG16 architecture, including the final max-pooling operation [31]. The fully connected classification layers of the original architecture are replaced by a single fully connected layer with 128 neurons (ReLU activation), followed by either a single-layer BR, GFM\(_{\mathrm {MR}}\) or GFM\(_{\mathrm {OR}}\) classifier, as described in Sect. 2. The weights and biases of the convolutional part are initialized to those obtained by training the network on ImageNet; these are publicly available and accessible through the Keras API. First, the convolutional layers are kept fixed and the fully connected classification layers are trained until convergence; then, the entire network is fine-tuned with a lower learning rate. The BR estimator consists of a single-layer neural network with m output nodes. Likewise, the GFM\(_{\mathrm {MR}}\) and GFM\(_{\mathrm {OR}}\) models consist of single-layer neural networks parameterized as described in the previous section. A small amount of dropout regularization (dropout probability 0.2) was applied at the input level. All models were trained with the Adam optimization algorithm (learning rate \(10^{-3}\)), and in both training stages early stopping was applied with a patience counter of five epochs.
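A sketch of this two-stage procedure in Keras; `m`, the data tensors and the stage-two learning rate of \(10^{-4}\) are assumptions (the text only specifies that it is lower):

```python
from tensorflow.keras import layers, models, optimizers, callbacks
from tensorflow.keras.applications import VGG16

# Stage 1: frozen ImageNet-pretrained convolutional base, trainable classification head.
base = VGG16(weights="imagenet", include_top=False,
             pooling="max", input_shape=(224, 224, 3))   # 512-dim max-pooled features
base.trainable = False
x = layers.Dropout(0.2)(base.output)                     # dropout at the head's input
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(m, activation="sigmoid")(x)           # BR head, Eq. (4)
model = models.Model(base.input, out)
early = callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.compile(optimizer=optimizers.Adam(1e-3), loss="binary_crossentropy")
model.fit(X_train, Y_train, validation_data=(X_val, Y_val), callbacks=[early])

# Stage 2: unfreeze everything and fine-tune with a lower learning rate.
base.trainable = True
model.compile(optimizer=optimizers.Adam(1e-4), loss="binary_crossentropy")
model.fit(X_train, Y_train, validation_data=(X_val, Y_val), callbacks=[early])
```

The GFM\(_{\mathrm {MR}}\) and GFM\(_{\mathrm {OR}}\) variants are obtained by swapping the sigmoid head for the heads sketched in Sect. 2.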

Table 1. Summary statistics for the four datasets. m is the number of labels, \(s_m\) the maximum number of labels attributed to a single instance in the training data.

3.2 Experimental Results

The results of the conducted experiments are presented in Table 2. As expected, BR without any attempt at maximizing the F\(_\beta \)-measure leads to the worst performance in almost all cases. Rather surprisingly, BR\(_{\mathrm {t}}^{\mathrm {avg}}\) performs worse than BR for the F\(_1\)-measure in several cases, meaning that the average optimal threshold over the training instances is not better than a plain threshold of 0.5. This is especially true for the Planet dataset, which has the smallest m and an imbalanced label distribution. Conversely, this does not occur for the COCO dataset, where, due to a larger number of labels, more candidate thresholds are considered for each instance by Algorithm 1. BR\(_{\mathrm {t}}^{\mathrm {glob}}\) consistently outperforms both BR and BR\(_{\mathrm {t}}^{\mathrm {avg}}\). This is as expected, since BR\(_{\mathrm {t}}^{\mathrm {glob}}\) considers all \(m \times n\) predicted marginal probabilities as candidates. However, this comes at the cost of a higher time complexity, as discussed in Sect. 2.

The performance of BR\(_{\mathrm {t}}^{\mathrm {stack}}\) varies across datasets and seems to depend on whether F\(_1\) or F\(_2\) is the measure of interest. For F\(_1\) it performs substantially worse than BR\(_{\mathrm {t}}^{\mathrm {glob}}\), whereas it even becomes competitive with the decision-theoretic approaches for F\(_2\). Figure 1 gives further insights into the behavior of BR\(_{\mathrm {t}}^{\mathrm {stack}}\). It shows, for training data, the empirical distribution of instance-wise optimal thresholds obtained by Algorithm 1, which we consider here as the ground truth, together with the thresholds predicted by BR\(_{\mathrm {t}}^{\mathrm {stack}}\). One can observe that for all four datasets the two distributions differ substantially, indicating that the threshold stacking method is not always capable of predicting a good threshold. The dotted line indicates the mean optimal instance-wise threshold, i.e., the threshold that BR\(_{\mathrm {t}}^{\mathrm {avg}}\) applies at test time.

More generally, the decision-theoretic approaches seem to outperform the thresholding methods on all datasets. The GFM algorithm, which is the only algorithm that does not require the assumption of label independence, is the best algorithm in all but one setting. In almost all cases the proportional odds model outperforms the multinomial regression model, which might indicate that the ordinality of \(s^{\varvec{y}}\) is a valid assumption. However, the differences between both methods are small, so the benefit of the more parsimonious model structure is limited. In addition, the LFP method also yields rather good results. We therefore hypothesize that for the analyzed datasets the dependence among the labels is not very strong. Moreover, even though LFP assumes independence, it requires fewer parameters than the GFM methods.

Fig. 1.

Empirical distributions of instance-wise thresholds obtained by Algorithm 1 (blue, optimal thresholds), as well as the thresholds predicted by BR\(_{\mathrm {t}}^{\mathrm {stack}}\) (red, predicted thresholds). Here, the thresholds for training data are shown, and F\(_2\) is the performance measure. The thresholds are obtained by using the convolutional part of a high-quality pre-trained VGG16 architecture (R\(^{2}\) indicates the quality of the predictions). The dotted line indicates the mean optimal instance-wise threshold, which is returned by BR\(_{\mathrm {t}}^{\mathrm {avg}}\) after training. See main text for more details. (Color figure online)

Table 2. Comparison of the different methods, with training strategy described in Sect. 3.1.

4 Conclusion

In this article we introduced extensions of utility maximization and decision-theoretic methods that can optimize the F\(_\beta \)-measure with (convolutional) neural networks. We discussed the pros and cons of the different methods and presented experimental results on several image classification datasets. The results illustrate that decision-theoretic inference algorithms are worth the investment: while being more difficult to implement than thresholding strategies, they lead to superior predictive performance. This is a surprising result, given the popularity of thresholding in deep neural networks. For most datasets, the thresholding strategies performed remarkably poorly, and substantial differences were also observed among the different ways of defining a threshold. Overall, the best performance was obtained with an exact decision-theoretic method based on proportional odds models. This is interesting, because this method is at the same time the most novel among the methods analyzed in this paper.