# Efficient \(F\) measure maximization via weighted maximum likelihood


## Abstract

The classification models obtained via maximum likelihood-based training do not necessarily reach the optimal \(F_\beta \)-measure for some user’s choice of \(\beta \) that is achievable with the chosen parametrization. In this work we link the weighted maximum entropy and the optimization of the expected \(F_\beta \)-measure, by viewing them in the framework of a general common multi-criteria optimization problem. As a result, each solution of the expected \(F_\beta \)-measure maximization can be realized as a weighted maximum likelihood solution within the maximum entropy model - a well understood and behaved problem for which standard (off the shelf) gradient methods can be used. Based on this insight, we present an efficient algorithm for optimization of the expected \(F_\beta \) using weighted maximum likelihood with dynamically adaptive weights.

## Keywords

Maximum Entropy · Acceptance Threshold · Maximum Entropy Model · Weighted Likelihood · Brute Force Approach

## 1 Introduction and related work

The F measure is a common tool for expressing a trade-off between precision and recall in binary classification, which might be specific to each application. Optimizing the F-measure-based performance of a classifier has a long history and many applications in text analysis. High precision is often needed in the analysis of clinical records, job boards, automatic summarization, machine translation etc., whereas high recall is usually preferred when the output of the system will be used in information retrieval applications, as an input to other text analysis components like relation extraction (Georgiev et al. 2009), coreference (Ganchev et al. 2007; Carpenter 2007), or software-aided curation and annotation (Ganchev et al. 2007).

Maximum-likelihood-based binary classifiers such as maximum entropy are relatively easy to fit, but they are rigid and cannot be tuned to a desired precision and recall trade-off. In this work, we show that the well-known weighted MaxEnt (Vandev and Neykov 1998), whose estimation is barely harder than in the standard case without weights, can be used to optimize the \(F_\beta \) measure; see also Simecková (2005) for an interesting discussion of weighted maximum likelihood. The main statement of the paper is that if appropriate weights are chosen, then the fitted maximum weighted likelihood model coincides with the optimal expected \(F_\beta \) model for the binary classification task. The weighted likelihood has at least two major advantages as a loss function: (i) it is concave and (ii) standard (off the shelf) gradient methods can be used for its optimization. To the best of our knowledge, such a link between the weighted maximum likelihood and \(F_\beta \) maximization has not been established before. The article focuses on the intuition behind the relation and the proof of the main result. The value of our theoretical observation is that it establishes the methodology of viewing a particular model as a specific solution of a common multi-criteria optimization problem. Furthermore, we give a handy expression for the optimal weights and propose and test an efficient algorithm for the maximization of the expected \(F_\beta \) measure. Obviously, if the assumed model is such that the (weighted) maximum likelihood estimation is computationally inaccessible, unlike the maximum entropy framework considered in this article, there is little gain in casting the hard problem of \(F_\beta \) maximization into a possibly even harder maximum likelihood problem.

This article is organized as follows: in Sect. 2 we briefly introduce the weighted maximum entropy model. In Sect. 3, the \(F_\beta \) measure as a trade-off of precision and recall is presented and the expected \(F_\beta \) is defined (see also Nan et al. 2012). In Sect. 4, we establish the link between the optimal expected \(F_\beta \) and a weighted maximum entropy model. Sections 5 and 6 present a method for estimating weights that afford optimal expected \(F_\beta \) and a corresponding algorithm. We describe our approach for evaluation of the algorithm in Sect. 7. In Sect. 8 we introduce the datasets that we used for experiments. Section 9 presents the results and Sects. 10 and 11 conclude the article with comments and a discussion.

*Related work* One of the most popular heuristics for precision-recall trade-off is based on adjusting the acceptance threshold given by maximum entropy models (or any other model that estimates class posterior probabilities). However, this procedure amounts to a simple translation of the maximum likelihood hyperplane towards or away from the target class and does not fit the model anew. Thresholding on posterior probabilities has been used in the context of other learning frameworks, such as Conditional Random Fields (CRFs) (Culotta 2004).

Minkov et al. (2006) introduce another heuristic, based on changing the weight of a special feature that indicates whether a sample is in the sought-after class.

Jansche (2005) describes a maximum entropy model that directly optimizes an expected \(F_\beta \)-based loss by means of gradient ascent. However, the expected \(F_\beta \) is not concave and is rather cumbersome to deal with. Therefore, standard gradient methods do not guarantee optimality of the solution.

Approaches for training CRFs (Lafferty 2001) to directly optimize multivariate evaluation measures, including non-linear measures such as the \(F\)-score (Suzuki et al. 2006), have been proposed recently.

Joachims (2005) proposes a multivariate SVM capable of optimizing an upper bound of a class of rather general nonlinear performance measures including \(F_\beta \). The obtained optimization problem is of exponential complexity, but the solution can be approximated in polynomial time using a heuristic algorithm.

Our approach is somewhat different in nature - instead of directly optimizing the nonlinear performance measure, we show that the optimizer can be represented as a maximizer of a weighted version of a traditional “linear” performance measure, i.e. one that decomposes into a sum of performance measures over each training example. Needless to say, we also keep the parametrization of the conditional probabilities and in fact optimize the expected \(F_\beta \) measure rather than an upper bound.

A general algorithm for \(F\)-measure optimization is given in Dembczyński et al. (2011); however, it relies on known data distributions, which is an unrealistic requirement in practice. A very interesting result in Dembczyński et al. (2011) is that there is a lower bound on the discrepancy between the optimal solution and the solution obtained by means of an optimal acceptance threshold.

In Nan et al. (2012), the authors systematically classify the existing approaches for \(F_\beta \) maximization into two groups: empirical utility maximization (EUM) and decision-theoretic approaches (DTA). The authors show that EUM-optimal and DTA-optimal algorithms are asymptotically equivalent and also comment on the practical advantages and disadvantages of the two classes of approaches. When considering the EUM approaches, where in general our framework resides, they essentially restrict attention to approaches where a score function (e.g. a parametric model for the conditional probabilities) is learned and only the acceptance threshold is obtained by directly optimizing the F-measure. The expected \(F\) measure \(\tilde{F}\) is also considered in Nan et al. (2012), where its consistency is stated and even a Hoeffding bound for the convergence is given.

## 2 The maximum entropy model

The maximum entropy modeling framework as introduced in the NLP domain in Berger et al. (1996) has become standard for various NLP tasks. To fix notations, consider a training set of \(m\) examples \(\{(x_i,y_i): i = 1,\dots, m\}\) where the \(x_i\)’s are the attributes and the \(y_i\)’s are the classes taking values in some finite set \(\mathcal {Y}\). Each observation is represented by a set of \(N\) features \(\{ f_j(x_i, y_i) : j = 1,\dots , N \} \).

In what follows we will always assume that the underlying log-likelihood function has a maximum. A possible exception is when the training set is separable, in which case the log-likelihood converges to zero while the parameters diverge. The same holds for the weighted log-likelihood as long as the weights are strictly positive, which will be the case in this paper.

To see that the weighted log-likelihood indeed has a maximum when the training set is not separable, we refer again to the interpretation of the weights as a modification of the training set. If the weights are integers, then the weighted log-likelihood is in fact the standard log-likelihood for the modified training set in which each training example \((x_i, y_i)\) is duplicated \(w_i\) times. Hence the weighted log-likelihood attains its maximum if the modified training set is not separable, which is the case exactly when the original set is not separable. The case of rational weights can be reduced to the integer case by scaling the log-likelihood with an appropriate integer, and the general case of real weights can be tackled by approximating it with weighted log-likelihoods having rational weights. Hence, as long as the weights are strictly positive, the non-separability of the training set yields the existence of the unique maximum of the weighted log-likelihood.

The case when the training set is separable is not relevant for our considerations here because in that case any separating plane results in a maximal \(F_\beta \) measure for any \(\beta \).
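The duplication argument can be checked numerically. Below is a minimal pure-Python sketch assuming a binary logistic parametrization \(p(u\,|\,x,\lambda )\); the data, labels and parameter values are illustrative only, not taken from the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(data, lam, weights=None):
    """Weighted conditional log-likelihood  sum_i w_i * log p(y_i | x_i, lam)
    for a binary logistic model with p(u | x) = sigmoid(lam . x)."""
    if weights is None:
        weights = [1.0] * len(data)
    total = 0.0
    for (x, y), w in zip(data, weights):
        p_u = sigmoid(sum(l * xj for l, xj in zip(lam, x)))
        total += w * math.log(p_u if y == "u" else 1.0 - p_u)
    return total

# Integer weights are equivalent to duplicating training examples:
data = [((1.0, 0.5), "u"), ((0.2, 1.0), "ubar"), ((0.9, 0.1), "u")]
weights = [3, 1, 2]
lam = (0.4, -0.7)
duplicated = [ex for ex, w in zip(data, weights) for _ in range(w)]

assert abs(log_likelihood(data, lam, weights)
           - log_likelihood(duplicated, lam)) < 1e-9
```

The equality holds for every \(\lambda \): integer weights and example duplication define the same objective, so the two problems share their maximizers.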

In this paper we are aiming at optimizing the \(F\) measure which is only defined in the binary classification case. Therefore, for the rest of the paper we restrict discussion to the binary classification case with \(|\mathcal Y| = 2\), i.e. the case where we have only two classes, which we will denote with \(u\) and \(\bar{u}\). Hence, from now on \(y_i \in \mathcal Y = \{u,\bar{u}\}\).

The extension to the multi-class case is possible but nontrivial and will be presented in a separate paper.

## 3 The precision/recall trade off expected F-measure

*Precision* and *Recall* for the class \(u\) (\(\bar{u}\) denotes the complementary class) are defined as \(P_u = A_u/(A_u + B_u)\) and \(R_u = A_u/(A_u + C_u)\), where \(A_u\), \(B_u\), \(C_u\) and \(D_u\) denote the counts of true positives, false positives, false negatives and true negatives for the class \(u\) on the training set.
*Expected F-measure:*The problem with the maximum posterior probability classification is that it only takes into account the index of the largest probability in the vector of model probabilities and completely disregards the rest. Instead we focus on the stochastic classifier which takes into account the whole information contained in the learned model. More precisely, an example with a given vector of attributes \(x\) is classified into the class \(y(x)\) which is randomly drawn from the set of all classes according to the model conditional probabilities \(p(y\,|\, x, \lambda )\). In this set-up the quantities \(A_u\), \(B_u\), \(C_u\) and \(D_u\) become random variables. However if we repeatedly perform the stochastic classification and average over the results we would get approximately their expected values over the training set which we denote with \(\tilde{A}_u\), \(\tilde{B}_u\), \(\tilde{C}_u\), \(\tilde{D}_u\) respectively:

For a large training set and a good model, the expected \(F_\beta \) measure on the training set will be close to the standard one since most of the model probabilities \(p(y_i\,|\, x_i, \lambda )\) will be close to \(1\) for the training examples.
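The expected counts and the resulting expected \(F_\beta \) can be computed directly from the model probabilities. The sketch below uses the standard identity \(F_\beta = (1+\beta ^2)A/((1+\beta ^2)A + \beta ^2 C + B)\); the function and variable names are ours, for illustration only:

```python
def expected_counts(probs_u, labels, target="u"):
    """Expected confusion counts of the stochastic classifier, where
    probs_u[i] = p(u | x_i, lambda) and labels[i] is the observed class."""
    A = sum(p for p, y in zip(probs_u, labels) if y == target)      # E[true positives]
    B = sum(p for p, y in zip(probs_u, labels) if y != target)      # E[false positives]
    C = sum(1 - p for p, y in zip(probs_u, labels) if y == target)  # E[false negatives]
    return A, B, C

def expected_f_beta(probs_u, labels, beta=1.0, target="u"):
    A, B, C = expected_counts(probs_u, labels, target)
    b2 = beta * beta
    return (1 + b2) * A / ((1 + b2) * A + b2 * C + B)

# A perfectly confident, perfectly correct model attains expected F_beta = 1:
labels = ["u", "ubar", "u", "ubar"]
assert expected_f_beta([1.0, 0.0, 1.0, 0.0], labels) == 1.0
```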

In the next section we will show that we can maximize the expected \(F_\beta \) measure by modifying the training set via adding weights and then again estimating the model parameters using the well-behaved and understood maximum likelihood procedure.

## 4 Achieving expected F-measure maximization via weighted maximum likelihood

Clearly, the log-likelihood (1) and the expected \(F_\beta \) (5) are different, though, one would hope, not orthogonal objectives.

Intuitively, every reasonable machine learning model would try to set the model parameters \(\lambda \) in such a manner that for all training examples the model conditional probabilities of the observed classes \(y_i\) given the example’s attributes \(x_i\), namely \( p(y_i\,|\, x_i, \lambda )\), are as large as possible. In general, if the model is not overfitting badly, it will not be possible for all conditional probabilities to be close to \(1\) simultaneously. Every particular model can be viewed as a specific method to implicitly handle these trade-offs. In this view, the crucial difference between the log-likelihood and the expected \(F_\beta \) measure, seen as objective functions, is that while the log-likelihood approach gives equal importance to all training examples on the logarithmic scale, the (expected) \(F_\beta \) measure has a parameter \(\beta \) controlling this trade-off on a class-wise scale. On the other hand, as noted in Jansche (2005), the flexibility in \(\tilde{F}_\beta \) comes at a price: \(\tilde{F}_\beta \) is by far not as nice a function to optimize as the log-likelihood. Fortunately, the \(\tilde{F}_\beta \) maximization can be viewed in terms of weighted likelihood maximization, which is only a slight generalization of standard maximum likelihood. To make the above discussion precise we will rely on some basic facts from the theory of multi-criteria optimization. For a thorough and up-to-date treatise on the topic see Ehrgott (2005).
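In symbols, the weighted log-likelihood is the linear scalarization of the multi-criteria problem whose objectives are the per-example log-probabilities; the display below is a sketch of this connection, not the paper's formal statement:

```latex
\max_{\lambda}\; \Bigl( \log p(y_1 \mid x_1, \lambda),\; \dots,\; \log p(y_m \mid x_m, \lambda) \Bigr)
\quad\longrightarrow\quad
\max_{\lambda}\; \sum_{i=1}^{m} w_i \, \log p(y_i \mid x_i, \lambda),
\qquad w_i > 0 .
```

Classical multi-criteria results (e.g. Geoffrion 1968) relate the maximizers of such positive-weight scalarizations to the properly Pareto-optimal solutions of the vector problem, which is the kind of bridge exploited in Sect. 4.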

As a simple consequence we obtain that each expected \(F_\beta \) measure maximizer can be realized as a weighted maximum likelihood estimator and even approximated via a class-wise weighted maximum likelihood estimator. Later on, we elaborate on the optimal weights.

### **Proposition 1**

We first sketch the *idea of the proof:*

The maximum likelihood optimizes simultaneously the conditional probabilities

### *Proof*

## 5 The weights

Following the lines of the proof, for a given objective \(\beta \) we can in fact calculate the weights \(w(\beta )\) that allow us to compute the expected \(F_\beta \) maximizer as a \( w(\beta )\)-weighted maximum likelihood. We also present a class-wise approximation \(\bar{w}(\beta )\) of the instance-wise weights \( w(\beta )\).

### **Proposition 2**

### **Corollary 1**

## 6 Algorithm

The important takeaway from Sect. 4 is that each maximizer of the expected \(F_\beta \) measure can also be realized as a weighted maximum likelihood. This is very appealing since the log-likelihood is a well-behaved concave function. Furthermore, it is important to notice that Corollary 1 and Proposition 2 suggest that we can fall back to weights depending on a single parameter, and the qualitative behavior is obvious: larger weights \(w\) correspond to smaller \(\beta \) and vice versa.

One of the strengths of this algorithm is that it is straightforward to implement. The weighted log-likelihood maximization is analogous to the standard one with a different gradient involving the weights. The implementation then follows some off-the-shelf version of the gradient ascent algorithm.

*Convergence of the algorithm:* In the following we show that, provided the learning rate for the full-batch gradient ascent is small enough, each step of the instance-wise version of the above algorithm improves the attained value of the expected \(F_\beta \) measure. Hence, if the learning rate decreases appropriately, the algorithm converges.

- 1.
One approach would be to consider also versions of the above algorithm where at the \(n\)-th iteration \(k_n\ge 1\) steps of the gradient ascent algorithm for the weighted log-likelihood function are performed, rather than just a single one as presented in Step 4 of Algorithm 1. As long as \(k_n=1\) is satisfied for all \(n\) greater than some large \(N\) we still get convergence. However, the earlier phases with \(k_n>1\) might result in escaping regions with suboptimal local maxima of \(\tilde{F}_\beta \). While this is a topic for further research, our numerical experiments did indicate that Algorithm 1 tends to converge faster if \(k_n > 1\) at least during an initial burn-in phase. Furthermore, our experiments also suggested that the convergence of the algorithm is improved by replacing the expected \(F_\beta \) measure in the expression for the weights \(w_n\) (\(\bar{w}_n\) respectively) in Step 3 of the Algorithm with the true \(F_\beta \) measure.

- 2.
Another approach, which we call *“targeting”*, is as follows: one fixes a rather large target value \(T\) for the achieved \( \tilde{F}_\beta (\hat{\lambda }_\beta )\) and in a burn-in phase runs the algorithm with this value, that is, uses \( \tilde{F}_\beta (\hat{\lambda }_\beta )= T\) instead of the proxy \(\tilde{F}_\beta (\hat{\lambda }_\beta ) \approx \tilde{F}_\beta (\hat{\lambda }_n)\). Following this phase the algorithm continues with the realized proxy \( \tilde{F}_\beta (\hat{\lambda }_n)\) as described in Algorithm 1. The idea behind *“targeting”* is obvious: we prescribe a target value \(T\) for \(\tilde{F}_\beta (\hat{\lambda }_\beta )\) and try to achieve it by fixing the weights to \( (w_n)_i = \frac{1- T}{T}p(y_i\,|\, x_i , \hat{\lambda }_n) 1\!\!1_{\{y_i = u\}} + p(y_i\,|\, x_i , \hat{\lambda }_n) 1\!\!1_{\{y_i \ne u\}}. \)
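As a rough pure-Python sketch of the adaptive scheme (Algorithm 1 itself is not reproduced in this excerpt, so the loop below only illustrates the idea), we alternate between recomputing instance-wise weights, here via the *targeting* expression above with a fixed \(T\), and a few gradient-ascent steps on the weighted log-likelihood of a binary logistic model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_adaptive(data, target_T=0.8, outer=20, inner=5, lr=0.1):
    """Illustrative adaptive-weights loop: recompute instance-wise weights from
    the current model (targeting expression with fixed T), then take `inner`
    gradient-ascent steps on the weighted log-likelihood of a binary logistic
    model p(u | x) = sigmoid(lam . x)."""
    dim = len(data[0][0])
    lam = [0.0] * dim
    for _ in range(outer):
        # Weight update: (w)_i = (1-T)/T * p(y_i | x_i) if y_i = u, else p(y_i | x_i).
        weights = []
        for x, y in data:
            p_u = sigmoid(sum(l * xj for l, xj in zip(lam, x)))
            p_yi = p_u if y == "u" else 1.0 - p_u
            weights.append((1 - target_T) / target_T * p_yi if y == "u" else p_yi)
        # Gradient ascent on the weighted log-likelihood (k_n = inner steps).
        for _ in range(inner):
            grad = [0.0] * dim
            for (x, y), w in zip(data, weights):
                p_u = sigmoid(sum(l * xj for l, xj in zip(lam, x)))
                err = (1.0 if y == "u" else 0.0) - p_u
                for j in range(dim):
                    grad[j] += w * err * x[j]
            lam = [l + lr * g for l, g in zip(lam, grad)]
    return lam

# Toy data: feature 1 separates the classes, feature 2 acts as a bias term.
data = [((1.5, 1.0), "u"), ((1.1, 1.0), "u"),
        ((-0.8, 1.0), "ubar"), ((-1.2, 1.0), "ubar")]
lam = fit_adaptive(data)
assert lam[0] > 0  # the discriminative coefficient comes out positive
```

Since every gradient contribution to the first coordinate is positive on this toy set, the assertion holds regardless of the burn-in length; the weight expression keeps all weights strictly positive, as required in Sect. 2.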

## 7 Baseline methods

We now proceed with the numerical evaluation of our results. We use two baseline methods in our experiments, with two different goals. First, in order to demonstrate that the algorithm presented in Sect. 6 identifies appropriate weights for the weighted log-likelihood leading to the true optimal \(\tilde{F}_\beta \) solution, we compare to a class-wise *brute-force* approach, which tries a large number of weights (\(50\) between \(0\) and \(1\)) for the target class and assigns a weight of \(1\) to the other class. For each \(\beta \), the brute-force baseline simply picks the weight value that results in the maximum \(F_\beta \); it thus performs an exhaustive search for the optimal target-class weight within some predefined range \([w_{min}, w_{max}]\). This baseline finds the weights for which the achieved expected \(F_\beta \) is optimal, and we demonstrate that the class-wise version of our algorithm achieves the same results without the computationally expensive brute-force search. We compare our algorithm to this baseline on the train set, since the model is trained on this set and the objective of the comparison is to show that our algorithm fits the training set as well as the brute-force approach but with much less computational effort.

The purpose of the second baseline is to compare the performance of our method to the popular approach for adjusting Precision and Recall based on varying the acceptance threshold of a simple maximum entropy model. We call this baseline *acceptance threshold*. The method modifies acceptance thresholds (w.r.t. target class) for the posterior probabilities in order to achieve higher precision (larger threshold) or recall (smaller threshold). For a given \(\beta \), the probability threshold that gives the best \(F_\beta \) is estimated on the train set and then the threshold is used for prediction on the test set. We use this baseline for comparison on the test set.
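The acceptance-threshold baseline can be sketched as a simple sweep over candidate thresholds; the grid and all names are illustrative, not the paper's implementation:

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta from confusion counts: (1+b^2)TP / ((1+b^2)TP + b^2 FN + FP)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0

def best_threshold(probs_u, labels, beta=1.0, grid=None):
    """Sweep acceptance thresholds on p(u | x) and keep the one that
    maximizes F_beta on the given (training) set."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    best_t, best_f = 0.5, -1.0
    for t in grid:
        tp = sum(1 for p, y in zip(probs_u, labels) if p >= t and y == "u")
        fp = sum(1 for p, y in zip(probs_u, labels) if p >= t and y != "u")
        fn = sum(1 for p, y in zip(probs_u, labels) if p < t and y == "u")
        f = f_beta(tp, fp, fn, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

probs = [0.9, 0.7, 0.6, 0.4, 0.2]
labels = ["u", "u", "ubar", "u", "ubar"]
t, f = best_threshold(probs, labels, beta=2.0)  # beta > 1 favours recall
assert 0 < t <= 0.4  # a recall-oriented beta drives the threshold down
```

The selected threshold is then frozen and applied to the test set, exactly as described for the baseline above.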

## 8 Data

We tested our algorithm on binary classification tasks on three different datasets, as follows:

### 8.1 Synthetic data - A:

### 8.2 Synthetic data - B:

We simulated a dataset of \(5000\) samples with two classes and two features. Each class contains \(2500\) samples, distributed as elliptical Gaussians in the space of features. The samples from class \(\bar{u}\) are distributed as \(\mathcal {N}(\mu _0, \Sigma _0)\), with \(\mu _0 = (2, 1)\) and \(\Sigma _0 = \mathrm{diag}(1, 0.3)\). Class \(u\) is generated by \(\mathcal {N}(\mu _1, \Sigma _1)\), with \(\mu _1 = (1, 2)\) and \(\Sigma _1 = \mathrm{diag}(0.3, 1)\). In Fig. 1b we visualize the synthetic data B, again using a subset. We used \(4500\) of the samples for training and \(500\) for testing.
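Dataset B can be regenerated along the following lines; reading the covariance matrices as diagonal is our interpretation of the notation above, and the seed and helper name are ours:

```python
import random

def make_synthetic_b(n_per_class=2500, seed=0):
    """Two elliptical Gaussian classes in 2-D, as described in the text:
    class ubar ~ N((2, 1), diag(1, 0.3)), class u ~ N((1, 2), diag(0.3, 1)).
    random.gauss takes the standard deviation, hence the square roots."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_class):
        data.append(((rng.gauss(2, 1.0), rng.gauss(1, 0.3 ** 0.5)), "ubar"))
    for _ in range(n_per_class):
        data.append(((rng.gauss(1, 0.3 ** 0.5), rng.gauss(2, 1.0)), "u"))
    rng.shuffle(data)
    return data[:4500], data[4500:]  # 4500 train / 500 test, as in the text

train, test = make_synthetic_b()
assert len(train) == 4500 and len(test) == 500
```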

### 8.3 Sanders Twitter sentiment corpus:

The Sanders Twitter sentiment corpus is a free, public dataset^{1} for training and testing sentiment analysis algorithms (Saif et al. 2013) on tweets. It consists of \(5513\) hand-classified tweets concerning four different topics, “Apple”, “Microsoft”, “Google” and “Twitter”, along with the sentiment with regard to the tweet’s topic: “positive”, “neutral”, “negative”, or “irrelevant”. The dataset consists of: tweet text; tweet creation date; hand-curated tweet topic; hand-curated sentiment label.

In this work we ignore the topic and focus on tweets expressing an attitude (positive and negative - class \(1\)) vs. impersonal ones (neutral and irrelevant - class \(2\)). This splits the dataset into 1224 tweets of class \(1\), which is the class of interest to us (class \(u\) in the notation of the theoretical part of the paper), and 4289 of class \(2\). The distribution of tweets with respect to topic and sentiment classes can be found in the set’s download package.

In our experiments we first split the tweets into tokens; then we applied stemming (Porter stemmer) and filtered out the stop words. We also performed some further normalization of the tweets: all explicit URLs and emails are rewritten as URL and email features; the @“words” and #“words” are rewritten as @tag and #tag, and “n’t” as “not”. Three or more “!” marks are rewritten as “STRONGEST”, two as “STRONGER” and a single one as “STRONG”, while four or more “?” marks are rewritten as “QQQQ+”, three as “QQQ”, and respectively two and one as “QQ” and “Q”. Finally we normalized the text to lower case, removed all non-alphanumeric characters and conjoined up to three consecutive tokens at the end of the pipeline. We end up with more than 45000 features, from which we selected 50 by evaluating the information gain of a feature with respect to the classes, as in Yang and Pedersen (1997).
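The rewriting rules above (excluding stemming, stop-word removal and token conjunction) can be sketched as a regular-expression pipeline; the rule order and exact token spellings are our reading of the text, and the inserted markers end up lowercased here because the cleanup step follows them:

```python
import re

def normalize_tweet(text):
    """Sketch of the tweet normalization rules described in the text."""
    text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)   # explicit URLs
    text = re.sub(r"\S+@\S+\.\S+", " email ", text)          # email addresses
    text = re.sub(r"@\w+", " @tag ", text)                   # @"words"
    text = re.sub(r"#\w+", " #tag ", text)                   # #"words"
    text = re.sub(r"n't\b", " not", text)                    # "n't" -> "not"
    text = re.sub(r"!{3,}", " STRONGEST ", text)             # !!!+ -> STRONGEST
    text = re.sub(r"!!", " STRONGER ", text)
    text = re.sub(r"!", " STRONG ", text)
    text = re.sub(r"\?{4,}", " QQQQ ", text)  # the text writes "QQQQ+"; the '+'
    text = re.sub(r"\?\?\?", " QQQ ", text)   # would be stripped below anyway
    text = re.sub(r"\?\?", " QQ ", text)
    text = re.sub(r"\?", " Q ", text)
    text = text.lower()
    # Keep @, # and _ so the marker tokens and conjoined tokens survive.
    text = re.sub(r"[^a-z0-9@#_ ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

assert normalize_tweet("Love it!!! @apple http://t.co/x") == "love it strongest @tag url"
```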

The corpus is automatically shuffled before each experiment: \(90~\%\) of all tweets are used for training and \(10~\%\) for testing. The results with this dataset are an average over 10 experiments. The following is a list of the top 10 features for the Sanders corpus after the feature selection procedure (obscene words were removed):

1. QQ 2. STRONG 3. @tag 4. iphone 5. URL 6. ios 7. i_m 8. URL_#tag 9. customer 10. love.

### 8.4 Apple Twitter corpus:

The corpus consists of freely available tweets^{2} that mention the word “apple”. The corpus is created with the goal of distinguishing tweets that discuss “Apple, Inc.” (class \(1\)) from tweets about “apple pie” and “apple juice” (class \(2\)). There are \(2000\) tweets in total, \(1306\) of which are class \(1\) and \(694\) class \(2\).

Tweets are tokenized, stemmed and stop words are filtered out. We again rewrite all explicit URLs and emails as URL and email features, and the @“words” and #“words” as @tag and #tag. Finally we transform the text to lower case, remove all non-alphanumeric characters and conjoin up to three consecutive tokens at the end of the transformation pipeline.

After these transformations we obtain \(18462\) features, from which we selected \(276\) via the same feature selection routine as for the Sanders Twitter corpus. Below is a list of the top 10 features for the Apple Twitter corpus after the feature selection procedure (obscene words are again removed): 1. iphone 2. apple_juice 3. mac 4. ipod 5. ipad 6. apple_pie 7. iphone_5 8. apple_cider 9. cider 10. patent.

### 8.5 Press association (PA) data:

The “PA” dataset was developed by the Press Association^{3} to enable the implementation of a system for recognition and semantic disambiguation of named entities in press releases. Given certain metadata for a number of overlapping candidate entities, an array of features derived from the textual context of their occurrence, and additional document-level metadata, the model is trained to recognize which (if any) of the candidate entities is the one referenced in the text.

The corpus is annotated with respect to people, organization and location mentions; a special “negative” label denotes the candidates that can be considered irrelevant in the given context. We remove non-location entities, thus reducing the problem to a binary classification task, and conduct feature selection to keep \(\sim \)10 % of the originally extracted features. We split the corpus into a training set (\(2369\) documents) and a held-out test set (\(160\) documents). As a result of this preprocessing, we have 2 classes (“Location” and “Negative”), \(48773\) instances, a target to irrelevant instance counts ratio of \(0.17\), and \(4640\) features.

## 9 Experiments

We first compare our algorithm to the *brute-force* baseline, which finds the optimal solution using an exhaustive search that is computationally very demanding. The objective is to demonstrate that our algorithm finds the optimal solution with far fewer iterations. The results clearly show that on the synthetic datasets (train) our *class-wise* version of the algorithm is indeed very close to the *brute-force* baseline. However, the brute-force method requires training \(50\) weighted maximum entropy models, whereas our approach needs only one. The *instance-wise* approach ensures a better adaptation to the data and therefore achieves even higher \(F_\beta \) values than the class-wise *brute-force*.

Additionally, we have added the \(F_\beta \) as achieved by the standard rigid maximum entropy model to demonstrate the scale of improvement that can be achieved with a dedicated \(F_\beta \) optimizer. The comparison to the standard MaxEnt clearly underlines the importance of targeted \(F_\beta \) optimization.

The instance-wise version of the algorithm achieves a significant improvement of the \(F_\beta \) measure compared to the acceptance threshold baseline. The improvement of the class-wise version, however, is marginal.

Following the observations on the two simulated datasets, we believe that the Press Association and the Sanders Twitter data, and perhaps a good portion of NLP data in general, are distributed in a non-trivial way in the space of features, such that adaptive approaches for Precision-Recall trade-off like the one we are proposing are indeed useful. Clearly, as the experiments on the Apple Twitter corpus show, the improvement over the acceptance threshold depends on the dataset as well as on the used features and might not always be substantial. However, given that our approach comes at virtually no additional computational and implementation cost and is guaranteed at least not to underperform compared to the acceptance threshold adjustment (on the training dataset), we believe that it is a good choice for \(F_\beta \) optimization.

## 10 Limits and merits of the weighted maximum entropy

In this section we compare the class-wise weighted maximum entropy and the acceptance threshold method with the help of the two stylized artificial datasets A and B shown in Fig. 1.

The acceptance threshold corresponds to a translation of the separating hyperplane obtained by the standard maximum entropy model. It is rather clear that with translation we can achieve an optimal Precision/Recall trade-off for the synthetic dataset A. Indeed on Fig. 3a one sees that the acceptance threshold and the weighted maximum entropy do result in similar optimal \(F_\beta \) values on the test set. On the train set the acceptance threshold and our algorithms generate virtually the same optimal \(F_\beta \) values.

The optimal Precision/Recall trade-off for the synthetic dataset B, however, requires an additional rotation/tilting of the separating hyperplane that cannot be produced by adjusting the acceptance threshold. In line with this intuition, Figs. 3b and 7b, for the test and train sets respectively, demonstrate that the weighted likelihood settles at considerably better Precision-Recall pairs and consequently results in larger \(F_\beta \) values on the test and the train sets.

Clearly, in the general case the optimal adjustment of the separating plane is expected to have a rotation component that is inaccessible by simply adjusting the acceptance threshold. Moreover, as shown in the previous section, the instance-wise weighted maximum entropy also outperforms the class-wise approximation.

## 11 Conclusions and future work

The main result of the paper is that the weighted maximum likelihood and the expected \(F_\beta \) measure are simply two different ways to specify a particular trade-off between the objectives of the same multi-criteria optimization problem (MOP). As a consequence, each expected \(F_\beta \) maximizer can be realized as a weighted maximum likelihood estimator and approximated via a class-wise weighted maximum likelihood estimator.

The difficulty in exploiting the statement of Proposition 1 lies in the fact that it is not a priori clear how to choose the weights \(w(\beta )\) for a given \(\beta \).

We presented a theoretical result giving the optimal \(w(\beta )\) as well as an efficient algorithm for maximizing the \(\tilde{F}_\beta \) by adaptively determining the right weights. We have tested the algorithm on various data sets and the experiments suggest that it indeed finds the optimal weights and achieves \(\tilde{F}_\beta \) maximization. Furthermore we have compared the algorithm to the optimal threshold baseline showing the superiority of our approach.

The presented results can be generalized to the regularized and multi-class cases; this generalization will be presented in a forthcoming article. Lastly, the proposed approach of viewing a broad class of probabilistic learning schemes based on optimizing some objective function as a specific trade-off between the underlying conditional probabilities, and thus linking them to a weighted maximum likelihood model, can be applied beyond the specific set-up of maximum entropy and the \(F_\beta \) objective. In particular, the intuition via modifying the original dataset and then using the standard model on the new dataset, as explained above, can obviously be applied to a very broad range of machine learning models.

## Acknowledgments

The work, described in this paper, is supported by the FP7-ICT Strategic Targeted Research Project PHEME (No. 611233).

## References

- Berger, A., Della Pietra, V., & Della Pietra, S. (1996). A maximum entropy approach to natural language processing. *Computational Linguistics*, 22(1), 39–71.
- Carpenter, B. (2007). LingPipe for 99.99 % recall of gene mentions. In: Proceedings of the 2nd BioCreative workshop.
- Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. *Journal of Machine Learning Research*, 7, 551–585.
- Culotta, A. (2004). Confidence estimation for information extraction. In: Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL).
- Dembczyński, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2011). An exact algorithm for F-measure maximization. In: Neural Information Processing Systems (NIPS 2011). Neural Information Processing Systems Foundation.
- Ehrgott, M. (2005). *Multi criteria optimization*. New Jersey: Springer.
- Ganchev, K., Crammer, K., Pereira, F., Mann, G., Bellare, K., McCallum, A., Carroll, S., Jin, Y., & White, P. (2007). Penn/UMass/CHOP BioCreative II systems. In: Proceedings of the Second BioCreative Challenge Evaluation Workshop.
- Ganchev, K., Pereira, O., Mandel, M., Carroll, S., & White, P. (2007). Semi-automated named entity annotation. In: Proceedings of the Linguistic Annotation Workshop, Prague, Czech Republic. Association for Computational Linguistics.
- Geoffrion, A. (1968). Proper efficiency and the theory of vector maximization. *Journal of Mathematical Analysis and Applications*, 22, 618–630.
- Georgiev, G., Ganchev, K., Momtchev, V., Peychev, D., Nakov, P., & Roberts, A. (2009). Tunable domain-independent event extraction in the MIRA framework. In: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, BioNLP ’09, pp. 95–98.
- Jansche, M. (2005). Maximum expected F-measure training of logistic regression models. In: HLT ’05, Association for Computational Linguistics, Morristown, NJ, USA, pp. 692–699.
- Joachims, T. (2005). A support vector method for multivariate performance measures. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 377–384. ACM Press.
- Klinger, R., & Friedrich, C. M. (2009). User’s choice of precision and recall in named entity recognition. In: Proceedings of the International Conference RANLP-2009, pp. 192–196.
- Lafferty, J. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. pp. 282–289.
- Minkov, E., Wang, R., Tomasic, A., & Cohen, W. (2006). NER systems that suit user’s preferences: adjusting the recall-precision trade-off for entity extraction. In: Proceedings of NAACL, pp. 93–96.
- Nan, Y., Chai, K. M. A., Lee, W. S., & Chieu, H. L. (2012). Optimizing F-measure: A tale of two approaches. In: ICML. http://dblp.uni-trier.de/db/conf/icml/icml2012.html#NanCLC12
- Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
- Simecková, M. (2005). Maximum weighted likelihood estimator in logistic regression.
- Suzuki, J., McDermott, E., & Isozaki, H. (2006). Training conditional random fields with multivariate evaluation measures. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL-44), pp. 217–224.
- Vandev, D. L., & Neykov, N. M. (1998). About regression estimators with high breakdown point. *Statistics*, 32, 111–129. http://www.informaworld.com/10.1080/02331889808802657
- Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, pp. 412–420. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. http://dl.acm.org/citation.cfm?id=645526.657137