A comparison of filtering evaluation metrics based on formal constraints


Although document filtering is simple to define, a wide range of evaluation measures have been proposed in the literature, all of which have been subject to criticism. Our goal is to compare metrics from a formal point of view, in order to understand whether, why and when each metric is appropriate, and to achieve a better understanding of the similarities and differences between metrics. Our formal study leads to a typology of measures for document filtering which is based on (1) a formal constraint that must be satisfied by any suitable evaluation measure, and (2) a set of three (mutually exclusive) formal properties which help to understand the fundamental differences between measures and to determine which ones are more appropriate depending on the application scenario. As far as we know, this is the first in-depth study on how filtering metrics can be categorized according to their appropriateness for different scenarios. Two main findings derive from our study. First, not every measure satisfies the basic constraint, but problematic measures can be adapted using smoothing techniques that make them compliant with the basic constraint while preserving their original properties. Our second finding is that all metrics (except one) can be grouped into three families, each satisfying one of three mutually exclusive formal properties. In cases where the application scenario is clearly defined, this classification of metrics should help in choosing an adequate evaluation measure. The exception is the Reliability/Sensitivity metric pair, which does not fit into any of the three families, but has two valuable empirical properties: it is strict (i.e. a good result according to Reliability/Sensitivity ensures a good result according to all other metrics) and it is more robust than all other measures considered in our study.


  1. Or, more precisely, a quantity which can be mapped into a probability of relevance using some monotonically increasing function.

  2.

  3. For the sake of readability, we will speak of documents. However, our conclusions can be applied to any kind of items.

  4. Letter G is chosen for Gold standard.

  5. What our definition of the placebo baseline implies is that document filtering is an asymmetric process in terms of the positive/negative labels. This is implicit in most literature on the subject: for instance, precision and recall are assumed by default to be computed on the relevant class.

  6. The formula assumes that both \(e_\mathcal{G}\) and \(e_{\lnot \mathcal{G}}\) did not already belong to \(\mathcal{S}\).

  7. Note that if the measure also satisfies the monotonicity axiom, this constant will be low.

  8. For the sake of readability, we use here the traditional notation for the contingency matrix components.

  9. For easier comparison, the lam% scale has been reversed from 0 to 1.

  10. Initially we applied the Pearson coefficient. However, the results were not consistent, due to scaling issues (non-linear correlations).

  11. See Fang and Zhai (2014) for an extensive discussion on the topic.

  12. We omit the proof; it is enough to solve the equation \(f'(x)=0\).


  1. Agresti, A., & Hitchcock, D. B. (2005). Bayesian inference for categorical data analysis: A survey. Technical report.

  2. Amigó, E., Artiles, J., Gonzalo, J., Spina, D., Liu, B., & Corujo, A. (2010). WePS3 evaluation campaign: Overview of the on-line reputation management task. In 2nd Web people search evaluation workshop (WePS 2010), CLEF 2010 conference, Padova Italy.

  3. Amigó, E., Corujo, A., Gonzalo, J., Meij, E., & de Rijke, M. (2012). Overview of RepLab 2012: Evaluating online reputation management systems. In CLEF (online working notes/labs/workshop).

  4. Amigó, E., Fang, H., Mizzaro, S., & Zhai, C. (2017). Axiomatic thinking for information retrieval and related tasks. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, SIGIR ’17, pp. 1419–1420, New York, 2017. ACM.

  5. Amigó, E., Gonzalo, J., Artiles, J., & Verdejo, F. (2009). A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4), 461–486.

  6. Amigó, E., Gonzalo, J., & Verdejo, F. (2013). A generic measure for document organization tasks. In Proceedings of ACM SIGIR, pp. 643–652. ACM Press.

  7. Amigó, E., Spina, D., & Carrillo-de-Albornoz, J. (2018). An axiomatic analysis of diversity evaluation metrics: Introducing the rank-biased utility metric. In CoRR, abs/1805.02334.

  8. Androutsopoulos, I., Koutsias, J., Chandrinos, K., Paliouras, G., & Spyropoulos, C. D. (2000). An evaluation of naive Bayesian anti-spam filtering. In CoRR, cs.CL/0006013.

  9. Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30, 1145–1159.

  10. Busin, L., & Mizzaro, S. (2013). Axiometrics: An axiomatic approach to information retrieval effectiveness metrics. In Proceedings of the 2013 conference on the theory of information retrieval, ICTIR ’13, pp. 8:22–8:29, New York, NY, 2013. ACM.

  11. Callan, J. (1996). Document filtering with inference networks. In Proceedings of the nineteenth annual international ACM SIGIR conference on research and development in information retrieval, pp. 262–269.

  12. Caruana, R., & Niculescu-Mizil, A. (2005). An empirical comparison of supervised learning algorithms using different performance metrics. In Proceedings of 23rd international conference machine learning (ICML06), pp. 161–168.

  13. Clinchant, S., & Gaussier, E. (2011). Is document frequency important for PRF? In Advances in information retrieval theory, pp. 89–100. Springer.

  14. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37.

  15. Cormack, G., & Lynam, T. (2005). TREC 2005 spam track overview. In Proceedings of the fourteenth text retrieval conference (TREC 2005).

  16. Fang, H. (2008). A re-examination of query expansion using lexical resources. In ACL, vol. 2008, pp. 139–147. Citeseer.

  17. Fang, H., Tao, T., & Zhai, C. X. (2004). A formal study of information retrieval heuristics. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 49–56. ACM.

  18. Fang, H., & Zhai, C. X. (2006). Semantic term matching in axiomatic approaches to information retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, pp. 115–122. ACM.

  19. Fang, H., & Zhai, C. X. (2014). Axiomatic analysis and optimization of information retrieval models. In Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval, SIGIR ’14, pp. 1288–1288, New York, NY, 2014. ACM.

  20. Fawcett, T., & Niculescu-Mizil, A. (2007). PAV and the ROC convex hull. Machine Learning, 68, 97–106.

  21. Ferri, C., Hernández-Orallo, J., & Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30(1), 27–38.

  22. Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society Series B (Methodological), 14, 107–114.

  23. Hedin, B., Tomlinson, S., Baron, J. R., & Oard, D. W. (2009). Overview of the TREC 2009 legal track.

  24. Hoashi, K., Matsumoto, K., Inoue, N., & Hashimoto, K. (2000). Document filtering method using non-relevant information profile. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’00, pp. 176–183, New York, NY, 2000. ACM.

  25. Hull, D. A. (1997). The TREC-6 filtering track: Description and analysis. Proceedings of the TREC, 6, 33–56.

  26. Hull, D. A. (1998). The TREC-7 filtering track: Description and analysis. In E. M. Voorhees and D. K. Harman, editors, Proceedings of TREC-7, 7th text retrieval conference, pp. 33–56, Gaithersburg, US, 1998. National Institute of Standards and Technology, Gaithersburg, US.

  27. Karimzadehgan, M., & Zhai, C. X. (2012). Axiomatic analysis of translation language model for information retrieval. In Advances in information retrieval, pp. 268–280. Springer, Berlin.

  28. Karon, B. P., & Alexander, I. E. (1958). Association and estimation in contingency tables. Journal of the American Statistical Association, 23(2), 1–28.

  29. Krishnamurthy, B., Gill, P., & Arlitt, M. (2008). A few chirps about twitter. In WOSP ’08: Proceedings of the first workshop on online social networks, pp. 19–24, New York, NY, 2008. ACM.

  30. Le, A., Ajot, J., Przybocki, M., & Strassel, S. (2010). Document image collection using Amazon’s mechanical turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s mechanical turk, pp. 45–52, Los Angeles, June 2010. Association for Computational Linguistics.

  31. Ling, C. X., Huang, J., & Zhang, H. (2003). AUC: A statistically consistent and more discriminating measure than accuracy. In IJCAI, pp. 519–526.

  32. Lv, Y., & Zhai, C. X. (2011). Lower-bounding term frequency normalization. In Proceedings of the 20th ACM international conference on information and knowledge management, CIKM ’11, pp. 7–16, New York, NY, 2011. ACM.

  33. Mitchell, T. M. (1997). Machine learning. New York: McGraw Hill.

  34. Persin, M. (1994). Document filtering for fast ranking. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’94, pp. 339–348, New York, NY, 1994. Springer, New York.

  35. Provost, F. J., & Fawcett, T. (1997). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Knowledge discovery and data mining, pp. 43–48.

  36. Qi, H., Yang, M., He, X., & Li, S. (2010). Re-examination on lam% in spam filtering. In Proceedings of the SIGIR 2010 conference, Geneva, Switzerland.

  37. Robertson, S., & Hull, D. A. (2001). The TREC-9 filtering track final report. In Proceedings of TREC-9, pp. 25–40.

  38. Schapire, R. E., Singer, Y., & Singhal, A. (1998). Boosting and Rocchio applied to text filtering. In Proceedings of ACM SIGIR, pp. 215–223. ACM Press.

  39. Sebastiani, F. (2015). An axiomatically derived measure for the evaluation of classification algorithms. In ICTIR, pp. 11–20.

  40. Sokolova, M. (2006). Assessing invariance properties of evaluation measures. In Proceedings of NIPS’06 workshop on testing deployable learning and decision systems.

  41. Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. AI 2006: Advances in artificial intelligence, pp. 1015–1021.

  42. Tversky, A. (1977). Features of similarity. Psychological Review, 84, 327–352.

  43. Van Rijsbergen, C. (1974). Foundation of evaluation. Journal of Documentation, 30(4), 365–373.


Funding was provided by Secretaría de Estado de Investigación, Desarrollo e Innovación, Ministerio de Economía, Industria y Competitividad, Gobierno de España (Grant No. TIN2015-71785-R, project Vemodalen).

Author information



Corresponding author

Correspondence to Julio Gonzalo.

Appendix: Formal proofs


Utility satisfies the Absolute Weighting property

The characteristic of Utility-based metrics in general, and accuracy in particular, is that they assign an absolute weight to relevant (versus non-relevant) documents in the output regardless of the output size. For instance, in the case of the Utility measure \(U_\alpha\), let \(\mathcal{S}_{\lnot i}\) and \(\mathcal{S}'_{\lnot i}\) be two non-informative outputs. For either of them, for example \(\mathcal{S}_{\lnot i}\):

$$\begin{aligned} U_\alpha (\mathcal{S}_{\lnot i})=\, & {} \alpha P(\mathcal{S}_{\lnot i}|\mathcal{G})P(\mathcal{G}) - P(\mathcal{S}_{\lnot i}|\lnot \mathcal{G})P(\lnot \mathcal{G})\\=\, & {} \alpha P(\mathcal{G}|\mathcal{S}_{\lnot i})P(\mathcal{S}_{\lnot i}) - P(\lnot \mathcal{G}|\mathcal{S}_{\lnot i})P(\mathcal{S}_{\lnot i})\\=\, & {} P(\mathcal{S}_{\lnot i}) (\alpha P(\mathcal{G})- P(\lnot \mathcal{G})) \end{aligned}$$

Therefore, if \(\alpha =\frac{P(\lnot \mathcal{G})}{P(\mathcal{G})}\) then the score of non-informative outputs is fixed (at zero). If \(\alpha >\frac{P(\lnot \mathcal{G})}{P(\mathcal{G})}\), the score of a non-informative output grows with its size; conversely, if \(\alpha <\frac{P(\lnot \mathcal{G})}{P(\mathcal{G})}\), it decreases with its size. In summary, the value of the \(\alpha\) parameter determines the relative score of two non-informative outputs. \(\square\)
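The three regimes above can be checked numerically. The following is a minimal sketch, not part of the original proof: the function name `utility_non_informative` and the 20% relevance prior are illustrative assumptions, and exact fractions are used so that the fixed case evaluates to exactly zero.

```python
from fractions import Fraction


def utility_non_informative(alpha, p_s, p_g):
    """U_alpha of a non-informative output that labels a fraction p_s as positive.

    Non-informativeness means P(S|G) = P(S|not G) = P(S), so
    U_alpha = alpha*P(S)*P(G) - P(S)*P(not G) = P(S)*(alpha*P(G) - P(not G)).
    """
    return alpha * p_s * p_g - p_s * (1 - p_g)


# Hypothetical collection: 20% of the documents are relevant.
p_g = Fraction(1, 5)
threshold = (1 - p_g) / p_g  # P(not G) / P(G)

# Three non-informative outputs of increasing size.
sizes = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]
for alpha in (threshold - 1, threshold, threshold + 1):
    scores = [utility_non_informative(alpha, p_s, p_g) for p_s in sizes]
    print(f"alpha={alpha}: {[str(s) for s in scores]}")
```

Below the threshold the scores decrease with the output size, at the threshold they are all equal, and above it they increase.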


Weighted Accuracy satisfies Absolute Weighting

Note that, although Accuracy can be considered a Utility-based measure, it does not directly satisfy the Absolute Weighting property, given that its definition does not include any parameter. However, the weighted accuracy proposed in Androutsopoulos et al. (2000) does satisfy this property, and it is a generalization of Accuracy (see proof in this section).

$$\begin{aligned} \text{ Weighted } \text{ Accuracy }(\mathcal{S}_{\lnot i})= & {} \frac{\lambda P(\mathcal{S}_{\lnot i}|\mathcal{G})P(\mathcal{G}) + P(\lnot \mathcal{S}_{\lnot i}| \lnot \mathcal{G})P(\lnot \mathcal{G})}{\lambda P(\mathcal{G}) + P(\lnot \mathcal{G})}\\= & {} \frac{\lambda P(\mathcal{S}_{\lnot i}|\mathcal{G})P(\mathcal{G}) + P(\lnot \mathcal{S}_{\lnot i})P(\lnot \mathcal{G})}{\lambda P(\mathcal{G}) + P(\lnot \mathcal{G})}\\= & {} \frac{\lambda P(\mathcal{S}_{\lnot i})P(\mathcal{G}) + (1-P(\mathcal{S}_{\lnot i}))P(\lnot \mathcal{G})}{\lambda P(\mathcal{G}) + P(\lnot \mathcal{G})}\\= & {} \frac{\lambda P(\mathcal{S}_{\lnot i})P(\mathcal{G}) + P(\lnot \mathcal{G})-P(\mathcal{S}_{\lnot i})P(\lnot \mathcal{G})}{\lambda P(\mathcal{G}) + P(\lnot \mathcal{G})} \end{aligned}$$

Differentiating with respect to \(P(\mathcal{S}_{\lnot i})\), we obtain:

$$\begin{aligned} \frac{\lambda P(\mathcal{G}) - P(\lnot \mathcal{G})}{\lambda P(\mathcal{G}) + P(\lnot \mathcal{G})}=\frac{(\lambda +1)P(\mathcal{G}) - 1}{\lambda P(\mathcal{G}) + P(\lnot \mathcal{G})} \end{aligned}$$

Therefore, the score of a non-informative output grows or decreases with its size depending on whether \(\lambda\) is larger or smaller than \(\frac{P(\lnot \mathcal{G})}{P(\mathcal{G})}\). For plain Accuracy (\(\lambda =1\)) the numerator reduces to \(2P(\mathcal{G})-1\). \(\square\)
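As a sanity check on this derivation, the sketch below evaluates the closed form of Weighted Accuracy for non-informative outputs and compares its slope in \(P(\mathcal{S}_{\lnot i})\) with the expression \(\frac{\lambda P(\mathcal{G}) - P(\lnot \mathcal{G})}{\lambda P(\mathcal{G}) + P(\lnot \mathcal{G})}\). The function names and the example prior are assumptions made for illustration only.

```python
from fractions import Fraction


def weighted_accuracy_non_informative(lam, p_s, p_g):
    """Weighted Accuracy of a non-informative output of relative size p_s.

    Uses P(S|G) = P(S) and P(not S|not G) = 1 - P(S).
    """
    p_not_g = 1 - p_g
    return (lam * p_s * p_g + (1 - p_s) * p_not_g) / (lam * p_g + p_not_g)


def slope(lam, p_g):
    """Derivative of the score above with respect to P(S)."""
    p_not_g = 1 - p_g
    return (lam * p_g - p_not_g) / (lam * p_g + p_not_g)


p_g = Fraction(1, 5)  # hypothetical prior of relevance
small, large = Fraction(1, 10), Fraction(9, 10)
for lam in (1, 4, 8):
    delta = (weighted_accuracy_non_informative(lam, large, p_g)
             - weighted_accuracy_non_informative(lam, small, p_g))
    # The score is linear in P(S), so the finite difference equals the slope.
    print(lam, delta / (large - small), slope(lam, p_g))
```

The sign of the slope is the sign of \(\lambda P(\mathcal{G}) - P(\lnot \mathcal{G})\); with \(\lambda = 1\) the function reduces to plain Accuracy and the slope is \(2P(\mathcal{G}) - 1\).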


Lam% satisfies Non-Informativeness Fixed Quality

Given a non informative output \(\mathcal{S}_{\lnot i}\), then:

$$\begin{aligned} lam\%(\mathcal{S}_{\lnot i})=logit^{-1}\left( \frac{logit(P(\mathcal{S}_{\lnot i}))+logit(P(\lnot \mathcal{S}_{\lnot i}))}{2}\right) \end{aligned}$$

But given that:

$$\begin{aligned} logit(P(\mathcal{S}_{\lnot i}))= & {} log\left( \frac{P(\mathcal{S}_{\lnot i})}{1-P(\mathcal{S}_{\lnot i})}\right) =log\left( \frac{1-P(\lnot \mathcal{S}_{\lnot i})}{P(\lnot \mathcal{S}_{\lnot i})}\right) \\= & {} -log\left( \frac{P(\lnot \mathcal{S}_{\lnot i})}{1-P(\lnot \mathcal{S}_{\lnot i})}\right) =-logit(P(\lnot \mathcal{S}_{\lnot i})) \end{aligned}$$

The two components in the numerator cancel out each other:

$$\begin{aligned} logit(P(\mathcal{S}_{\lnot i}))+logit(P(\lnot \mathcal{S}_{\lnot i}))=-logit(P(\lnot \mathcal{S}_{\lnot i}))+logit(P(\lnot \mathcal{S}_{\lnot i}))=0 \end{aligned}$$

Therefore, for any non-informative output \(\mathcal{S}_{\lnot i}\), the resulting score is fixed at \(logit^{-1}(0)=0.5\). \(\square\)
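The cancellation can also be observed numerically. This is a small sketch under the same assumption as the proof, namely that the two arguments of lam% are \(P(\mathcal{S}_{\lnot i})\) and \(1-P(\mathcal{S}_{\lnot i})\); the helper names are illustrative, and the output size must lie strictly between 0 and 1 for the logit to be defined.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Inverse of logit (the logistic function)."""
    return 1 / (1 + math.exp(-x))


def lam_non_informative(p_s):
    """lam% of a non-informative output that labels a fraction p_s as positive."""
    return inv_logit((logit(p_s) + logit(1 - p_s)) / 2)


for p_s in (0.1, 0.3, 0.5, 0.9):
    print(p_s, lam_non_informative(p_s))
```

Up to floating-point rounding, every size yields the same score of 0.5.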


Phi satisfies Non-Informativeness Fixed Quality

$$\begin{aligned} Phi=\frac{TP.TN - FP.FN}{\sqrt{(TP+FN).(TN+FP).(TP+FP).(TN+FN)}} \end{aligned}$$

Phi is always zero if \(\mathcal{S}_{\lnot i}\) is non informative (see proof in this section), because then the two numerator components cancel each other:

$$\begin{aligned} TP.TN=\, & {} P(\mathcal{S}_{\lnot i}|\mathcal{G})P(\mathcal{G})P(\lnot \mathcal{S}_{\lnot i}| \lnot \mathcal{G})P(\lnot \mathcal{G})=P(\mathcal{S}_{\lnot i})P(\mathcal{G})P(\lnot \mathcal{S}_{\lnot i})P(\lnot \mathcal{G})\\=\, & {} P(\mathcal{S}_{\lnot i}|\lnot \mathcal{G})P(\mathcal{G})P(\lnot \mathcal{S}_{\lnot i}|\mathcal{G})P(\lnot \mathcal{G})\\=\, & {} P(\mathcal{S}_{\lnot i}|\lnot \mathcal{G})P(\lnot \mathcal{G})P(\lnot \mathcal{S}_{\lnot i}|\mathcal{G})P(\mathcal{G})=FP.FN \end{aligned}$$

And therefore Phi is zero. \(\square\)


Odds Ratio satisfies Non-Informativeness Fixed Quality

If \(\mathcal{S}_{\lnot i}\) is non informative:

$$\begin{aligned} Odds(\mathcal{S}_{\lnot i})= & {} \frac{TP.TN}{FN.FP}=\frac{P(\mathcal{S}_{\lnot i}|\mathcal{G})P(\mathcal{G})P(\lnot \mathcal{S}_{\lnot i}| \lnot \mathcal{G})P(\lnot \mathcal{G})}{P(\mathcal{S}_{\lnot i}|\lnot \mathcal{G})P(\lnot \mathcal{G}).P(\lnot \mathcal{S}_{\lnot i}|\mathcal{G})P(\mathcal{G})}\\= & {} \frac{P(\mathcal{S}_{\lnot i})P(\mathcal{G})P(\lnot \mathcal{S}_{\lnot i})P(\lnot \mathcal{G})}{P(\mathcal{S}_{\lnot i})P(\lnot \mathcal{G})P(\lnot \mathcal{S}_{\lnot i})P(\mathcal{G})}=1 \end{aligned}$$
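The Phi and Odds Ratio arguments above both rest on the identity \(TP \cdot TN = FP \cdot FN\) for non-informative outputs. The sketch below builds such a contingency table from two assumed marginals and evaluates both measures; the helper `non_informative_table` and the example values are hypothetical and only serve the illustration.

```python
import math
from fractions import Fraction


def non_informative_table(p_s, p_g, n=1):
    """Cells (TP, FP, FN, TN) when the output is independent of the gold standard."""
    tp = n * p_s * p_g
    fp = n * p_s * (1 - p_g)
    fn = n * (1 - p_s) * p_g
    tn = n * (1 - p_s) * (1 - p_g)
    return tp, fp, fn, tn


def phi(tp, fp, fn, tn):
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return float(tp * tn - fp * fn) / denom


def odds_ratio(tp, fp, fn, tn):
    return (tp * tn) / (fn * fp)


p_g = Fraction(1, 5)  # hypothetical prior of relevance
for p_s in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    cells = non_informative_table(p_s, p_g, n=1000)
    print(p_s, phi(*cells), odds_ratio(*cells))
```

For comparison, an informative table such as `phi(40, 10, 10, 40)` moves away from these fixed values.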



Macro Average Accuracy satisfies Non-Informativeness Fixed Quality

$$\begin{aligned} MAAc(\mathcal{S})=\frac{\frac{TP}{TP+FN}+\frac{TN}{TN+FP}}{2}=\frac{P(\mathcal{S}|\mathcal{G})+P(\lnot \mathcal{S}|\lnot \mathcal{G})}{2} \end{aligned}$$

If \(\mathcal{S}_{\lnot i}\) is non-informative then:

$$\begin{aligned} MAAc(\mathcal{S}_{\lnot i})= & {} \frac{P(\mathcal{S}_{\lnot i}|\mathcal{G})+P(\lnot \mathcal{S}_{\lnot i}|\lnot \mathcal{G})}{2}=\frac{P(\mathcal{S}_{\lnot i})+P(\lnot \mathcal{S}_{\lnot i})}{2}\\= & {} \frac{P(\mathcal{S}_{\lnot i})+1-P(\mathcal{S}_{\lnot i})}{2}=\frac{1}{2} \end{aligned}$$



Kappa statistic satisfies Non-Informativeness Fixed Quality

The Kappa statistic is defined as:

$$\begin{aligned} KapS(\mathcal{S})=\frac{\text{ Accuracy }-\text{ Random } \text{ Accuracy }}{1-\text{ Random } \text{ Accuracy }} \end{aligned}$$

where Random Accuracy represents the Accuracy obtained randomly by an output with size \(|\mathcal{S}|\). In our probabilistic notation, Kappa can be expressed as:

$$\begin{aligned} KapS(\mathcal{S})=\frac{(P(\mathcal{S}|\mathcal{G})P(\mathcal{G})+P(\lnot \mathcal{S}|\lnot \mathcal{G})P(\lnot \mathcal{G}))-(P(\mathcal{S})P(\mathcal{G})+P(\lnot \mathcal{S})P(\lnot \mathcal{G}))}{1-(P(\mathcal{S})P(\mathcal{G})+P(\lnot \mathcal{S})P(\lnot \mathcal{G}))} \end{aligned}$$

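If \(\mathcal{S}_{\lnot i}\) is non informative then \(P(\mathcal{S}_{\lnot i}|\mathcal{G})=P(\mathcal{S}_{\lnot i})\) and \(P(\lnot \mathcal{S}_{\lnot i}|\lnot \mathcal{G})=P(\lnot \mathcal{S}_{\lnot i})\), so Accuracy equals Random Accuracy, the numerator vanishes, and the formula returns zero. \(\square\)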


Chi-square satisfies Non-Informativeness Fixed Quality

$$\begin{aligned} Chi(\mathcal{S})= & {} \frac{(|\mathcal{S} \cap \mathcal{G}|.|\lnot \mathcal{S} \cap \lnot \mathcal{G}|-|\mathcal{S} \cap \lnot \mathcal{G}|.|\lnot \mathcal{S} \cap \mathcal{G}|)+|T|}{|\mathcal{S}|+|\mathcal{G}|+|\lnot \mathcal{S}|+|\lnot \mathcal{G}|}\\= & {} \frac{(P(\mathcal{S}|\mathcal{G}).P(\lnot \mathcal{S}|\lnot \mathcal{G})-P(\mathcal{S}|\lnot \mathcal{G}).P(\lnot \mathcal{S} | \mathcal{G}))+1}{2} \end{aligned}$$

If an output \(\mathcal{S}_{\lnot i}\) is non informative then:

$$\begin{aligned} Chi(\mathcal{S}_{\lnot i})=\frac{(P(\mathcal{S}_{\lnot i})P(\lnot \mathcal{S}_{\lnot i})-P(\mathcal{S}_{\lnot i})P(\lnot \mathcal{S}_{\lnot i}))+1}{2}=\frac{1}{2} \end{aligned}$$
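The last three fixed-quality results (Macro Average Accuracy, Kappa and Chi-square) can be checked together. The following sketch implements the probabilistic forms given above; the function names and the example prior are assumptions, and the Chi function follows the second (normalized) line of the definition in this section.

```python
from fractions import Fraction


def macro_average_accuracy(p_s_given_g, p_not_s_given_not_g):
    """Mean of the per-class hit rates P(S|G) and P(not S|not G)."""
    return (p_s_given_g + p_not_s_given_not_g) / 2


def kappa(p_s_given_g, p_not_s_given_not_g, p_s, p_g):
    """Kappa in the probabilistic notation used above."""
    accuracy = p_s_given_g * p_g + p_not_s_given_not_g * (1 - p_g)
    random_accuracy = p_s * p_g + (1 - p_s) * (1 - p_g)
    return (accuracy - random_accuracy) / (1 - random_accuracy)


def chi(p_s_given_g, p_not_s_given_not_g):
    """Normalized Chi score, probabilistic form of the definition above."""
    p_s_given_not_g = 1 - p_not_s_given_not_g
    p_not_s_given_g = 1 - p_s_given_g
    return (p_s_given_g * p_not_s_given_not_g
            - p_s_given_not_g * p_not_s_given_g + 1) / 2


p_g = Fraction(1, 5)  # hypothetical prior of relevance
for p_s in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    # Non-informative output: P(S|G) = P(S) and P(not S|not G) = 1 - P(S).
    print(p_s,
          macro_average_accuracy(p_s, 1 - p_s),
          kappa(p_s, 1 - p_s, p_s, p_g),
          chi(p_s, 1 - p_s))
```

Each row shows the same triple of scores regardless of the output size, which is what the Non-Informativeness Fixed Quality property requires.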



The F measure of Precision and Recall for the positive class satisfies non-informativeness growing quality

The F measure for a non-informative output grows with its size (i.e. with the ratio of items labeled as positive by the system), because

$$\begin{aligned} F_{\alpha }(\mathcal{S}_{\lnot i})=F_{\alpha }(P(\mathcal{G}|\mathcal{S}_{\lnot i}),P(\mathcal{S}_{\lnot i}|\mathcal{G}))=F_{\alpha }(P(\mathcal{G}),P(\mathcal{S}_{\lnot i})) \end{aligned}$$

The F measure Independence property (Van Rijsbergen 1974) states that, if the first parameter is fixed (in our case, \(P(\mathcal{G})\)), F grows with the second parameter (in our case, \(P(\mathcal{S}_{\lnot i})\), which is the probability that an item receives a positive label). Therefore,

$$\begin{aligned} F_{\alpha }(P(\mathcal{G}),P(\mathcal{S}_{\lnot i}))\sim P(\mathcal{S}_{\lnot i}) \end{aligned}$$

which satisfies the non-informativeness growing quality. \(\square\)
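A short numerical illustration of this growth, assuming the weighted harmonic mean form \(F_\alpha(P,R)=\left(\frac{\alpha}{P}+\frac{1-\alpha}{R}\right)^{-1}\) that is also used in the F(R,S) formula of this appendix; the function names and the example prior are illustrative assumptions.

```python
from fractions import Fraction


def f_measure(precision, recall, alpha=Fraction(1, 2)):
    """Weighted harmonic mean of precision and recall."""
    return 1 / (alpha / precision + (1 - alpha) / recall)


def f_non_informative(p_s, p_g, alpha=Fraction(1, 2)):
    # Precision P(G|S) collapses to P(G); recall P(S|G) collapses to P(S).
    return f_measure(p_g, p_s, alpha)


p_g = Fraction(1, 5)  # hypothetical prior of relevance
for p_s in (Fraction(1, 10), Fraction(1, 2), Fraction(1)):
    print(p_s, f_non_informative(p_s, p_g))
```

The score increases with the fraction of items labeled as positive and is largest for the output that returns everything.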


Every non informative output receives an F(R,S) score of at most 0.25.

Given a non informative output \(\mathcal{S}_{\lnot i}\), \(\hbox {F}_{\alpha }\)(R(\(\mathcal{S}_{\lnot i}\)), S(\(\mathcal{S}_{\lnot i}\))) can be expressed as:

$$\begin{aligned} \text{ F }_{\alpha }(\text{ R }(\mathcal{S}_{\lnot i}), \text{ S }(\mathcal{S}_{\lnot i}))= & {} \left( \frac{\alpha }{P(\mathcal{G}|\mathcal{S}_{\lnot i})P(\lnot \mathcal{G}|\lnot \mathcal{S}_{\lnot i})}+\frac{1-\alpha }{P(\mathcal{S}_{\lnot i}|\mathcal{G})P(\lnot \mathcal{S}_{\lnot i}|\lnot \mathcal{G})}\right) ^{-1}\\= & {} \left( \frac{\alpha }{P(\mathcal{G})P(\lnot \mathcal{G})}+\frac{1-\alpha }{P(\mathcal{S}_{\lnot i})P(\lnot \mathcal{S}_{\lnot i})}\right) ^{-1}\\= & {} \left( \frac{\alpha }{P(\mathcal{G})(1-P( \mathcal{G}))}+\frac{1-\alpha }{P(\mathcal{S}_{\lnot i})(1-P( \mathcal{S}_{\lnot i}))}\right) ^{-1} \end{aligned}$$

It is easy to prove that if \(0\le x \le 1\), then the function \(f(x)=x(1-x)\) is upper bounded by 0.25 (see footnote 12). Therefore, since the weighted harmonic mean is non-decreasing in both of its arguments, the maximal value of F(R,S) is:

$$\begin{aligned} \text{ F }_{\alpha }(\text{ R }(\mathcal{S}_{\lnot i}), \text{ S }(\mathcal{S}_{\lnot i}))\le \left( \frac{\alpha }{0.25}+\frac{1-\alpha }{0.25}\right) ^{-1} =\left( \frac{\alpha +1-\alpha }{0.25}\right) ^{-1}=0.25 \end{aligned}$$
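The bound can be explored with a small grid search. This sketch assumes the closed form derived above for non-informative outputs, with Reliability collapsing to \(P(\mathcal{G})(1-P(\mathcal{G}))\) and Sensitivity to \(P(\mathcal{S}_{\lnot i})(1-P(\mathcal{S}_{\lnot i}))\); the function name and the grid resolution are arbitrary choices made for illustration.

```python
from fractions import Fraction


def f_rs_non_informative(p_s, p_g, alpha=Fraction(1, 2)):
    """F(R,S) of a non-informative output, using the closed form above."""
    reliability = p_g * (1 - p_g)
    sensitivity = p_s * (1 - p_s)
    return 1 / (alpha / reliability + (1 - alpha) / sensitivity)


# Probabilities strictly between 0 and 1, in steps of 1/20.
grid = [Fraction(k, 20) for k in range(1, 20)]
best = max(f_rs_non_informative(p_s, p_g) for p_s in grid for p_g in grid)
print(best)
```

On this grid the maximum is reached only when both \(P(\mathcal{G})\) and \(P(\mathcal{S}_{\lnot i})\) equal 0.5; every other combination scores strictly below 0.25.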


Cite this article

Amigó, E., Gonzalo, J., Verdejo, F. et al. A comparison of filtering evaluation metrics based on formal constraints. Inf Retrieval J 22, 581–619 (2019). https://doi.org/10.1007/s10791-019-09355-y


  • Document filtering
  • Evaluation metrics
  • Evaluation methodologies