## Abstract

Quantification is the task of estimating, given a set \(\sigma \) of unlabelled items and a set of classes \({\mathcal {C}}=\{c_{1}, \ldots , c_{|{\mathcal {C}}|}\}\), the prevalence (or “relative frequency”) in \(\sigma \) of each class \(c_{i}\in {\mathcal {C}}\). While quantification may in principle be solved by classifying each item in \(\sigma \) and counting how many such items have been labelled with \(c_{i}\), it has long been shown that this “classify and count” method yields suboptimal quantification accuracy. As a result, quantification is no longer considered a mere byproduct of classification, and has evolved as a task of its own. While the scientific community has devoted a lot of attention to devising more accurate quantification methods, it has devoted little attention to discussing what properties an *evaluation measure for quantification* (EMQ) should enjoy, and which EMQs should be adopted as a result. This paper lays down a number of interesting properties that an EMQ may or may not enjoy, discusses whether (and when) each of these properties is desirable, surveys the EMQs that have been used so far, and discusses whether or not they enjoy the above properties. As a result of this investigation, some of the EMQs that have been used in the literature turn out to be severely unfit, while others emerge as closer to what the quantification community actually needs.
However, a significant result is that no existing EMQ satisfies all the properties identified as desirable, thus indicating that more research is needed in order to identify (or synthesize) a truly adequate EMQ.


## Notes

- 1.
Consistently with most mathematical literature, we use the caret symbol (\(^{\hat{\,}}\)) to indicate estimation.

- 2.
In order to keep things simple we avoid overspecifying the notation, thus leaving some aspects of it implicit; e.g., in order to indicate a true distribution *p* of the unlabelled items in a sample \(\sigma \) across a codeframe \({\mathcal {C}}\) we will simply write *p* instead of the more cumbersome \(p_{\sigma }^{{\mathcal {C}}}\), thus letting \(\sigma \) and \({\mathcal {C}}\) be inferred from context.

- 3.
In this paper we do not discuss the evaluation of *ordinal quantification* (OQ), defined as SLQ with a codeframe \({\mathcal {C}}=\{c_{1}, \ldots , c_{|{\mathcal {C}}|}\}\) on which a total order \(c_{1} \prec \cdots \prec c_{|{\mathcal {C}}|}\) is defined. Aside from reasons of space, the reason for disregarding OQ is that there has been very little work on it [the only papers we know of being Da San Martino et al. (2016a, b) and Esuli (2016)], and that only one measure for OQ (the *Earth Mover’s Distance*; see Esuli and Sebastiani 2010) has been proposed and used so far. For the same reasons we do not discuss *regression quantification* (RQ), the task that stands to metric regression as single-label quantification stands to single-label classification. RQ has been studied even less than OQ, the only work to have appeared on this theme so far being, to the best of our knowledge, Bella et al. (2014), which proposed the *Cramér-von Mises u-statistic* as an evaluation measure (see Bella et al. 2014 for details).

- 4.
Note that two distributions *p*(*c*) and \({\hat{p}}(c)\) over \({\mathcal {C}}\) are essentially two nonnegative-valued, length-normalized vectors of dimensionality \(|{\mathcal {C}}|\). The literature on EMQs thus obviously intersects the literature on functions for computing the similarity of two vectors.

- 5.
A divergence is often indicated by the notation \(D(p\,||\,{\hat{p}})\); we will prefer the more neutral notation \(D(p,{\hat{p}})\). Note also that a divergence can take as arguments any two distributions *p* and *q* defined on the same space of events, i.e., *p* and *q* need not be a true distribution and a predicted distribution. However, since we will consider divergences only as measures of fit between a true distribution and a predicted distribution, we will use the more specific notation \(D(p,{\hat{p}})\) rather than the more general *D*(*p*,*q*).

- 6.
By the “range” of an EMQ here we actually mean its *image* (i.e., the set of values that the EMQ actually takes for its admissible input values), and not just its codomain.

- 7.
One might argue that underestimating the prevalence of a class \(c_{1}\) always implies overestimating the prevalence of another class \(c_{2}\). However, there are cases in which \(c_{1}\) and \(c_{2}\) are not equally important. For instance, if \({\mathcal {C}}=\{c_{1},c_{2}\}\), with \(c_{1}\) the class of patients that suffer from a certain rare disease (say, one such that \(p(c_{1})=.0001\)) and \(c_{2}\) the class of patients who do not, the class whose prevalence we really want to quantify is \(c_{1}\), the prevalence of \(c_{2}\) being derivative. So, what we really care about is that underestimating \(p(c_{1})\) and overestimating \(p(c_{1})\) are equally penalized. The formulation of IMP, which involves underestimation and overestimation in a perfectly symmetric way, is strong enough that IMP is not satisfied (as we will see in Sect. 4) by a number of important EMQs.

- 8.
- 9.
The symbol \({\pm }\) stands for “plus or minus” while \({\mp }\) stands for “minus or plus”; when symbol \({\pm }\) evaluates to \(+\), symbol \({\mp }\) evaluates to −, and vice versa.

- 10.
This is the basis of the “Strict Monotonicity” property discussed in Sebastiani (2015) for the evaluation of classification systems.

- 11.
In Eq. 14 and in the rest of this paper the \(\log \) operator denotes the natural logarithm.

- 12.
Since the standard logistic function \(\frac{e^{x}}{e^{x}+1}\) ranges (for the domain \([0,+\,\infty )\) we are interested in) on [\(\frac{1}{2}\),1], a multiplication by 2 is applied in order for it to range on [1,2], and 1 is subtracted in order for it to range on [0,1], as desired.

- 13.
Esuli and Sebastiani (2014) mistakenly defined \({{\,\mathrm{NKLD}\,}}(p,{\hat{p}})\) as \(\frac{e^{{{\,\mathrm{KLD}\,}}(p,{\hat{p}})}-1}{e^{{{\,\mathrm{KLD}\,}}(p,{\hat{p}})}}\); this was later corrected into the formulation of Eq. 16 [which is equivalent to \(\frac{e^{{{\,\mathrm{KLD}\,}}(p,{\hat{p}})}-1}{e^{{{\,\mathrm{KLD}\,}}(p,{\hat{p}})}+1}\)] by Gao and Sebastiani (2016).

- 14.
This is true only at a first approximation, though. In more precise terms, the maximum value that \({{\,\mathrm{NKLD}\,}}\) can have is strictly smaller than 1 because the maximum value that \({{\,\mathrm{KLD}\,}}\) can have is finite (see Eq. 15) and, as discussed at the end of Sect. 4.7, dependent on

*p*, on the cardinality of \({\mathcal {C}}\), and even on the value of \(\epsilon \); as a result, the maximum value that \({{\,\mathrm{NKLD}\,}}\) can have is also dependent on these three variables (although it is always very close to 1—see the example in “Appendix 2.1” section). - 15.
It has to be remarked that, in some cases, differences of the latter type may be moderate, especially when \(|{\mathcal {C}}|\) is high. For instance, when \(|{\mathcal {C}}|=2\) the value of \(z_{AE}\) ranges on [0.5, 1.0], but when \(|{\mathcal {C}}|=10\) it ranges on [0.18, 0.20].

- 16.
A similar situation occurs when evaluating multi-label classification via “microaveraged \(F_{1}\)”, a measure in which the classes with higher prevalence weigh more on the final result.

- 17.
It is this author’s experience that even measures such as \(F_{1}\) can be considered by customers “esoteric”.

- 18.
As an example, assume a (very realistic) scenario in which \(|\sigma |=1000\), \({\mathcal {C}}=\{c_{1},c_{2}\}\), \(p(c_{1})=0.01\), and in which three different quantifiers \({\hat{p}}'\), \({\hat{p}}''\), \({\hat{p}}'''\) are such that \({\hat{p}}'(c_{1})=0.0101\), \({\hat{p}}''(c_{1})=0.0110\), \({\hat{p}}'''(c_{1})=0.0200\). In this scenario \({{\,\mathrm{KLD}\,}}\) ranges on [0, 7.46], \({{\,\mathrm{KLD}\,}}(p,{\hat{p}}')=4.78\times 10^{-7}\), \({{\,\mathrm{KLD}\,}}(p,{\hat{p}}'')=4.53\times 10^{-5}\), \({{\,\mathrm{KLD}\,}}(p,{\hat{p}}''')=3.02\times 10^{-3}\), i.e., the difference between \({{\,\mathrm{KLD}\,}}(p,{\hat{p}}')\) and \({{\,\mathrm{KLD}\,}}(p,{\hat{p}}'')\) (and the one between \({{\,\mathrm{KLD}\,}}(p,{\hat{p}}'')\) and \({{\,\mathrm{KLD}\,}}(p,{\hat{p}}''')\)) is 2 orders of magnitude, while the difference between \({{\,\mathrm{KLD}\,}}(p,{\hat{p}}')\) and \({{\,\mathrm{KLD}\,}}(p,{\hat{p}}''')\) is no less than 4 orders of magnitude. The increase in error (as computed by \({{\,\mathrm{KLD}\,}}\)) deriving from using \({\hat{p}}'''\) instead of \({\hat{p}}'\) is +632,599%.

- 19.
We assume \(|D|=1{,}000{,}000\). This assumption has no relevance on the qualitative conclusions we draw here, and only affects the magnitude of the values in the table (since the value of |

*D*| affects the value of \(\epsilon \), and thus of \({{\,\mathrm{RAE}\,}}\), \({{\,\mathrm{NRAE}\,}}\), \({{\,\mathrm{DR}\,}}\), \({{\,\mathrm{KLD}\,}}\), \({{\,\mathrm{NKLD}\,}}\), \({{\,\mathrm{PD}\,}}\)—see Sect. 4.3) and following. - 20.
For the EMQs that require smoothed probabilities to be used, these definitions obviously need to be replaced by \(a\equiv p_{s}(c_{1})\) and \(x\equiv {\hat{p}}_{s}(c_{1})\).
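The numerical example of Footnote 18 can be reproduced with a short sketch. This assumes the additive smoothing \(p_{s}(c)=(p(c)+\epsilon )/(1+\epsilon |{\mathcal {C}}|)\) with \(\epsilon =1/(2|\sigma |)\) discussed in Sect. 4.3; with it, the three \({{\,\mathrm{KLD}\,}}\) scores come out at roughly \(4.8\times 10^{-7}\), \(4.5\times 10^{-5}\), and \(3.0\times 10^{-3}\), matching the footnote:

```python
import math

def smooth(p, eps):
    # Additive smoothing (assumed form, cf. Sect. 4.3):
    # p_s(c) = (p(c) + eps) / (1 + eps * |C|)
    d = 1 + eps * len(p)
    return [(v + eps) / d for v in p]

def kld(p, p_hat, eps):
    # KLD(p, p_hat) = sum over c of p_s(c) * ln(p_s(c) / p_hat_s(c))
    ps, phs = smooth(p, eps), smooth(p_hat, eps)
    return sum(t * math.log(t / h) for t, h in zip(ps, phs))

eps = 1 / (2 * 1000)                 # eps = 1 / (2|sigma|), with |sigma| = 1000
p = [0.01, 0.99]
k1 = kld(p, [0.0101, 0.9899], eps)   # ~ 4.78e-07
k2 = kld(p, [0.0110, 0.9890], eps)   # ~ 4.53e-05
k3 = kld(p, [0.0200, 0.9800], eps)   # ~ 3.02e-03
```

The spread of several orders of magnitude between nearly indistinguishable quantifiers is exactly the behaviour the footnote criticizes.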

## References

Alaíz-Rodríguez, R., Guerrero-Curieses, A., & Cid-Sueiro, J. (2011). Class and subclass probability re-estimation to adapt a classifier in the presence of concept drift. *Neurocomputing*, *74*(16), 2614–2623. https://doi.org/10.1016/j.neucom.2011.03.019.

Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. *Journal of the Royal Statistical Society, Series B*, *28*(1), 131–142. https://doi.org/10.1111/j.2517-6161.1966.tb00626.x.

Amigó, E., Gonzalo, J., & Verdejo, F. (2011). A comparison of evaluation metrics for document filtering. In *Proceedings of the 2nd international conference of the cross-language evaluation forum (CLEF 2011)*, Amsterdam, NL (pp. 38–49). https://doi.org/10.1007/978-3-642-23708-9_6.

Baccianella, S., Esuli, A., & Sebastiani, F. (2013). Variable-constraint classification and quantification of radiology reports under the ACR index. *Expert Systems and Applications*, *40*(9), 3441–3449. https://doi.org/10.1016/j.eswa.2012.12.052.

Barranquero, J., Díez, J., & del Coz, J. J. (2015). Quantification-oriented learning based on reliable classifiers. *Pattern Recognition*, *48*(2), 591–604. https://doi.org/10.1016/j.patcog.2014.07.032.

Barranquero, J., González, P., Díez, J., & del Coz, J. J. (2013). On the study of nearest neighbor algorithms for prevalence estimation in binary problems. *Pattern Recognition*, *46*(2), 472–482. https://doi.org/10.1016/j.patcog.2012.07.022.

Beijbom, O., Hoffman, J., Yao, E., Darrell, T., Rodriguez-Ramirez, A., Gonzalez-Rivero, M., et al. (2015). Quantification in-the-wild: Data-sets and baselines. Presented at the *NIPS 2015 workshop on transfer and multi-task learning*, Montreal, CA. CoRR arXiv:1510.04811.

Bella, A., Ferri, C., Hernández-Orallo, J., & Ramírez-Quintana, M. J. (2010). Quantification via probability estimators. In *Proceedings of the 11th IEEE international conference on data mining (ICDM 2010)*, Sydney, AU (pp. 737–742). https://doi.org/10.1109/icdm.2010.75.

Bella, A., Ferri, C., Hernández-Orallo, J., & Ramírez-Quintana, M. J. (2014). Aggregative quantification for regression. *Data Mining and Knowledge Discovery*, *28*(2), 475–518. https://doi.org/10.1007/s10618-013-0308-z.

Busin, L., & Mizzaro, S. (2013). Axiometrics: An axiomatic approach to information retrieval effectiveness metrics. In *Proceedings of the 4th international conference on the theory of information retrieval (ICTIR 2013)*, Copenhagen, DK (p. 8). https://doi.org/10.1145/2499178.2499182.

Card, D., & Smith, N. A. (2018). The importance of calibration for estimating proportions from annotations. In *Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics (HLT-NAACL 2018)*, New Orleans, US (pp. 1636–1646). https://doi.org/10.18653/v1/n18-1148.

Ceron, A., Curini, L., & Iacus, S. M. (2016). iSA: A fast, scalable and accurate algorithm for sentiment analysis of social media content. *Information Sciences*, *367–368*, 105–124. https://doi.org/10.1016/j.ins.2016.05.052.

Csiszár, I., & Shields, P. C. (2004). Information theory and statistics: A tutorial. *Foundations and Trends in Communications and Information Theory*, *1*(4), 417–528. https://doi.org/10.1561/0100000004.

Da San Martino, G., Gao, W., & Sebastiani, F. (2016a). Ordinal text quantification. In *Proceedings of the 39th ACM conference on research and development in information retrieval (SIGIR 2016)*, Pisa, IT (pp. 937–940). https://doi.org/10.1145/2911451.2914749.

Da San Martino, G., Gao, W., & Sebastiani, F. (2016b). QCRI at SemEval-2016 task 4: Probabilistic methods for binary and ordinal quantification. In *Proceedings of the 10th international workshop on semantic evaluation (SemEval 2016)*, San Diego, US (pp. 58–63). https://doi.org/10.18653/v1/s16-1006.

dos Reis, D. M., Maletzke, A., Cherman, E., & Batista, G. E. (2018a). One-class quantification. In *Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML-PKDD 2018)*, Dublin, IE.

dos Reis, D. M., Maletzke, A. G., Silva, D. F., & Batista, G. E. (2018b). Classifying and counting with recurrent contexts. In *Proceedings of the 24th ACM international conference on knowledge discovery and data mining (KDD 2018)*, London, UK (pp. 1983–1992). https://doi.org/10.1145/3219819.3220059.

du Plessis, M. C., Niu, G., & Sugiyama, M. (2017). Class-prior estimation for learning from positive and unlabeled data. *Machine Learning*, *106*(4), 463–492. https://doi.org/10.1007/s10994-016-5604-6.

du Plessis, M. C., & Sugiyama, M. (2012). Semi-supervised learning of class balance under class-prior change by distribution matching. In *Proceedings of the 29th international conference on machine learning (ICML 2012)*, Edinburgh, UK.

du Plessis, M. C., & Sugiyama, M. (2014). Class prior estimation from positive and unlabeled data. *IEICE Transactions*, *97-D*(5), 1358–1362. https://doi.org/10.1587/transinf.e97.d.1358.

Esuli, A. (2016). ISTI-CNR at SemEval-2016 task 4: Quantification on an ordinal scale. In *Proceedings of the 10th international workshop on semantic evaluation (SemEval 2016)*, San Diego, US. https://doi.org/10.18653/v1/s16-1011.

Esuli, A., Moreo, A., & Sebastiani, F. (2018). A recurrent neural network for sentiment quantification. In *Proceedings of the 27th ACM international conference on information and knowledge management (CIKM 2018)*, Torino, IT (pp. 1775–1778). https://doi.org/10.1145/3269206.3269287.

Esuli, A., & Sebastiani, F. (2010). Sentiment quantification. *IEEE Intelligent Systems*, *25*(4), 72–75.

Esuli, A., & Sebastiani, F. (2014). Explicit loss minimization in quantification applications (preliminary draft). In *Proceedings of the 8th international workshop on information filtering and retrieval (DART 2014)*, Pisa, IT (pp. 1–11).

Esuli, A., & Sebastiani, F. (2015). Optimizing text quantifiers for multivariate loss functions. *ACM Transactions on Knowledge Discovery from Data*, *9*(4), 27. https://doi.org/10.1145/2700406.

Ferrante, M., Ferro, N., & Maistro, M. (2015). Towards a formal framework for utility-oriented measurements of retrieval effectiveness. In *Proceedings of the 5th ACM international conference on the theory of information retrieval (ICTIR 2015)*, Northampton, US (pp. 21–30). https://doi.org/10.1145/2808194.2809452.

Ferrante, M., Ferro, N., & Pontarollo, S. (2018). A general theory of IR evaluation measures. *IEEE Transactions on Knowledge and Data Engineering*. https://doi.org/10.1109/TKDE.2018.2840708.

Forman, G. (2005). Counting positives accurately despite inaccurate classification. In *Proceedings of the 16th European conference on machine learning (ECML 2005)*, Porto, PT (pp. 564–575). https://doi.org/10.1007/11564096_55.

Forman, G. (2006). Quantifying trends accurately despite classifier error and class imbalance. In *Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2006)*, Philadelphia, US (pp. 157–166). https://doi.org/10.1145/1150402.1150423.

Forman, G. (2008). Quantifying counts and costs via classification. *Data Mining and Knowledge Discovery*, *17*(2), 164–206. https://doi.org/10.1007/s10618-008-0097-y.

Gao, W., & Sebastiani, F. (2015). Tweet sentiment: From classification to quantification. In *Proceedings of the 7th international conference on advances in social network analysis and mining (ASONAM 2015)*, Paris, FR (pp. 97–104). https://doi.org/10.1145/2808797.2809327.

Gao, W., & Sebastiani, F. (2016). From classification to quantification in tweet sentiment analysis. *Social Network Analysis and Mining*, *6*(19), 1–22. https://doi.org/10.1007/s13278-016-0327-z.

González, P., Álvarez, E., Díez, J., López-Urrutia, Á., & del Coz, J. J. (2017). Validation methods for plankton image classification systems. *Limnology and Oceanography: Methods*, *15*, 221–237. https://doi.org/10.1002/lom3.10151.

González, P., Castaño, A., Chawla, N. V., & del Coz, J. J. (2017). A review on quantification learning. *ACM Computing Surveys*, *50*(5), 74:1–74:40. https://doi.org/10.1145/3117807.

González, P., Díez, J., Chawla, N., & del Coz, J. J. (2017). Why is quantification an interesting learning problem? *Progress in Artificial Intelligence*, *6*(1), 53–58. https://doi.org/10.1007/s13748-016-0103-3.

González-Castro, V., Alaiz-Rodríguez, R., & Alegre, E. (2013). Class distribution estimation based on the Hellinger distance. *Information Sciences*, *218*, 146–164. https://doi.org/10.1016/j.ins.2012.05.028.

González-Castro, V., Alaiz-Rodríguez, R., Fernández-Robles, L., Guzmán-Martínez, R., & Alegre, E. (2010). Estimating class proportions in boar semen analysis using the Hellinger distance. In *Proceedings of the 23rd international conference on industrial engineering and other applications of applied intelligent systems (IEA/AIE 2010)*, Cordoba, ES (pp. 284–293). https://doi.org/10.1007/978-3-642-13022-9_29.

Hopkins, D. J., & King, G. (2010). A method of automated nonparametric content analysis for social science. *American Journal of Political Science*, *54*(1), 229–247. https://doi.org/10.1111/j.1540-5907.2009.00428.x.

Kar, P., Li, S., Narasimhan, H., Chawla, S., & Sebastiani, F. (2016). Online optimization methods for the quantification problem. In *Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2016)*, San Francisco, US (pp. 1625–1634). https://doi.org/10.1145/2939672.2939832.

Keith, K. A., & O’Connor, B. (2018). Uncertainty-aware generative models for inferring document class prevalence. In *Proceedings of the conference on empirical methods in natural language processing (EMNLP 2018)*, Brussels, BE.

King, G., & Ying, L. (2008). Verbal autopsy methods with multiple causes of death. *Statistical Science*, *23*(1), 78–91. https://doi.org/10.1214/07-sts247.

Levin, R., & Roitman, H. (2017). Enhanced probabilistic classify and count methods for multi-label text quantification. In *Proceedings of the 7th ACM international conference on the theory of information retrieval (ICTIR 2017)*, Amsterdam, NL (pp. 229–232). https://doi.org/10.1145/3121050.3121083.

Liese, F., & Vajda, I. (2006). On divergences and informations in statistics and information theory. *IEEE Transactions on Information Theory*, *52*(10), 4394–4412. https://doi.org/10.1109/tit.2006.881731.

Lin, J. (1991). Divergence measures based on the Shannon entropy. *IEEE Transactions on Information Theory*, *37*(1), 145–151. https://doi.org/10.1109/18.61115.

MacKay, D. J. (2003). *Information theory, inference and learning algorithms*. Cambridge: Cambridge University Press.

Maletzke, A. G., dos Reis, D. M., & Batista, G. E. (2017). Quantification in data streams: Initial results. In *Proceedings of the 2017 Brazilian conference on intelligent systems (BRACIS 2017)*, Uberlândia, BZ (pp. 43–48). https://doi.org/10.1109/BRACIS.2017.74.

Maletzke, A. G., Moreira dos Reis, D., & Batista, G. E. (2018). Combining instance selection and self-training to improve data stream quantification. *Journal of the Brazilian Computer Society*, *24*(12), 43–48. https://doi.org/10.1186/s13173-018-0076-0.

Milli, L., Monreale, A., Rossetti, G., Giannotti, F., Pedreschi, D., & Sebastiani, F. (2013). Quantification trees. In *Proceedings of the 13th IEEE international conference on data mining (ICDM 2013)*, Dallas, US (pp. 528–536). https://doi.org/10.1109/icdm.2013.122.

Milli, L., Monreale, A., Rossetti, G., Pedreschi, D., Giannotti, F., & Sebastiani, F. (2015). Quantification in social networks. In *Proceedings of the 2nd IEEE international conference on data science and advanced analytics (DSAA 2015)*, Paris, FR. https://doi.org/10.1109/dsaa.2015.7344845.

Moffat, A. (2013). Seven numeric properties of effectiveness metrics. In *Proceedings of the 9th conference of the Asia information retrieval societies (AIRS 2013)*, Singapore, SN (pp. 1–12). https://doi.org/10.1007/978-3-642-45068-6_1.

Nakov, P., Farra, N., & Rosenthal, S. (2017). SemEval-2017 task 4: Sentiment analysis in Twitter. In *Proceedings of the 11th international workshop on semantic evaluation (SemEval 2017)*, Vancouver, CA. https://doi.org/10.18653/v1/s17-2088.

Nakov, P., Ritter, A., Rosenthal, S., Sebastiani, F., & Stoyanov, V. (2016). SemEval-2016 task 4: Sentiment analysis in Twitter. In *Proceedings of the 10th international workshop on semantic evaluation (SemEval 2016)*, San Diego, US (pp. 1–18). https://doi.org/10.18653/v1/s16-1001.

Pérez-Gállego, P., Castaño, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. *Information Fusion*, *45*, 1–15. https://doi.org/10.1016/j.inffus.2018.01.001.

Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. *Information Fusion*, *34*, 87–100. https://doi.org/10.1016/j.inffus.2016.07.001.

Saerens, M., Latinne, P., & Decaestecker, C. (2002). Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. *Neural Computation*, *14*(1), 21–41. https://doi.org/10.1162/089976602753284446.

Sanya, A., Kumar, P., Kar, P., Chawla, S., & Sebastiani, F. (2018). Optimizing non-decomposable measures with deep networks. *Machine Learning*, *107*(8–10), 1597–1620. https://doi.org/10.1007/s10994-018-5736-y.

Sebastiani, F. (2015). An axiomatically derived measure for the evaluation of classification algorithms. In *Proceedings of the 5th ACM international conference on the theory of information retrieval (ICTIR 2015)*, Northampton, US (pp. 11–20). https://doi.org/10.1145/2808194.2809449.

Tang, L., Gao, H., & Liu, H. (2010). Network quantification despite biased labels. In *Proceedings of the 8th workshop on mining and learning with graphs (MLG 2010)*, Washington, US (pp. 147–154). https://doi.org/10.1145/1830252.1830271.

Tasche, D. (2017). Fisher consistency for prior probability shift. *Journal of Machine Learning Research*, *18*, 95:1–95:32.

Vaz, A. F., Izbicki, R., & Stern, R. B. (2018). Quantification under prior probability shift: The ratio estimator and its extensions. arXiv preprint arXiv:1807.03929.

Zhang, Z., & Zhou, J. (2010). Transfer estimation of evolving class priors in data stream classification. *Pattern Recognition*, *43*(9), 3151–3161. https://doi.org/10.1016/j.patcog.2010.03.021.

## Acknowledgements

This work has benefitted from many discussions that I have had over the years with Andrea Esuli, Wei Gao, Ercan Kuruoglu, and Alejandro Moreo.

## Author information

### Affiliations

### Corresponding author

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendices

### Appendix 1: Properties of \({{\,\mathrm{AE}\,}}\)

We here prove that \({{\,\mathrm{AE}\,}}\) enjoys IoI, NN, IND, MON, IMP, ABS. While some of these proofs are trivial, we report them in detail in order to show how the same arguments can be used to prove the same properties for many of the other EMQs discussed in Sect. 4.

\({{\,\mathrm{AE}\,}}\) enjoys IoI. In fact, \({{\,\mathrm{AE}\,}}(p,{\hat{p}})=\frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}|{\hat{p}}(c)-p(c)|=0\) implies that \(\sum _{c\in {\mathcal {C}}}|{\hat{p}}(c)-p(c)|=0\); given that \(\sum _{c\in {\mathcal {C}}}|{\hat{p}}(c)-p(c)|\) is a sum of nonnegative terms, this implies that \(|{\hat{p}}(c)-p(c)|=0\) for all \(c\in {\mathcal {C}}\), i.e., \({\hat{p}}(c)=p(c)\) for all \(c\in {\mathcal {C}}\). Conversely, if \({\hat{p}}=p\), then \(\frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}|{\hat{p}}(c)-p(c)|=0\). \(\square \)

\({{\,\mathrm{AE}\,}}\) enjoys NN. Quite obviously, \(\frac{1}{|{\mathcal {C}}|}\ge 0\) and \(\sum _{c\in {\mathcal {C}}}|{\hat{p}}(c)-p(c)|\ge 0\), which implies that \(\frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}|{\hat{p}}(c)-p(c)|\ge 0\). \(\square \)
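As a quick empirical companion to these proofs, the IoI ("if" direction) and NN properties of \({{\,\mathrm{AE}\,}}\) can be spot-checked on randomly generated distributions; a minimal sketch (`absolute_error` and `random_distribution` are our own names, not the paper's notation):

```python
import random

def absolute_error(p, p_hat):
    # AE(p, p_hat) = (1 / |C|) * sum over c of |p_hat(c) - p(c)|
    return sum(abs(h - t) for t, h in zip(p, p_hat)) / len(p)

def random_distribution(k):
    # A random point on the probability simplex of dimensionality k
    v = [random.random() for _ in range(k)]
    s = sum(v)
    return [x / s for x in v]

random.seed(0)
for _ in range(1000):
    k = random.randint(2, 10)
    p, p_hat = random_distribution(k), random_distribution(k)
    assert absolute_error(p, p_hat) >= 0   # NN: AE is never negative
    assert absolute_error(p, p) == 0       # IoI, "if" direction: AE(p, p) = 0
```

Such randomized checks cannot replace the proofs, but they are a cheap sanity test when implementing an EMQ.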

\({{\,\mathrm{AE}\,}}\) enjoys IND. Given codeframe \({\mathcal {C}}=\{c_{1}, \ldots , c_{k}, c_{k+1}, \ldots , c_{n}\}\), for any true distribution *p* on \({\mathcal {C}}\) and predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\) on \({\mathcal {C}}\) such that \({\hat{p}}'(c)= {\hat{p}}''(c)\) for all \(c\in \{c_{k+1}, \ldots , c_{n}\}\), the inequality

\(\frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}|{\hat{p}}'(c)-p(c)| \le \frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}|{\hat{p}}''(c)-p(c)|\)

resolves to

\(\sum _{i=1}^{k}|{\hat{p}}'(c_{i})-p(c_{i})| \le \sum _{i=1}^{k}|{\hat{p}}''(c_{i})-p(c_{i})|\)

since the terms for \(c\in \{c_{k+1}, \ldots , c_{n}\}\) are identical on the two sides; i.e., whether the inequality holds does not depend on the classes on which \({\hat{p}}'\) and \({\hat{p}}''\) coincide.

\(\square \)

\({{\,\mathrm{AE}\,}}\) enjoys MON. This can be proven by showing that \({{\,\mathrm{AE}\,}}\) enjoys B-MON, since we have proven that it enjoys IND. Given codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\) and true distribution *p*, if predicted distributions \({\hat{p}}',{\hat{p}}''\) are such that \({\hat{p}}''(c_{1})< {\hat{p}}'(c_{1})\le p(c_{1})\), then it holds that

\({{\,\mathrm{AE}\,}}(p,{\hat{p}}'')=\frac{1}{2}\left( |{\hat{p}}''(c_{1})-p(c_{1})|+|{\hat{p}}''(c_{2})-p(c_{2})|\right) =p(c_{1})-{\hat{p}}''(c_{1})>p(c_{1})-{\hat{p}}'(c_{1})={{\,\mathrm{AE}\,}}(p,{\hat{p}}')\)

\(\square \)

\({{\,\mathrm{AE}\,}}\) enjoys IMP. This can be shown by showing that \({{\,\mathrm{AE}\,}}\) enjoys B-IMP, since we have proven that it enjoys IND. Given codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\), true distribution *p*, predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\), and constant \(a\ge 0\) such that \({\hat{p}}'(c_{1})=p(c_{1})+a\) and \({\hat{p}}''(c_{1})=p(c_{1})-a\), it holds that

\({{\,\mathrm{AE}\,}}(p,{\hat{p}}')=\frac{1}{2}\left( |{\hat{p}}'(c_{1})-p(c_{1})|+|{\hat{p}}'(c_{2})-p(c_{2})|\right) =\frac{1}{2}(a+a)=a={{\,\mathrm{AE}\,}}(p,{\hat{p}}'')\)

\(\square \)

\({{\,\mathrm{AE}\,}}\) enjoys ABS. This can be shown by showing that \({{\,\mathrm{AE}\,}}\) enjoys B-ABS, since we have proven that it enjoys IND. Given codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\), constant \(a>0\), true distributions \(p'\) and \(p''\) such that \(p'(c_{1})<p''(c_{1})\) and \(p''(c_{1})<p''(c_{2})\), if a predicted distribution \({\hat{p}}'\) that estimates \(p'\) is such that \({\hat{p}}'(c_{1})=p'(c_{1})\pm a\) and a predicted distribution \({\hat{p}}''\) that estimates \(p''\) is such that \({\hat{p}}''(c_{1})=p''(c_{1})\pm a\), then it holds that

\({{\,\mathrm{AE}\,}}(p',{\hat{p}}')=\frac{1}{2}\left( |{\hat{p}}'(c_{1})-p'(c_{1})|+|{\hat{p}}'(c_{2})-p'(c_{2})|\right) =\frac{1}{2}(a+a)=a={{\,\mathrm{AE}\,}}(p'',{\hat{p}}'')\)

\(\square \)

### Appendix 2: Testing for **MAX**, **IMP**, **ABS**, **REL**

In this section we present simple tests aimed at establishing that a certain EMQ *D* does *not* enjoy a certain property \(\pi \in \{\mathrm{MAX}, {\mathrm{IMP}}, {\mathrm{ABS}}, {\mathrm{REL}}\}\). The basic pattern of these tests is to show that \(\pi \) does not hold for *D* by providing a counterexample. In particular, given a concrete scenario *s* characterized by (1) a codeframe \({\mathcal {C}}\), (2) one or more true distributions \(p_{1}\), \(p_{2}\), ..., and (3) one or more predicted distributions \({\hat{p}}_{1}\), \({\hat{p}}_{2}\), ..., the test attempts to check whether the scenario satisfies the particular constraint that is required for property \(\pi \) to hold. Since for *D* to enjoy property \(\pi \) the constraint is required to hold for all scenarios, if \(\pi \) does not hold in *s* we can conclude that *D* does not enjoy \(\pi \). Instead, if \(\pi \) does hold in *s* we can conclude nothing, and thus need to study the issue further.

### Appendix 2.1: A counterexample for **MAX**

In the test for MAX we consider the scenario described in the following table

|  | \(p(c_{1})\) | \(p(c_{2})\) | \({\hat{p}}(c_{1})\) | \({\hat{p}}(c_{2})\) | \({{\,\mathrm{AE}\,}}\) | \({{\,\mathrm{NAE}\,}}\) | \({{\,\mathrm{RAE}\,}}\) | \({{\,\mathrm{NRAE}\,}}\) | \({{\,\mathrm{SE}\,}}\) | \({{\,\mathrm{DR}\,}}\) | \({{\,\mathrm{KLD}\,}}\) | \({{\,\mathrm{NKLD}\,}}\) | \({{\,\mathrm{PD}\,}}\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(p'\) | 0.01 | 0.99 | 1.00 | 0.00 | 0.9900 | 1.0000 | 49.9975 | 1.0000 | 0.9801 | 0.9950 | 14.3076 | 0.9999 | 980100.0004 |
| \(p''\) | 0.49 | 0.51 | 1.00 | 0.00 | 0.5100 | 1.0000 | 1.0204 | 1.0000 | 0.2601 | 0.7550 | 6.7065 | 0.9975 | 260100.0001 |

and characterized by two different true distributions (1st and 2nd row) across the same codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\).^{Footnote 19} The test consists of checking whether their respective perverse estimators obtain from *D* the same score: if the values of measure *D* in the two rows are not the same, this implies that *D* does not satisfy MAX (if they are the same, this does *not* necessarily mean that *D* satisfies MAX). Concerning the values obtained by \({{\,\mathrm{NKLD}\,}}\), see the discussion in Footnote 14.

The table shows that none of \({{\,\mathrm{AE}\,}}\), \({{\,\mathrm{RAE}\,}}\), \({{\,\mathrm{SE}\,}}\), \({{\,\mathrm{DR}\,}}\), \({{\,\mathrm{KLD}\,}}\), \({{\,\mathrm{PD}\,}}\) satisfies MAX.
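The first row of the table can be reproduced numerically. The sketch below assumes the \(\frac{1}{|{\mathcal {C}}|}\)-normalized definitions of \({{\,\mathrm{AE}\,}}\), \({{\,\mathrm{SE}\,}}\) and \({{\,\mathrm{RAE}\,}}\) (consistent with the appendix proofs and with the tabulated values), with \({{\,\mathrm{RAE}\,}}\) computed on prevalences smoothed with \(\epsilon =1/(2|D|)\) and \(|D|=1{,}000{,}000\) as per Footnote 19:

```python
def smooth(p, eps):
    # Additive smoothing of Sect. 4.3: p_s(c) = (p(c) + eps) / (1 + eps * |C|);
    # RAE needs it here because the perverse estimator assigns p_hat(c2) = 0
    d = 1 + eps * len(p)
    return [(v + eps) / d for v in p]

def ae(p, ph):
    return sum(abs(h - t) for t, h in zip(p, ph)) / len(p)

def se(p, ph):
    return sum((h - t) ** 2 for t, h in zip(p, ph)) / len(p)

def rae(p, ph, eps):
    ps, phs = smooth(p, eps), smooth(ph, eps)
    return sum(abs(h - t) / t for t, h in zip(ps, phs)) / len(ps)

eps = 1 / (2 * 1_000_000)                  # eps = 1 / (2|D|)
p, p_perverse = [0.01, 0.99], [1.00, 0.00]
print(round(ae(p, p_perverse), 4))         # 0.99
print(round(se(p, p_perverse), 4))         # 0.9801
print(round(rae(p, p_perverse, eps), 4))   # 49.9975
```

Running the same functions on the second row of the table shows directly that these measures assign different scores to the two perverse estimators, i.e., that they violate MAX.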

### Appendix 2.2: A counterexample for **IMP**

In the test for IMP we consider the scenario described in the following table

|  | \(p(c_{1})\) | \(p(c_{2})\) | \({\hat{p}}(c_{1})\) | \({\hat{p}}(c_{2})\) | \({{\,\mathrm{AE}\,}}\) | \({{\,\mathrm{NAE}\,}}\) | \({{\,\mathrm{RAE}\,}}\) | \({{\,\mathrm{NRAE}\,}}\) | \({{\,\mathrm{SE}\,}}\) | \({{\,\mathrm{DR}\,}}\) | \({{\,\mathrm{KLD}\,}}\) | \({{\,\mathrm{NKLD}\,}}\) | \({{\,\mathrm{PD}\,}}\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(p'\) | 0.20 | 0.80 | 0.25 | 0.75 | 0.0500 | 0.0625 | 0.1562 | 0.0625 | 0.0025 | 0.1312 | 0.0070 | 0.0035 | 0.0117 |
| \(p''\) | 0.20 | 0.80 | 0.15 | 0.85 | 0.0500 | 0.0625 | 0.1562 | 0.0625 | 0.0025 | 0.1544 | 0.0090 | 0.0045 | 0.0181 |

and characterized by a codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\), a true distribution *p* (Columns 2 and 3), and two predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\) (Columns 4 and 5, Rows 2 and 3) which are such that (1) \({\hat{p}}'\) overestimates and \({\hat{p}}''\) underestimates the prevalence of a class \(c_{1}\) by a certain amount \(a>0\) (here: 0.05), and, symmetrically, (2) \({\hat{p}}'\) underestimates and \({\hat{p}}''\) overestimates the prevalence of another class \(c_{2}\) by the same amount *a*. If the values of \(D(p,{\hat{p}}')\) and \(D(p,{\hat{p}}'')\) are not the same, this implies that *D* does not satisfy IMP (if they are the same, this does *not* necessarily mean that *D* satisfies IMP).

The table shows that none of \({{\,\mathrm{DR}\,}}\), \({{\,\mathrm{KLD}\,}}\), \({{\,\mathrm{NKLD}\,}}\), \({{\,\mathrm{PD}\,}}\) satisfies IMP.
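The contrast between the symmetric behaviour of \({{\,\mathrm{AE}\,}}\) and the asymmetric behaviour of \({{\,\mathrm{KLD}\,}}\) on this scenario can be reproduced directly; a minimal sketch (smoothing is omitted since all the prevalences involved are strictly positive, which leaves the first four decimals of the table unchanged):

```python
import math

def ae(p, ph):
    # AE(p, p_hat) = (1 / |C|) * sum over c of |p_hat(c) - p(c)|
    return sum(abs(h - t) for t, h in zip(p, ph)) / len(p)

def kld(p, ph):
    return sum(t * math.log(t / h) for t, h in zip(p, ph))

p = [0.20, 0.80]
over  = [0.25, 0.75]   # overestimates c1 by a = 0.05
under = [0.15, 0.85]   # underestimates c1 by the same a = 0.05

# AE penalizes overestimation and underestimation identically (both equal a)...
assert abs(ae(p, over) - ae(p, under)) < 1e-12
# ...while KLD does not, which is why KLD cannot satisfy IMP
print(round(kld(p, over), 4), round(kld(p, under), 4))   # 0.007 0.009
```

A single such counterexample suffices: IMP requires equality in every scenario, so one asymmetric pair rules the property out.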

### Appendix 2.3: A counterexample for **REL**

In the test for REL we consider the scenario described in the following table

|  | \(p(c_{1})\) | \(p(c_{2})\) | \({\hat{p}}(c_{1})\) | \({\hat{p}}(c_{2})\) | \({{\,\mathrm{AE}\,}}\) | \({{\,\mathrm{NAE}\,}}\) | \({{\,\mathrm{RAE}\,}}\) | \({{\,\mathrm{NRAE}\,}}\) | \({{\,\mathrm{SE}\,}}\) | \({{\,\mathrm{DR}\,}}\) | \({{\,\mathrm{KLD}\,}}\) | \({{\,\mathrm{NKLD}\,}}\) | \({{\,\mathrm{PD}\,}}\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(p'\) | 0.20 | 0.80 | 0.70 | 0.30 | 0.5000 | 0.6250 | 1.5625 | 0.6250 | 0.2500 | 0.6696 | 0.5341 | 0.2609 | 0.7738 |
| \(p''\) | 0.25 | 0.75 | 0.75 | 0.25 | 0.5000 | 0.6667 | 1.3333 | 0.6667 | 0.2500 | 0.6667 | 0.5493 | 0.2679 | 0.8333 |

with a codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\), two true distributions \(p'\) and \(p''\) (Rows 2 and 3, Columns 2 and 3), and two corresponding predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\) (Rows 2 and 3, Columns 4 and 5), such that in both cases the predicted distribution overestimates the prevalence of \(c_{1}\) by the same amount \(a>0\) (here: 0.50), with \(p'(c_{1})<p''(c_{1})\). Here, if it is not the case that \(D(p',{\hat{p}}')>D(p'',{\hat{p}}'')\), then *D* does not satisfy REL (if \(D(p',{\hat{p}}')>D(p'',{\hat{p}}'')\), this does *not* necessarily mean that *D* satisfies REL).

The table shows that none of \({{\,\mathrm{AE}\,}}\), \({{\,\mathrm{NAE}\,}}\), \({{\,\mathrm{NRAE}\,}}\), \({{\,\mathrm{SE}\,}}\), \({{\,\mathrm{KLD}\,}}\), \({{\,\mathrm{NKLD}\,}}\), \({{\,\mathrm{PD}\,}}\) satisfies REL.
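The failing cases can be double-checked numerically. The following sketch (ours, not from the paper) recomputes a few of the tabulated values using the definitions of \({{\,\mathrm{AE}\,}}\), \({{\,\mathrm{RAE}\,}}\), and \({{\,\mathrm{KLD}\,}}\) standard in the quantification literature (natural logarithms):

```python
import math

def ae(p, p_hat):
    # Absolute Error: mean absolute difference in class prevalences
    return sum(abs(ph - pt) for pt, ph in zip(p, p_hat)) / len(p)

def rae(p, p_hat):
    # Relative Absolute Error: absolute differences scaled by true prevalence
    return sum(abs(ph - pt) / pt for pt, ph in zip(p, p_hat)) / len(p)

def kld(p, p_hat):
    # Kullback-Leibler Divergence (natural log)
    return sum(pt * math.log(pt / ph) for pt, ph in zip(p, p_hat))

# Row p' : true (0.20, 0.80), predicted (0.70, 0.30)
# Row p'': true (0.25, 0.75), predicted (0.75, 0.25)
p1, ph1 = (0.20, 0.80), (0.70, 0.30)
p2, ph2 = (0.25, 0.75), (0.75, 0.25)

# REL requires the score for p' (lower true prevalence of c1) to be strictly larger
print(ae(p1, ph1), ae(p2, ph2))    # 0.5 == 0.5        -> AE fails REL
print(kld(p1, ph1), kld(p2, ph2))  # ~0.5341 < ~0.5493 -> KLD fails REL
print(rae(p1, ph1), rae(p2, ph2))  # 1.5625 > 1.3333   -> consistent with REL here
```

Note that \({{\,\mathrm{RAE}\,}}\) passes this particular test, which is why it does not appear among the failing measures.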

### Appendix 2.4: A counterexample for **ABS**

In the test for ABS we consider the same scenario as described in “Appendix 2.3” section, i.e.,

|  | \(p(c_{1})\) | \(p(c_{2})\) | \({\hat{p}}(c_{1})\) | \({\hat{p}}(c_{2})\) | \({{\,\mathrm{AE}\,}}\) | \({{\,\mathrm{NAE}\,}}\) | \({{\,\mathrm{RAE}\,}}\) | \({{\,\mathrm{NRAE}\,}}\) | \({{\,\mathrm{SE}\,}}\) | \({{\,\mathrm{DR}\,}}\) | \({{\,\mathrm{KLD}\,}}\) | \({{\,\mathrm{NKLD}\,}}\) | \({{\,\mathrm{PD}\,}}\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(p'\) | 0.20 | 0.80 | 0.70 | 0.30 | 0.5000 | *0.6250* | *1.5625* | *0.6250* | 0.2500 | *0.6696* | *0.5341* | *0.2609* | *0.7738* |
| \(p''\) | 0.25 | 0.75 | 0.75 | 0.25 | 0.5000 | *0.6667* | *1.3333* | *0.6667* | 0.2500 | *0.6667* | *0.5493* | *0.2679* | *0.8333* |

with a codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\), two true distributions \(p'\) and \(p''\) (Rows 2 and 3, Columns 2 and 3), and two corresponding predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\) (Rows 2 and 3, Columns 4 and 5), such that in both cases the predicted distribution overestimates the prevalence of \(c_{1}\) by the same amount \(a>0\) (here: 0.50), with \(p'(c_{1})<p''(c_{1})\). Here, if the values of \(D(p',{\hat{p}}')\) and \(D(p'',{\hat{p}}'')\) are not equal (which in the table is indicated by italic values), this implies that *D* does not satisfy ABS (if \(D(p',{\hat{p}}')=D(p'',{\hat{p}}'')\), this does *not* necessarily mean that *D* satisfies ABS).

The table shows that none of \({{\,\mathrm{NAE}\,}}\), \({{\,\mathrm{RAE}\,}}\), \({{\,\mathrm{NRAE}\,}}\), \({{\,\mathrm{DR}\,}}\), \({{\,\mathrm{KLD}\,}}\), \({{\,\mathrm{NKLD}\,}}\), \({{\,\mathrm{PD}\,}}\) satisfies ABS.
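For ABS the test is one of equality: the same absolute estimation error (here 0.50 on \(c_{1}\)) should yield the same score regardless of the true prevalences. The sketch below (ours) illustrates this on the two rows of the table, using the standard definitions of \({{\,\mathrm{SE}\,}}\), \({{\,\mathrm{RAE}\,}}\), and \({{\,\mathrm{NKLD}\,}}\) as a logistic rescaling of \({{\,\mathrm{KLD}\,}}\):

```python
import math

def se(p, p_hat):
    # Squared Error: mean squared difference in class prevalences
    return sum((ph - pt) ** 2 for pt, ph in zip(p, p_hat)) / len(p)

def rae(p, p_hat):
    # Relative Absolute Error: absolute differences scaled by true prevalence
    return sum(abs(ph - pt) / pt for pt, ph in zip(p, p_hat)) / len(p)

def nkld(p, p_hat):
    # Normalized KLD: logistic rescaling of KLD, 2*e^KLD/(e^KLD + 1) - 1
    k = sum(pt * math.log(pt / ph) for pt, ph in zip(p, p_hat))
    return 2 * math.exp(k) / (math.exp(k) + 1) - 1

p1, ph1 = (0.20, 0.80), (0.70, 0.30)
p2, ph2 = (0.25, 0.75), (0.75, 0.25)

print(se(p1, ph1), se(p2, ph2))      # 0.25 == 0.25          -> consistent with ABS here
print(rae(p1, ph1), rae(p2, ph2))    # 1.5625 != 1.3333      -> RAE fails ABS
print(nkld(p1, ph1), nkld(p2, ph2))  # ~0.2609 != ~0.2679    -> NKLD fails ABS
```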

### Appendix 3: Proving that **MON** holds

In this section we prove that MON holds for \({{\,\mathrm{KLD}\,}}\) and \({{\,\mathrm{PD}\,}}\). For this it will be sufficient to prove that \({{\,\mathrm{KLD}\,}}\) and \({{\,\mathrm{PD}\,}}\) enjoy B-MON, since it is immediate to verify that \({{\,\mathrm{KLD}\,}}\) and \({{\,\mathrm{PD}\,}}\) enjoy IND.

For ease of exposition, let us define the shorthands \(a\equiv p(c_{1})\) and \(x\equiv {\hat{p}}(c_{1})\).^{Footnote 20} In order to show that *D* satisfies B-MON it is sufficient to show that

- 1.
if \((a-x)>0\), then \(\dfrac{\partial }{\partial (a-x)} D>0\) for \(a,x,(a-x)\in (0,1)\)

- 2.
if \((x-a)>0\), then \(\dfrac{\partial }{\partial (x-a)} D>0\) for \(a,x,(x-a)\in (0,1)\)

because, in the binary case, an increase in \((a-x)=(p(c_{1})-{\hat{p}}(c_{1}))\) implies an equal increase in \(({\hat{p}}(c_{2})-p(c_{2}))\) (and similarly for \((x-a)\)).

### Theorem 1

\({{\,\mathrm{KLD}\,}}\) *satisfies B-MON*.

### Proof

We first treat the case \((a-x)>0\); let us define \(y\equiv (a-x)\), so that \(x=a-y\). In this case

$$\begin{aligned} \frac{\partial }{\partial y}{{\,\mathrm{KLD}\,}}&= \frac{\partial }{\partial y}\left( a\log \frac{a}{a-y}+(1-a)\log \frac{1-a}{1-a+y}\right) \\ &= \frac{a}{a-y}-\frac{1-a}{1-a+y}=\frac{a}{x}-\frac{1-a}{1-x}=\frac{x-a}{(x-1)x} \end{aligned}$$

Since we are in the case in which \((x-a)<0\), and since \((x-1)<0\) and \(x>0\), then \(\dfrac{x-a}{(x-1)x}>0\) for all \(a,x,(a-x)\in (0,1)\).

Let us now treat the case \((x-a)>0\), and let us define \(y\equiv (x-a)\), so that \(x=a+y\). In this case

$$\begin{aligned} \frac{\partial }{\partial y}{{\,\mathrm{KLD}\,}}&= \frac{\partial }{\partial y}\left( a\log \frac{a}{a+y}+(1-a)\log \frac{1-a}{1-a-y}\right) \\ &= -\frac{a}{a+y}+\frac{1-a}{1-a-y}=-\frac{a}{x}+\frac{1-a}{1-x}=\frac{a-x}{(x-1)x} \end{aligned}$$

Since in this case it holds that \((a-x)<0\), and since \((x-1)<0\) and \(x>0\), then \(\dfrac{a-x}{(x-1)x}>0\) for all \(a,x,(x-a)\in (0,1)\). This concludes our proof. \(\square \)
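The sign and value of the closed-form derivative can be sanity-checked against a finite-difference approximation of binary \({{\,\mathrm{KLD}\,}}\). A minimal sketch (ours), noting that for \(a>x\) the expression \((x-a)/((x-1)x)\) from the proof equals \(|a-x|/(x(1-x))\):

```python
import math

def kld_binary(a, x):
    # Binary KLD between true prevalence a and predicted prevalence x
    return a * math.log(a / x) + (1 - a) * math.log((1 - a) / (1 - x))

def dkld_dy(a, x):
    # Closed-form derivative of KLD w.r.t. the error y = |a - x|;
    # in both cases of the proof this equals |a - x| / (x (1 - x)) > 0
    return abs(a - x) / (x * (1 - x))

a, x, h = 0.4, 0.3, 1e-6  # case a > x: increasing y = a - x means decreasing x
numeric = (kld_binary(a, x - h) - kld_binary(a, x)) / h
print(numeric, dkld_dy(a, x))  # both ~ 0.476
```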

### Theorem 2

\({{\,\mathrm{PD}\,}}\) *satisfies B-MON*.

### Proof

We first treat the case \((a-x)>0\); let us define \(y\equiv (a-x)\), so that \(x=a-y\). In this case

$$\begin{aligned} \frac{\partial }{\partial y}{{\,\mathrm{PD}\,}}&= \frac{\partial }{\partial y}\left( \frac{y^{2}}{(a-y)(1-a+y)}\right) \\ &= \frac{2y(a-y)(1-a+y)-y^{2}(2(a-y)-1)}{(a-y)^{2}(1-a+y)^{2}}=\frac{(a-x)(a-2ax+x)}{x^{2}(1-x)^{2}} \end{aligned}$$

Since in this case it holds that \(a>x\), it is true that \((a-2ax+x)>(x-2ax+x)=2x(1-a)>0\), since by hypothesis it holds that \(x,a\in (0,1)\). Therefore, \(\dfrac{\partial }{\partial y} {{\,\mathrm{PD}\,}}=\dfrac{(a-x)(a-2ax+x)}{x^{2}(1-x)^{2}}>0\), since the two factors in the numerator and the two factors in the denominator are all strictly \(>0\).

Let us now treat the case \((x-a)>0\), and let us define \(y\equiv (x-a)\), so that \(x=a+y\). In this case

$$\begin{aligned} \frac{\partial }{\partial y}{{\,\mathrm{PD}\,}}&= \frac{\partial }{\partial y}\left( \frac{y^{2}}{(a+y)(1-a-y)}\right) \\ &= \frac{2y(a+y)(1-a-y)-y^{2}(1-2(a+y))}{(a+y)^{2}(1-a-y)^{2}}=\frac{(x-a)(-2ax+x+a)}{x^{2}(1-x)^{2}} \end{aligned}$$

Since in this case it holds that \(x>a\), it is true that \((-2ax+x+a)>(-2ax+2a)=2a(1-x)>0\), since by hypothesis it holds that \(x,a\in (0,1)\). Therefore, \(\dfrac{\partial {{\,\mathrm{PD}\,}}}{\partial y}=\dfrac{(x-a)(-2ax+x+a)}{x^{2}(1-x)^{2}}>0\), since the two factors in the numerator and the two factors in the denominator are all strictly \(>0\). This concludes our proof. \(\square \)
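As for \({{\,\mathrm{KLD}\,}}\), the derivative can be sanity-checked numerically. The sketch below (ours) assumes the binary form \({{\,\mathrm{PD}\,}}=(a-x)^{2}/(x(1-x))\), which is the two-class collapse of \(\sum _{c}(p(c)-{\hat{p}}(c))^{2}/{\hat{p}}(c)\) and is consistent with the derivative \((a-x)(a-2ax+x)/(x^{2}(1-x)^{2})\) used in the proof:

```python
def pd_binary(a, x):
    # Binary Pearson Divergence: (a-x)^2/x + (x-a)^2/(1-x) = (a-x)^2 / (x(1-x))
    return (a - x) ** 2 / (x * (1 - x))

def dpd_dy(a, x):
    # Closed-form derivative w.r.t. y = a - x from the proof (case a > x)
    return (a - x) * (a - 2 * a * x + x) / (x ** 2 * (1 - x) ** 2)

a, x, h = 0.4, 0.3, 1e-6  # increasing y = a - x means decreasing x
numeric = (pd_binary(a, x - h) - pd_binary(a, x)) / h
print(numeric, dpd_dy(a, x))  # both ~ 1.043
```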


## About this article

### Cite this article

Sebastiani, F. Evaluation measures for quantification: an axiomatic approach.
*Inf Retrieval J* **23**, 255–288 (2020). https://doi.org/10.1007/s10791-019-09363-y


### Keywords

- Quantification
- Supervised prevalence estimation
- Supervised learning
- Evaluation measures