Abstract
Learning to quantify (a.k.a. quantification) is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that “Classify and Count” (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantification accuracy. Following this observation, several methods for learning to quantify have been proposed and have been shown to outperform CC. In this work we contend that previous works have failed to use properly optimised versions of CC. We thus reassess the real merits of CC and its variants, and argue that, while still inferior to some cutting-edge methods, they deliver near-state-of-the-art accuracy once (a) hyperparameter optimisation is performed, and (b) this optimisation is performed by using a truly quantification-oriented evaluation protocol. Experiments on three publicly available binary sentiment classification datasets support these conclusions.
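To make the abstract's terminology concrete, here is a minimal sketch (not the authors' implementation; names are illustrative) of "Classify and Count" and its adjusted variant ACC, which corrects the CC estimate using the classifier's true-positive and false-positive rates as estimated on held-out validation data:

```python
def classify_and_count(predictions):
    """CC: estimate positive-class prevalence as the fraction of
    documents that the classifier labels as positive.
    `predictions` is a list of 0/1 labels issued by the classifier."""
    return sum(predictions) / len(predictions)

def adjusted_classify_and_count(predictions, tpr, fpr):
    """ACC: correct the CC estimate via p = (cc - fpr) / (tpr - fpr),
    where tpr and fpr are estimated on held-out data, clipping the
    result to the legal [0, 1] range."""
    cc = classify_and_count(predictions)
    if tpr == fpr:  # correction undefined; fall back to plain CC
        return cc
    return min(1.0, max(0.0, (cc - fpr) / (tpr - fpr)))
```

The clipping step is needed because the linear correction can fall outside [0, 1] when the TPR/FPR estimates are noisy.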
Notes
- 2. Consistent with most mathematical literature, we use the caret symbol ( \(\hat{}\) ) to indicate estimation.
- 3. Note that this is similar to what we do, say, in classification, where the different hyperparameter values are tested on many validation documents; here we test these hyperparameter values on many validation samples, since the objects of study of text quantification are document samples, just as the objects of study of text classification are individual documents.
- 4. Note that we do not retrain the classifier on the entire \(L_{\mathrm {Tr}}\). While this might seem beneficial, since \(L_{\mathrm {Tr}}\) contains more training data than \(L_{\mathrm {Tr}}^{\mathrm {Tr}}\), we need to consider that the estimates \(\hat{\mathrm {TPR}}_{h}\) and \(\hat{\mathrm {FPR}}_{h}\) have been computed for the classifier trained on \(L_{\mathrm {Tr}}^{\mathrm {Tr}}\), and not for one trained on \(L_{\mathrm {Tr}}\).
- 5. The three datasets are available at https://doi.org/10.5281/zenodo.4117827 in pre-processed form. The raw versions of the HP and Kindle datasets can be accessed from http://hlt.isti.cnr.it/quantification/, while the raw version of IMDB can be found at https://ai.stanford.edu/~amaas/data/sentiment/.
- 8. When the depth is set to “max”, nodes are expanded until all leaves belong to the same class.
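Note 3's protocol of testing hyperparameter values on many validation *samples* can be sketched as follows, under the assumption (illustrative, not the paper's exact setup) that validation samples are drawn at controlled prevalence values and scored by mean absolute error between true and estimated prevalence:

```python
import random

def prevalence_samples(pos_docs, neg_docs, sample_size, prevalences):
    """Draw one validation sample per target prevalence by sampling
    (with replacement) the required numbers of positive and negative
    documents. Yields (sample, true_prevalence) pairs."""
    for p in prevalences:
        n_pos = round(sample_size * p)
        n_neg = sample_size - n_pos
        sample = random.choices(pos_docs, k=n_pos) + random.choices(neg_docs, k=n_neg)
        yield sample, n_pos / sample_size

def mean_absolute_error(quantifier, samples):
    """Average |true - estimated| prevalence over the validation
    samples; a quantification-oriented score by which competing
    hyperparameter configurations can be compared."""
    errors = [abs(true_p - quantifier(sample)) for sample, true_p in samples]
    return errors and sum(errors) / len(errors)
```

Each candidate hyperparameter configuration would be evaluated by running its trained quantifier over the same set of prevalence-varied samples and keeping the configuration with the lowest error.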
Acknowledgments
The present work has been supported by the SoBigData++ project, funded by the European Commission (Grant 871042) under the H2020 Programme INFRAIA-2019-1, and by the AI4Media project, funded by the European Commission (Grant 951911) under the H2020 Programme ICT-48-2020. The authors’ opinions do not necessarily reflect those of the European Commission.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Moreo, A., Sebastiani, F. (2021). Re-assessing the “Classify and Count” Quantification Method. In: Hiemstra, D., Moens, M.F., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science, vol 12657. Springer, Cham. https://doi.org/10.1007/978-3-030-72240-1_6
DOI: https://doi.org/10.1007/978-3-030-72240-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72239-5
Online ISBN: 978-3-030-72240-1