Aggregative quantification for regression

Published in Data Mining and Knowledge Discovery

Abstract

The problem of estimating the class distribution (or prevalence) of a new unlabelled dataset (drawn from a possibly different distribution) is a common problem that has been addressed in various ways over the past decades. This problem has recently been reconsidered as a new task in data mining, renamed quantification, when the estimation is performed as an aggregation (and possible adjustment) of the outputs of a single-instance supervised model (e.g., a classifier). However, the study of quantification has been limited to classification, while it is clear that this problem also appears, perhaps even more frequently, with other predictive problems, such as regression. In this case, the goal is to determine a distribution or an aggregated indicator of the output variable for a new unlabelled dataset. In this paper, we introduce a comprehensive new taxonomy of quantification tasks, distinguishing between the estimation of the whole distribution and the estimation of some indicators (summary statistics), for both classification and regression. This distinction is especially useful for regression, since predictions are numerical values that can be aggregated in many different ways, as in multi-dimensional hierarchical data warehouses. We focus on aggregative quantification for regression and show that the approaches borrowed from classification do not work. We present several techniques based on segmentation which are able to produce accurate estimations of the expected value and the distribution of the output variable. We show experimentally that these methods excel especially in the relevant scenarios where training and test distributions differ dramatically.
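
The "aggregation of a single-instance supervised model" described above can be sketched in a few lines. This is an illustrative toy example (a hand-rolled 1-D least-squares regressor on synthetic data, with all names ours), not the paper's experimental setup:

```python
import random
import statistics

def fit_linear_1d(xs, ys):
    """Ordinary least squares for a single feature: y ~ a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

random.seed(0)
# Training data: y = 2x + 1 plus noise, inputs uniform on [0, 10].
train_x = [random.uniform(0, 10) for _ in range(200)]
train_y = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in train_x]

# Unlabelled test data drawn from a shifted input distribution.
test_x = [random.uniform(5, 15) for _ in range(200)]

model = fit_linear_1d(train_x, train_y)

# Aggregative quantification for an indicator: apply the single-instance
# model to every test instance and aggregate the predictions into a
# dataset-level estimate (here, the expected value of the output).
estimated_mean = statistics.mean(model(x) for x in test_x)
print(round(estimated_mean, 2))
```

With the shifted test inputs centred near 10, the aggregated estimate lands near 2(10) + 1 = 21, even though the training inputs were centred near 5.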


Figures 1–7 (available in the full article)

Notes

  1. We use the same acronym for Regress & Splice and Regress & Sum, since both just aggregate the individual values without any further processing.

  2. The example is elaborated, with some fictional elements, from the cars dataset in the UCI repository (Frank and Asuncion 2010).

  3. The example is elaborated, with some fictional elements, from the lowbwt dataset in the UCI repository (Frank and Asuncion 2010), originally from Hosmer and Lemeshow (2000).

  4. In this paper the methodology for indicators and for the distribution is the same (except for some minor specific techniques, mostly at the end of the process), but they could differ, given that some indicators require less information and effort than the whole distribution.

  5. Just as a mean-unbiased estimator minimises the squared loss, a median-unbiased estimator is an alternative choice that minimises the absolute error.
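
This footnote's claim can be checked numerically: over a grid of candidate constants, the total squared loss is minimised at the sample mean, while the total absolute loss is minimised at the sample median. A small self-contained check on an arbitrary skewed sample of ours:

```python
import statistics

values = [1.0, 2.0, 2.5, 3.0, 10.0]  # skewed sample

def squared_loss(c):
    return sum((v - c) ** 2 for v in values)

def absolute_loss(c):
    return sum(abs(v - c) for v in values)

# Brute-force search over a fine grid of candidate constants.
candidates = [i / 100 for i in range(0, 1101)]
best_sq = min(candidates, key=squared_loss)
best_abs = min(candidates, key=absolute_loss)

print(best_sq, statistics.mean(values))    # both 3.7
print(best_abs, statistics.median(values)) # both 2.5
```

The outlier at 10.0 pulls the squared-loss minimiser (the mean) upward, while the absolute-loss minimiser (the median) stays at the central value, which is why the two notions of unbiasedness lead to different estimators.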

  6. The idea of segmenting the set of outputs is not new and has led to some classifier calibration techniques, such as binning (Zadrozny and Elkan 2002; Bella et al. 2009b). Calibration techniques are somewhat related to quantification techniques. In fact, \(RS\) would be optimal if the predictive model were perfectly calibrated—for the test set. This is a key point because calibration is always understood relative to a distribution or dataset. Given the quantification problems with distribution shift we are considering here, it is the test set distribution that we want to infer, so calibrating for the training set may be useless.
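
A rough sketch of the binning calibration this note mentions: the score range is segmented into equal-width bins and each score is replaced by its bin's empirical positive rate. The function names and toy data below are ours, and the cited methods differ in details:

```python
def binning_calibrate(scores, labels, n_bins=5):
    """Equal-width binning calibration for scores in [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # clamp s == 1.0 into last bin
        bins[idx].append(y)
    # Each bin's calibrated probability is its empirical positive rate.
    rates = [sum(b) / len(b) if b else None for b in bins]
    def calibrated(s):
        idx = min(int(s * n_bins), n_bins - 1)
        return rates[idx]
    return calibrated

scores = [0.05, 0.15, 0.35, 0.45, 0.62, 0.71, 0.88, 0.93]
labels = [0,    0,    0,    1,    1,    1,    1,    1]
cal = binning_calibrate(scores, labels, n_bins=4)
print(cal(0.1), cal(0.9))  # 0.0 1.0
```

Note that the calibration map depends entirely on which dataset supplied the (score, label) pairs, which is exactly the footnote's point: a map fitted on training data need not be calibrated for a shifted test distribution.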

  7. Note that this adjustment is performed with information from the training data exclusively. An alternative possibility would be to use a validation dataset, but this would reduce the available training data.

  8. An alternative, more lightweight approach would be to add normally distributed jitter to each prediction \(\hat{y}\). While this may have a similar effect, its randomness may be problematic for small datasets. The smoothing approach presented here always yields the same result, since it has no random components.
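
A deterministic smoothing of the kind this note prefers over random jitter can be sketched as placing a small Gaussian kernel on every prediction and summing the kernels; two evaluations always agree, unlike jittered predictions. This is an illustrative sketch under our own choice of kernel and bandwidth, not necessarily the paper's exact procedure:

```python
import math

def smoothed_density(predictions, bandwidth=0.5):
    """Deterministic kernel-smoothed density over a list of predictions."""
    n = len(predictions)
    def density(y):
        # Sum of Gaussian kernels centred on each prediction.
        return sum(
            math.exp(-0.5 * ((y - p) / bandwidth) ** 2)
            / (bandwidth * math.sqrt(2 * math.pi))
            for p in predictions
        ) / n
    return density

preds = [1.0, 1.2, 3.5, 3.6, 3.8]
f = smoothed_density(preds)

# No random components: repeated evaluations are identical, and the
# density is higher near the cluster of predictions around 3.5-3.8.
print(f(3.5) == f(3.5), f(3.5) > f(2.3))
```

Jitter would approximate the same distribution only in expectation; on a small dataset a single random draw can distort it, which is the trade-off the footnote describes.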

  9. Some alternatives could be considered here, such as the use of one-vs-previous or one-vs-adjacent schemes. This is left as a possibility for future work.

  10. http://archive.ics.uci.edu/ml/.

  11. http://tunedit.org/repo.

  12. http://mldata.org/repository/data/.

References

  • Alonzo TA, Pepe MS, Lumley T (2003) Estimating disease prevalence in two-phase studies. Biostatistics 4(2):313–326

  • Anderson T (1962) On the distribution of the two-sample Cramer–von Mises criterion. Ann Math Stat 33(3):1148–1159

  • Bakar AA, Othman ZA, Shuib NLM (2009) Building a new taxonomy for data discretization techniques. In: Proceedings of 2nd conference on data mining and optimization (DMO’09), pp 132–140

  • Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2009a) Calibration of machine learning models. In: Handbook of research on machine learning applications. IGI Global, Hershey

  • Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2009b) Similarity-binning averaging: a generalisation of binning calibration. In: International conference on intelligent data engineering and automated learning. LNCS, vol 5788. Springer, Berlin, pp 341–349

  • Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2010) Quantification via probability estimators. In: International conference on data mining, ICDM2010, pp 737–742

  • Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2012) On the effect of calibration in classifier combination. Appl Intell. doi:10.1007/s10489-012-0388-2

  • Chan Y, Ng H (2006) Estimating class priors in domain adaptation for word sense disambiguation. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp 89–96

  • Chawla N, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor Newsl 6(1):1–6

  • Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  • Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Prieditis A, Russell S (eds) Proceedings of the twelfth international conference on machine learning. Morgan Kaufmann, San Francisco, pp 194–202

  • Ferri C, Hernández-Orallo J, Modroiu R (2009) An experimental comparison of performance measures for classification. Pattern Recogn Lett 30(1):27–38

  • Flach P (2012) Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press, Cambridge

  • Forman G (2005) Counting positives accurately despite inaccurate classification. In: Proceedings of the 16th European conference on machine learning (ECML), pp 564–575

  • Forman G (2006) Quantifying trends accurately despite classifier error and class imbalance. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 157–166

  • Forman G (2008) Quantifying counts and costs via classification. Data Min Knowl Discov 17(2):164–206

  • Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • González-Castro V, Alaiz-Rodríguez R, Alegre E (2012) Class distribution estimation based on the Hellinger distance. Inf Sci 218(1):146–164

  • Hastie TJ, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, Berlin

  • Hernández-Orallo J, Flach P, Ferri C (2012) A unified view of performance metrics: translating threshold choice into expected classification loss. J Mach Learn Res (JMLR) 13:2813–2869

  • Hodges J, Lehmann E (1963) Estimates of location based on rank tests. Ann Math Stat 34(5):598–611

  • Hosmer DW, Lemeshow S (2000) Applied logistic regression. Wiley, New York

  • Hwang JN, Lay SR, Lippman A (1994) Nonparametric multivariate density estimation: a comparative study. IEEE Trans Signal Process 42(10):2795–2810

  • Hyndman RJ, Bashtannyk DM, Grunwald GK (1996) Estimating and visualizing conditional densities. J Comput Graph Stat 5(4):315–336

  • Moreno-Torres J, Raeder T, Alaiz-Rodríguez R, Chawla N, Herrera F (2012) A unifying view on dataset shift in classification. Pattern Recogn 45(1):521–530

  • Neyman J (1938) Contribution to the theory of sampling human populations. J Am Stat Assoc 33(201):101–116

  • Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in large margin classifiers. MIT Press, Cambridge, pp 61–74

  • Raeder T, Forman G, Chawla N (2012) Learning from imbalanced data: evaluation matters. Data Min 23:315–331

  • Sánchez L, González V, Alegre E, Alaiz R (2008) Classification and quantification based on image analysis for sperm samples with uncertain damaged/intact cell proportions. In: Proceedings of the 5th international conference on image analysis and recognition. LNCS, vol 5112. Springer, Heidelberg, pp 827–836

  • Sturges H (1926) The choice of a class interval. J Am Stat Assoc 21(153):65–66

  • R Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna

  • Tenenbein A (1970) A double sampling scheme for estimating from binomial data with misclassifications. J Am Stat Assoc 65(331):1350–1361

  • Weiss G (2004) Mining with rarity: a unifying framework. ACM SIGKDD Explor Newsl 6(1):7–19

  • Weiss G, Provost F (2001) The effect of class distribution on classifier learning: an empirical study. Technical Report ML-TR-44

  • Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques with Java implementations. Elsevier, Amsterdam

  • Xiao Y, Gordon A, Yakovlev A (2006a) A C++ program for the Cramér–von Mises two-sample test. J Stat Softw 17:1–15

  • Xiao Y, Gordon A, Yakovlev A (2006b) The L1-version of the Cramér-von Mises test for two-sample comparisons in microarray data analysis. EURASIP J Bioinform Syst Biol 2006:85769

  • Xue J, Weiss G (2009) Quantification and semi-supervised classification methods for handling changes in class distribution. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 897–906

  • Yang Y (2003) Discretization for naive-bayes learning. PhD thesis, Monash University

  • Zadrozny B, Elkan C (2001) Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In: Proceedings of the 8th international conference on machine learning (ICML), pp 609–616

  • Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: The 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 694–699

Acknowledgments

We would like to thank the anonymous reviewers for their careful reviews, insightful comments and very useful suggestions. This work was supported by the MEC/MINECO projects CONSOLIDER-INGENIO CSD2007-00022 and TIN 2010-21062-C02-02, GVA project PROMETEO/2008/051, the COST—European Cooperation in the field of Scientific and Technical Research IC0801 AT, and the REFRAME project granted by the European Coordinated Research on Long-term Challenges in Information and Communication Sciences & Technologies ERA-Net (CHIST-ERA), and funded by the Ministerio de Economía y Competitividad in Spain.

Author information

Correspondence to Cèsar Ferri.

Additional information

Responsible editor: Johannes Fürnkranz.

Cite this article

Bella, A., Ferri, C., Hernández-Orallo, J. et al. Aggregative quantification for regression. Data Min Knowl Disc 28, 475–518 (2014). https://doi.org/10.1007/s10618-013-0308-z
