Abstract
Several machine learning applications use classifiers to quantify the prevalence of positive class labels in a target dataset, a task named quantification. For instance, a naive way of determining what proportion of people like a given product when no labeled reviews are available is to (i) train a classifier on Google Shopping reviews to predict whether a user likes a product given its review, and then (ii) apply this classifier to Facebook/Google+ posts about that product. It is well known that this two-step approach, named Classify and Count, fails because of dataset shift, and several improvements have recently been proposed under an assumption named prior shift. Unfortunately, these methods exploit the relationship between the covariates and the response only via classifiers. Moreover, the literature lacks the theoretical foundation needed to improve these techniques. We propose a new family of estimators, named the Ratio Estimator, which can exploit the relationship between the covariates and the response using any function \( g: \mathscr{X} \rightarrow \mathbb{R} \), not only classifiers. We show that for some choices of \( g \), our estimator matches standard estimators used in the literature. We also explore alternative ways of constructing functions \( g \) that lead to estimators with good performance, and compare them using real datasets. Finally, we provide a theoretical analysis of the method.
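The abstract does not state the estimator's formula, so the following is only a minimal sketch of the idea, assuming the moment-matching form that prior shift implies: if the distribution of X given Y is the same in both datasets, then the target mean of g(X) equals theta times the mean of g among positives plus (1 - theta) times the mean of g among negatives, which can be solved for the prevalence theta. The function names, the Gaussian simulation, and the clipping to [0, 1] are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def classify_and_count(g_target):
    """Naive estimate: the mean of the 0/1 classifier outputs on the target set."""
    return float(np.mean(g_target))


def ratio_estimator(g_labeled, y_labeled, g_target):
    """Moment-matching prevalence estimate (illustrative form, an assumption).

    Solves mean_target(g) = theta * mu1 + (1 - theta) * mu0 for theta, where
    mu1 and mu0 are the labeled-sample means of g among positives and negatives.
    """
    g_labeled = np.asarray(g_labeled, dtype=float)
    y_labeled = np.asarray(y_labeled)
    mu1 = g_labeled[y_labeled == 1].mean()
    mu0 = g_labeled[y_labeled == 0].mean()
    if np.isclose(mu1, mu0):
        raise ValueError("g does not separate the classes; estimate is undefined")
    theta = (np.mean(g_target) - mu0) / (mu1 - mu0)
    # Clipping is a convenience added here so the result is a valid proportion.
    return float(np.clip(theta, 0.0, 1.0))


# Hypothetical simulation: X | Y=1 ~ N(1, 1) and X | Y=0 ~ N(-1, 1) in both
# datasets (prior shift), with 50% positives in the labeled data and 20% in
# the target data.
rng = np.random.default_rng(0)
n = 20_000
y_lab = rng.binomial(1, 0.5, size=n)
x_lab = rng.normal(2 * y_lab - 1, 1.0)
y_tgt = rng.binomial(1, 0.2, size=n)  # unobserved in practice
x_tgt = rng.normal(2 * y_tgt - 1, 1.0)


def hard_classifier(x):
    """A simple threshold classifier, used as one choice of g."""
    return (x > 0).astype(float)


cc = classify_and_count(hard_classifier(x_tgt))
re_clf = ratio_estimator(hard_classifier(x_lab), y_lab, hard_classifier(x_tgt))
# g need not be a classifier: here g is the identity function.
re_id = ratio_estimator(x_lab, y_lab, x_tgt)
print(f"classify and count: {cc:.3f}")
print(f"ratio estimator, g = classifier: {re_clf:.3f}")
print(f"ratio estimator, g = identity: {re_id:.3f}")
```

In this simulated setting, Classify and Count is pulled toward the training prevalence by the classifier's errors, whereas the ratio form corrects for them. With a 0/1 classifier as g, mu1 and mu0 are the estimated true and false positive rates.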
Acknowledgements
This work was partially supported by FAPESP grant 2017/03363-8 and CAPES.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Vaz, A., Izbicki, R., Stern, R.B. (2018). Prior Shift Using the Ratio Estimator. In: Polpo, A., Stern, J., Louzada, F., Izbicki, R., Takada, H. (eds) Bayesian Inference and Maximum Entropy Methods in Science and Engineering. maxent 2017. Springer Proceedings in Mathematics & Statistics, vol 239. Springer, Cham. https://doi.org/10.1007/978-3-319-91143-4_3
DOI: https://doi.org/10.1007/978-3-319-91143-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91142-7
Online ISBN: 978-3-319-91143-4
eBook Packages: Mathematics and Statistics (R0)