Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation

Published in: Annals of the Institute of Statistical Mathematics

Abstract

Estimation of the ratio of probability densities has attracted a great deal of attention, since it can be used to address a variety of statistical problems. A naive approach to density-ratio approximation is to estimate the numerator and denominator densities separately and then take their ratio. However, this two-step approach does not perform well in practice, and methods that directly estimate the density ratio without going through density estimation have been explored. In this paper, we first give a comprehensive review of existing density-ratio estimation methods and discuss their pros and cons. We then propose a new framework of density-ratio estimation in which a density-ratio model is fitted to the true density ratio under the Bregman divergence. Our new framework includes existing approaches as special cases and is substantially more general. Finally, we develop a robust density-ratio estimation method under the power divergence, which is a novel instance of our framework.
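The contrast drawn in the abstract between the naive two-step approach and direct ratio estimation can be sketched in a few lines of NumPy. The direct estimator below is an unconstrained least-squares fit of a linear ratio model (in the spirit of the uLSIF method of Kanamori et al. 2009, one of the special cases the paper reviews); the kernel centers, bandwidths, and regularization constant are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a numerator density p_nu = N(0, 1) and a denominator
# density p_de = N(0.5, 1); the true ratio is r(x) = p_nu(x) / p_de(x).
x_nu = rng.normal(0.0, 1.0, 300)
x_de = rng.normal(0.5, 1.0, 300)

def gauss_kernel(x, centers, sigma=1.0):
    """Gaussian kernel matrix between points x and kernel centers."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

# --- Naive two-step approach: estimate each density by kernel density
# estimation, then take the plug-in ratio (unstable where the
# denominator estimate is small).
def kde(x_eval, samples, sigma=0.3):
    return gauss_kernel(x_eval, samples, sigma).mean(axis=1) / (
        np.sqrt(2 * np.pi) * sigma
    )

x_test = np.linspace(-2.0, 2.0, 5)
r_two_step = kde(x_test, x_nu) / kde(x_test, x_de)

# --- Direct approach: fit a linear model r(x) = sum_l alpha_l k(x, c_l)
# to the ratio itself by regularized least squares, using only the
# two samples (no intermediate density estimates).
centers = x_nu[:50]
Phi_de = gauss_kernel(x_de, centers)  # model evaluated on denominator samples
Phi_nu = gauss_kernel(x_nu, centers)  # model evaluated on numerator samples
H = Phi_de.T @ Phi_de / len(x_de)
h = Phi_nu.mean(axis=0)
lam = 0.1  # illustrative regularization constant
alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
r_direct = gauss_kernel(x_test, centers) @ alpha

# Analytic ratio N(0,1)/N(0.5,1) = exp(1/8 - x/2), for reference.
r_true = np.exp(0.125 - 0.5 * x_test)
```

Both estimators are usable on this toy problem, but only the direct one generalizes to the unified Bregman-divergence view developed in the paper, where the least-squares fit corresponds to one particular choice of the divergence.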



Author information

Correspondence to Masashi Sugiyama.

About this article

Cite this article

Sugiyama, M., Suzuki, T. & Kanamori, T. Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Ann Inst Stat Math 64, 1009–1044 (2012). https://doi.org/10.1007/s10463-011-0343-8
