Skip to main content

Direct importance estimation for covariate shift adaptation

Abstract

A situation where training and test samples follow different input distributions is called covariate shift. Under covariate shift, standard learning methods such as maximum likelihood estimation are no longer consistent—weighted variants according to the ratio of test and training input densities are consistent. Therefore, accurately estimating the density ratio, called the importance, is one of the key issues in covariate shift adaptation. A naive approach to this task is to first estimate training and test input densities separately and then estimate the importance by taking the ratio of the estimated densities. However, this naive approach tends to perform poorly since density estimation is a hard task particularly in high dimensional cases. In this paper, we propose a direct importance estimation method that does not involve density estimation. Our method is equipped with a natural cross validation procedure and hence tuning parameters such as the kernel width can be objectively optimized. Furthermore, we give rigorous mathematical proofs for the convergence of the proposed algorithm. Simulations illustrate the usefulness of our approach.

This is a preview of subscription content, access via your institution.

References

  • Baldi P., Brunak S. (1998) Bioinformatics: The machine learning approach. MIT Press, Cambridge

    Google Scholar 

  • Bartlett P., Bousquet O., Mendelson S. (2005) Local Rademacher complexities. Annals of Statistics 33: 1487–1537

    Article  MathSciNet  Google Scholar 

  • Bickel S., Scheffer T. (2007) Dirichlet-enhanced spam filtering based on biased samples. In: Schölkopf B., Platt J., Hoffman T.(eds) Advances in neural information processing systems, Vol. 19. MIT Press, Cambridge

    Google Scholar 

  • Bickel, S., Brückner, M., Scheffer, T. (2007). Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on machine learning.

  • Borgwardt K.M., Gretton A., Rasch M.J., Kriegel H.-P., Schölkopf B., Smola A.J. (2006) Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14): e49–e57

    Article  Google Scholar 

  • Bousquet O. (2002) A Bennett concentration inequality and its application to suprema of empirical process. Comptes Rendus de l’Académie des Sciences. Série I. Mathématique, 334: 495–500

    MATH  MathSciNet  Google Scholar 

  • Boyd S., Vandenberghe L. (2004) Convex optimization. Cambridge University Press, Cambridge

    MATH  Google Scholar 

  • Fukumizu, K., Kuriki, T., Takeuchi, K., Akahira, M. (2004). Statistics of singular models. The Frontier of Statistical Science 7. Iwanami Syoten. in Japanese.

  • Ghosal S., van der Vaart A.W. (2001) Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Annals of Statistics 29: 1233–1263

    MATH  Article  MathSciNet  Google Scholar 

  • Hachiya, H., Akiyama, T., Sugiyama, M., Peters, J. (2008). Adaptive importance sampling with automatic model selection in value function approximation. In Proceedings of the twenty-third AAAI conference on artificial intelligence (AAAI-08), Chicago, USA.

  • Hall P. (1987) On Kullback-Leibler loss and density estimation. The Annals of Statistics 15(4): 1491–1519

    MATH  Article  MathSciNet  Google Scholar 

  • Härdle W., Müller M., Sperlich S., Werwatz A. (2004) Nonparametric and Semiparametric Models. Springer Series in Statistics. Springer, Berlin

    Google Scholar 

  • Heckman J.J. (1979) Sample selection bias as a specification error. Econometrica 47(1): 153–162

    MATH  Article  MathSciNet  Google Scholar 

  • Huang J., Smola A., Gretton A., Borgwardt K.M., Schölkopf B. (2007) Correcting sample selection bias by unlabeled data. In: Schölkopf B., Platt J., Hoffman T.(eds) Advances in Neural Information Processing Systems, Vol. 19. MIT Press, Cambridge, pp 601–608

    Google Scholar 

  • Koltchinskii V. (2006) Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics 34: 2593–2656

    MATH  Article  MathSciNet  Google Scholar 

  • Lin, C.-J., Weng, R. C., Keerthi, S. S. (2007). Trust region Newton method for large-scale logistic regression. Technical report, Department of Computer Science, National Taiwan University.

  • Mendelson S. (2002) Improving the sample complexity using global data. IEEE Transactions on Information Theory 48: 1977–1991

    MATH  Article  MathSciNet  Google Scholar 

  • Minka, T. P. (2007). A comparison of numerical optimizers for logistic regression. Technical report, Microsoft Research.

  • Nguyen, X., Wainwright, M. J., Jordan, M. I. (2007). Estimating divergence functions and the likelihood ratio by penalized convex risk minimization. In Advances in neural information processing systems, Vol. 20.

  • Rasmussen, C. E., Neal, R. M., Hinton, G. E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., Tibshirani, R. (1996). The DELVE manual.

  • Rätsch G., Onoda T., Müller K.-R. (2001) Soft margins for adaboost. Machine Learning 42(3): 287–320

    MATH  Article  Google Scholar 

  • Schuster E., Gregory C. (1982) On the non-consistency of maximum likelihood nonparametric density estimators. In: Eddy W.F.(eds) In Computer science and statistics: proceedings of the 13th symposium on the interface, New York, NY, USA: Springer pp 295–298

  • Self S.G., Liang K.Y. (1987) Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association 82: 605–610

    MATH  Article  MathSciNet  Google Scholar 

  • Shelton, C. R. (2001). Importance sampling for reinforcement learning with multiple objectives. Ph.D. thesis, Massachusetts Institute of Technology.

  • Shimodaira H. (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90(2): 227–244

    MATH  Article  MathSciNet  Google Scholar 

  • Steinwart I. (2001) On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research 2: 67–93

    Article  MathSciNet  Google Scholar 

  • Sugiyama M., Müller K.-R. (2005) Input-dependent estimation of generalization error under covariate shift. Statistics and Decisions 23(4): 249–279

    MATH  Article  MathSciNet  Google Scholar 

  • Sugiyama M., Kawanabe M., Müller K.-R. (2004) Trading variance reduction with unbiasedness: The regularized subspace information criterion for robust model selection in kernel regression. Neural Computation 16(5): 1077–1104

    MATH  Article  Google Scholar 

  • Sugiyama M., Krauledat M., Müller K.-R. (2007) Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8: 985–1005

    Google Scholar 

  • Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., Kawanabe, M. (2008). Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, Vol. 20. Cambridge: MIT Press.

  • Sutton R.S., Barto G.A. (1998) Reinforcement learning: An introduction. MIT Press, Cambridge

    Google Scholar 

  • Talagrand M. (1994) Sharper bounds for Gaussian and empirical processes. Annals of Probability 22: 28–76

    MATH  Article  MathSciNet  Google Scholar 

  • Talagrand M. (1996a) New concentration inequalities in product spaces. Inventiones Mathematicae 126: 505–563

    MATH  Article  MathSciNet  Google Scholar 

  • Talagrand M. (1996b) A new look at independence. Annals of Statistics 24: 1–34

    MATH  Article  MathSciNet  Google Scholar 

  • Tsuboi Y., Kashima H., Hido S., Bickel S., Sugiyama M. (2008) Direct density ratio estimation for large-scale covariate shift adaptation. In: In:Zaki M.J., Wang K., Apte C., Park H.(eds) Proceedings of the eighth SIAM international conference on data mining (SDM2008). Atlanta, Georgia, USA, pp 443–454

    Google Scholar 

  • van de Geer S. (2000) Empirical processes in M-Estimation. Cambridge University Press, Cambridge

    Google Scholar 

  • van der Vaart A.W., Wellner J.A. (1996) Weak convergence and empirical processes. With applications to statistics. Springer, New York

    MATH  Google Scholar 

  • Wolpaw J.R., Birbaumer N., McFarland D.J., Pfurtscheller G., Vaughan T.M. (2002) Brain-computer interfaces for communication and control. Clinical Neurophysiology 113(6): 767–791

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Masashi Sugiyama.

About this article

Cite this article

Sugiyama, M., Suzuki, T., Nakajima, S. et al. Direct importance estimation for covariate shift adaptation. Ann Inst Stat Math 60, 699–746 (2008). https://doi.org/10.1007/s10463-008-0197-x

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10463-008-0197-x

Keywords

  • Covariate shift
  • Importance sampling
  • Model misspecification
  • Kullback–Leibler divergence
  • Likelihood cross validation