Machine Learning

Volume 79, Issue 1–2, pp 151–175

A theory of learning from different domains

  • Shai Ben-David
  • John Blitzer
  • Koby Crammer
  • Alex Kulesza
  • Fernando Pereira
  • Jennifer Wortman Vaughan

Abstract

Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. Often, however, we have plentiful labeled training data from a source domain but wish to learn a classifier which performs well on a target domain with a different distribution and little or no labeled training data. In this work we investigate two questions. First, under what conditions can a classifier trained from source data be expected to perform well on target data? Second, given a small amount of labeled target data, how should we combine it during training with the large amount of labeled source data to achieve the lowest target error at test time?

We address the first question by bounding a classifier’s target error in terms of its source error and the divergence between the two domains. We give a classifier-induced divergence measure that can be estimated from finite, unlabeled samples from the domains. Under the assumption that there exists some hypothesis that performs well in both domains, we show that this quantity together with the empirical source error characterize the target error of a source-trained classifier.
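The divergence in question is induced by the hypothesis class itself: it measures how well some classifier in the class can tell unlabeled source points from unlabeled target points, so it can be estimated without any target labels. As a rough illustration of that recipe, the Python sketch below trains a domain discriminator and maps its held-out error to a distance; the function name, the logistic-regression discriminator, and the 2(1 − 2·err) scaling are illustrative assumptions rather than the paper's exact estimator.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def proxy_divergence(X_source, X_target, random_state=0):
        """Illustrative estimate of a classifier-induced divergence between
        two unlabeled samples: train a discriminator to tell the domains
        apart; the harder that task is, the smaller the divergence."""
        X = np.vstack([X_source, X_target])
        y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, random_state=random_state, stratify=y)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        err = np.mean(clf.predict(X_te) != y_te)  # domain-classification error
        # err near 0.5 means the domains look alike to this hypothesis class;
        # err near 0 means they are easily separated.
        return max(0.0, 2.0 * (1.0 - 2.0 * err))

A value near 0 suggests the two domains are nearly indistinguishable to the hypothesis class, so low source error should transfer; a value near 2 signals easily separable domains and a correspondingly loose guarantee.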

We answer the second question by bounding the target error of a model which minimizes a convex combination of the empirical source and target errors. Previous theoretical work has considered minimizing just the source error, just the target error, or weighting instances from the two domains equally. We show how to choose the optimal combination of source and target error as a function of the divergence, the sample sizes of both domains, and the complexity of the hypothesis class. The resulting bound generalizes the previously studied cases and is always at least as tight as a bound which considers minimizing only the target error or an equal weighting of source and target errors.
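Concretely, the objective weights the empirical target error by α and the empirical source error by 1 − α: α = 1 recovers target-only training, α = 0 source-only training, while α = m_T/(m_S + m_T), with m_S and m_T the source and target sample sizes, corresponds to weighting all instances equally. The Python sketch below realizes such a convex combination with a surrogate (logistic) loss by reweighting examples; the helper name and the scikit-learn calls are illustrative assumptions, and the bound-optimal choice of α derived in the paper, which depends on the divergence, both sample sizes, and the complexity of the hypothesis class, is not reproduced here, so α is simply taken as an input.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_convex_combination(X_s, y_s, X_t, y_t, alpha, C=1.0):
        """Illustrative minimization of alpha * (empirical target error) +
        (1 - alpha) * (empirical source error) via per-example weights and
        a logistic surrogate loss."""
        m_s, m_t = len(X_s), len(X_t)
        X = np.vstack([X_s, X_t])
        y = np.concatenate([y_s, y_t])
        # Spread total weight (1 - alpha) over the source sample and
        # total weight alpha over the target sample.
        w = np.concatenate([
            np.full(m_s, (1.0 - alpha) / max(m_s, 1)),
            np.full(m_t, alpha / max(m_t, 1)),
        ])
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(X, y, sample_weight=w)
        return clf

The interesting regime is a small labeled target sample alongside a large labeled source sample: the bound then prescribes how far α should shift toward the target data as its sample grows or as the estimated divergence shrinks.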

Keywords

Domain adaptation · Transfer learning · Learning theory · Sample-selection bias

Copyright information

© The Author(s) 2009

Authors and Affiliations

  • Shai Ben-David (1)
  • John Blitzer (2)
  • Koby Crammer (3)
  • Alex Kulesza (4)
  • Fernando Pereira (5)
  • Jennifer Wortman Vaughan (6)

  1. David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
  2. Department of Computer Science, UC Berkeley, Berkeley, USA
  3. Department of Electrical Engineering, The Technion, Haifa, Israel
  4. Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA
  5. Google Research, Mountain View, USA
  6. School of Engineering and Applied Sciences, Harvard University, Cambridge, USA
