Machine Learning, Volume 90, Issue 2, pp. 161–189

A theory of transfer learning with applications to active learning


Abstract

We explore a transfer learning setting, in which a finite sequence of target concepts is sampled independently according to an unknown distribution from a known family. We study the total number of labeled examples required to learn all targets to an arbitrary specified expected accuracy, focusing on the asymptotics in the number of tasks and the desired accuracy. Our primary interest is in formally understanding the fundamental benefits of transfer learning, compared to learning each target independently of the others. Our approach to the transfer problem is general, in the sense that it can be used with a variety of learning protocols. As a particularly interesting application, we study in detail the benefits of transfer for self-verifying active learning; in this setting, we find that the number of labeled examples required for learning with transfer is often significantly smaller than that required for learning each target independently.
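
To make the setting concrete, here is a minimal, purely illustrative simulation (not the algorithm analyzed in the paper): target concepts are threshold classifiers on [0, 1], each task's threshold is drawn i.i.d. from an unknown but concentrated prior, and every task is learned actively by querying labels. Learning a task independently amounts to bisecting the interval; the transfer learner instead reuses thresholds recovered on earlier tasks as an empirical estimate of the prior and bisects that estimated prior mass. The Beta prior, the accuracy parameter EPS, and the prior-mass bisection heuristic are all assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

EPS = 1e-4        # learn each threshold to within EPS (stands in for the accuracy parameter)
N_TASKS = 200     # length of the task sequence
TRUE_A, TRUE_B = 2000.0, 500.0   # hypothetical unknown prior: Beta(2000, 500), sharply concentrated

def binary_search_labels(theta, eps=EPS):
    """Learn one task in isolation: plain bisection of the interval [0, 1]."""
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        queries += 1                 # one label request at x = mid
        if mid >= theta:             # observed label of h(x) = 1{x >= theta}
            hi = mid
        else:
            lo = mid
    return queries, 0.5 * (lo + hi)

def prior_guided_labels(theta, prior_samples, eps=EPS):
    """Bisect the estimated prior mass (thresholds from earlier tasks) instead of the interval."""
    lo, hi, queries = 0.0, 1.0, 0
    s = np.sort(np.asarray(prior_samples))
    while hi - lo > eps:
        inside = s[(s > lo) & (s < hi)]
        mid = float(np.median(inside)) if inside.size else 0.5 * (lo + hi)
        if not (lo < mid < hi):      # fall back to interval bisection if the sample median is degenerate
            mid = 0.5 * (lo + hi)
        queries += 1
        if mid >= theta:
            hi = mid
        else:
            lo = mid
    return queries, 0.5 * (lo + hi)

# Transfer loop: thresholds estimated on earlier tasks serve as an empirical estimate of the unknown prior.
history, independent_cost, transfer_cost = [], [], []
for t in range(N_TASKS):
    theta = rng.beta(TRUE_A, TRUE_B)          # target concept for task t
    q_ind, _ = binary_search_labels(theta)
    if len(history) >= 10:
        q_tr, est = prior_guided_labels(theta, history)
    else:                                      # warm-up: no usable prior estimate yet
        q_tr, est = binary_search_labels(theta)
    history.append(est)
    independent_cost.append(q_ind)
    transfer_cost.append(q_tr)

print(f"avg labels per task, independent learning: {np.mean(independent_cost):.1f}")
print(f"avg labels per task, with transfer (after warm-up): {np.mean(transfer_cost[50:]):.1f}")
```

When the prior is sharply concentrated, the prior-guided search settles on the threshold with noticeably fewer label requests per task once enough tasks have accumulated; this is the qualitative gap between transfer and independent learning that the paper quantifies.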

Keywords

Transfer learning · Multi-task learning · Active learning · Statistical learning theory · Bayesian learning · Sample complexity

Copyright information

© The Author(s) 2012

Authors and Affiliations

  1. Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA
  2. Department of Statistics, Carnegie Mellon University, Pittsburgh, USA
  3. Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA
