Machine Learning, Volume 73, Issue 3, pp 243–272

Convex multi-task feature learning

  • Andreas Argyriou
  • Theodoros Evgeniou
  • Massimiliano Pontil


We present a method for learning sparse representations shared across multiple tasks. This method is a generalization of the well-known single-task 1-norm regularization. It is based on a novel non-convex regularizer which controls the number of learned features common across the tasks. We prove that the method is equivalent to solving a convex optimization problem for which there is an iterative algorithm which converges to an optimal solution. The algorithm has a simple interpretation: it alternately performs a supervised and an unsupervised step, where in the former step it learns task-specific functions and in the latter step it learns common-across-tasks sparse representations for these functions. We also provide an extension of the algorithm which learns sparse nonlinear representations using kernels. We report experiments on simulated and real data sets which demonstrate that the proposed method can both improve the performance relative to learning each task independently and lead to a few learned features common across related tasks. Our algorithm can also be used, as a special case, to simply select—not learn—a few common variables across the tasks.
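The alternating scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation: it assumes squared loss, a linear model per task, a regularizer of the form γ·wᵀD⁻¹w with D positive semidefinite and of unit trace, and a closed-form update for D that is perturbed by a small `eps` for numerical stability. The function name `mtfl_alternating` and all parameter names are hypothetical.

```python
import numpy as np

def mtfl_alternating(Xs, ys, gamma=1.0, eps=1e-6, n_iter=50):
    """Hypothetical sketch of alternating multi-task feature learning.

    Xs[t] has shape (m_t, d) and ys[t] has shape (m_t,) for task t.
    Returns W (d x T task weights) and D (d x d shared matrix, trace 1).
    """
    d = Xs[0].shape[1]
    T = len(Xs)
    D = np.eye(d) / d              # feasible start: PSD with unit trace
    W = np.zeros((d, T))
    for _ in range(n_iter):
        # Supervised step (assumed squared loss): with D fixed, each task
        # solves  min_w ||y - X w||^2 + gamma * w^T D^{-1} w.
        # The equivalent form w = D X^T (X D X^T + gamma I)^{-1} y
        # avoids inverting D explicitly.
        for t, (X, y) in enumerate(zip(Xs, ys)):
            K = X @ D @ X.T + gamma * np.eye(X.shape[0])
            W[:, t] = D @ X.T @ np.linalg.solve(K, y)
        # Unsupervised step: with W fixed, set D proportional to the
        # matrix square root of (W W^T + eps I), normalized to trace 1.
        evals, evecs = np.linalg.eigh(W @ W.T + eps * np.eye(d))
        root = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
        D = root / np.trace(root)
    return W, D
```

In this sketch the eigenvectors of D play the role of the shared features, and small eigenvalues indicate directions that few tasks rely on. Other loss functions would need a different solver in the supervised step.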


Keywords: Collaborative filtering, Inductive transfer, Kernels, Multi-task learning, Regularization, Transfer learning, Vector-valued functions



Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Andreas Argyriou (1), Email author
  • Theodoros Evgeniou (2)
  • Massimiliano Pontil (1)

  1. Department of Computer Science, University College London, London, UK
  2. Technology Management and Decision Sciences, INSEAD, Fontainebleau, France
