Machine Learning, Volume 83, Issue 3, pp 265–287

Estimating variable structure and dependence in multitask learning via gradients

Open Access


Abstract

We consider the problem of hierarchical or multitask modeling, in which we simultaneously learn the regression function and the underlying geometry and dependence between variables. We demonstrate how the gradients of the multiple related regression functions allow for dimension reduction and inference of dependencies, both jointly across tasks and for each task individually. We provide Tikhonov regularization algorithms for both classification and regression that are efficient and robust for high-dimensional data, and a mechanism for incorporating a priori knowledge of task (dis)similarity into this framework. The utility of this method is illustrated on simulated and real data.


Keywords: Multitask learning · Dimension reduction · Covariance estimation · Inverse regression · Graphical models
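The core idea the abstract describes, estimating the gradients of each task's regression function and using them for dimension reduction, can be sketched in a few lines. The following is an illustrative reconstruction under stated assumptions, not the paper's algorithm: it fits each task by a Tikhonov-regularized (kernel ridge) estimator with an assumed RBF kernel, approximates the gradients by finite differences, forms each task's gradient outer product matrix, and pools these matrices across tasks so their leading eigenvectors expose the shared predictive subspace. The two toy tasks and all parameter values are invented for the demonstration.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Gaussian RBF kernel matrix between rows of X and rows of Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_krr(X, y, lam=1e-3, sigma=1.5):
    # Tikhonov-regularized (kernel ridge) fit of the regression function
    n = len(X)
    K = rbf_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return lambda Z: rbf_kernel(Z, X, sigma) @ alpha

def gradient_outer_product(f, X, eps=1e-4):
    # Estimate gradients of f at the sample points by central differences,
    # then average their outer products: Gamma ~ E[grad f(x) grad f(x)^T].
    n, p = X.shape
    G = np.zeros((n, p))
    for j in range(p):
        E = np.zeros(p)
        E[j] = eps
        G[:, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G.T @ G / n

# Two related toy tasks whose regression functions depend only on x1, x2:
# both tasks share the same two-dimensional predictive subspace.
rng = np.random.default_rng(0)
Gammas = []
for shift in (0.0, 0.5):
    X = rng.normal(size=(100, 5))
    y = np.sin(X[:, 0] + shift) + X[:, 1] ** 2 + 0.05 * rng.normal(size=100)
    f = fit_krr(X, y)
    Gammas.append(gradient_outer_product(f, X))

# Pooling the per-task gradient outer products and eigen-decomposing
# recovers the shared directions: the spectrum concentrates on the
# first two eigenvalues, matching the true (x1, x2) subspace.
evals = np.linalg.eigvalsh(sum(Gammas))[::-1]
print(evals)
```

Incorporating a priori task (dis)similarity, as the abstract mentions, would amount to weighting the per-task matrices in the pooled sum rather than summing them uniformly as done here.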


Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. Sage Bionetworks, Seattle, USA
  2. Department of Mathematics, Michigan State University, East Lansing, USA
  3. Departments of Statistical Science, Computer Science, and Mathematics, Duke University, Durham, USA