Machine Learning

Volume 85, Issue 1–2, pp 149–173

Boosted multi-task learning

  • Olivier Chapelle
  • Pannagadatta Shivaswamy
  • Srinivas Vadrevu
  • Kilian Weinberger
  • Ya Zhang
  • Belle Tseng

Abstract

In this paper we propose a novel algorithm for multi-task learning with boosted decision trees. We learn several different learning tasks with a joint model, explicitly addressing their commonalities through shared parameters and their differences through task-specific ones. This enables implicit data sharing and regularization. Our algorithm is derived using the relationship between ℓ1-regularization and boosting. We evaluate our learning method on web-search ranking data sets from several countries. Multi-task learning is particularly helpful in this setting because data sets from different countries vary widely in size, owing to the cost of editorial judgments. Further, the proposed method obtains state-of-the-art results on a publicly available multi-task data set. Our experiments validate that learning the various tasks jointly can lead to significant improvements in performance with surprising reliability.
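The abstract's core idea, a joint boosted-tree model with shared and task-specific components, can be made concrete with a short sketch. The Python code below is a simplified illustration under our own assumptions, not the paper's exact algorithm: it uses squared loss and, in each boosting round, greedily chooses between a single tree fit on the pooled residuals of all tasks (the shared component) and a tree fit on one task's residuals (a task-specific component), whereas the paper derives its update from the connection between ℓ1-regularization and boosting. The function name fit_boosted_mtl and the use of scikit-learn's DecisionTreeRegressor are illustrative choices.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_boosted_mtl(tasks, n_rounds=100, lr=0.1, max_depth=4):
        """Hypothetical sketch. tasks: list of (X, y) arrays, one pair per task.
        Returns one additive tree ensemble (list of (tree, weight)) per task."""
        preds = [np.zeros(len(y)) for _, y in tasks]
        ensembles = [[] for _ in tasks]

        for _ in range(n_rounds):
            # Current residuals play the role of gradient-boosting targets.
            res = [y - p for (_, y), p in zip(tasks, preds)]

            # Shared candidate: one tree on the pooled residuals of all tasks.
            X_all = np.vstack([X for X, _ in tasks])
            shared = DecisionTreeRegressor(max_depth=max_depth)
            shared.fit(X_all, np.concatenate(res))
            shared_gain = sum(
                np.sum(r ** 2) - np.sum((r - lr * shared.predict(X)) ** 2)
                for (X, _), r in zip(tasks, res))

            # Task-specific candidates: one tree per task on its own residuals.
            spec, gains = [], []
            for (X, _), r in zip(tasks, res):
                t = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
                spec.append(t)
                gains.append(np.sum(r ** 2) - np.sum((r - lr * t.predict(X)) ** 2))

            # Greedy selection (our simplification): keep whichever candidate
            # reduces the total squared loss the most in this round.
            best = int(np.argmax(gains))
            if shared_gain >= gains[best]:
                for k, (X, _) in enumerate(tasks):
                    preds[k] += lr * shared.predict(X)
                    ensembles[k].append((shared, lr))
            else:
                X, _ = tasks[best]
                preds[best] += lr * spec[best].predict(X)
                ensembles[best].append((spec[best], lr))

        return ensembles

    # To score new examples X_new for task k:
    #   y_hat = sum(w * tree.predict(X_new) for tree, w in ensembles[k])

In this sketch a shared tree competes against the best task-specific tree in every round, so small tasks implicitly borrow strength from the pooled fit while large tasks can still specialize; this mirrors the implicit data sharing and regularization the abstract describes.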

Keywords

Multi-task learning · Boosting · Decision trees · Web search · Ranking


Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Olivier Chapelle (1)
  • Pannagadatta Shivaswamy (2)
  • Srinivas Vadrevu (1)
  • Kilian Weinberger (3)
  • Ya Zhang (4)
  • Belle Tseng (1)

  1. Yahoo! Labs, Sunnyvale, USA
  2. Department of Computer Science, Cornell University, Ithaca, USA
  3. Washington University, Saint Louis, USA
  4. Shanghai Jiao Tong University, Shanghai, China
