Knowledge and Information Systems, Volume 53, Issue 2, pp 423–438

A new accelerated proximal technique for regression with high-dimensional datasets

Regular Paper

Abstract

We consider the problem of minimizing the sum of two convex functions, one of which is smooth while the other may be nonsmooth. Many high-dimensional learning problems (classification and regression) can be cast in this framework and solved efficiently with first-order proximal methods. Because traditional proximal methods converge slowly, a recent trend is to add acceleration to such methods, which increases the speed of convergence. These proximal gradient methods belong to the wider class of forward–backward algorithms, which can be interpreted mathematically as fixed-point iterative schemes. In this paper, we design several new proximal gradient methods corresponding to state-of-the-art fixed-point iterative schemes and compare their performance on the regression problem. In addition, we propose a new accelerated proximal gradient algorithm that outperforms earlier methods in terms of convergence speed and regression error. To demonstrate the applicability of our method, we conduct experiments on the regression problem with several publicly available high-dimensional real datasets from different application domains. The empirical results show that the proposed method outperforms previous methods in terms of convergence, accuracy, and objective function values.
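For context only, the forward–backward (proximal gradient) structure described above can be illustrated on the l1-regularized least-squares (lasso) problem. The sketch below is a standard FISTA-style accelerated proximal gradient iteration, not the new algorithm proposed in this paper; the soft_threshold helper, the step size derived from the spectral norm of A, and the synthetic data are illustrative assumptions.

    import numpy as np

    def soft_threshold(v, t):
        # Proximal operator of t * ||.||_1 (elementwise soft-thresholding).
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def accelerated_proximal_gradient(A, b, lam, n_iter=200):
        # Generic FISTA-style sketch for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
        L = np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of the smooth gradient
        x = np.zeros(A.shape[1])
        y = x.copy()
        t = 1.0
        for _ in range(n_iter):
            grad = A.T @ (A @ y - b)                       # forward (gradient) step on the smooth part
            x_new = soft_threshold(y - grad / L, lam / L)  # backward (proximal) step on the nonsmooth part
            t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            y = x_new + ((t - 1.0) / t_new) * (x_new - x)  # acceleration (momentum) step
            x, t = x_new, t_new
        return x

    # Illustrative usage on synthetic data (assumed, not from the paper's experiments).
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 500))
    x_true = np.zeros(500)
    x_true[:10] = rng.standard_normal(10)
    b = A @ x_true + 0.01 * rng.standard_normal(100)
    x_hat = accelerated_proximal_gradient(A, b, lam=0.1)

Viewing the update x_{k+1} = prox_{g/L}(y_k - (1/L) grad f(y_k)) as a fixed-point iteration of the forward–backward operator is what allows different fixed-point schemes to be turned into corresponding proximal gradient methods, which is the comparison carried out in the paper.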

Keywords

Nonsmooth convex minimization · Proximal methods · Regression

Acknowledgements

The first author would like to acknowledge the financial support of the Indian Institute of Technology (Banaras Hindu University) in the form of a teaching assistantship.

Copyright information

© Springer-Verlag London 2017

Authors and Affiliations

Department of Computer Science and Engineering, IIT (BHU), Varanasi, India