Computational Optimization and Applications, Volume 72, Issue 3, pp 675–705

A framework for parallel second order incremental optimization algorithms for solving partially separable problems

  • Kamer Kaya
  • Figen Öztoprak
  • Ş. İlker Birbil
  • A. Taylan Cemgil
  • Umut Şimşekli
  • Nurdan Kuru
  • Hazal Koptagel
  • M. Kaan Öztürk


We propose Hessian Approximated Multiple Subsets Iteration (HAMSI), a generic second-order incremental algorithm for solving large-scale, partially separable convex and nonconvex optimization problems. The algorithm is based on a local quadratic approximation and hence allows incorporating curvature information to speed up convergence. HAMSI is inherently parallel and scales well with the number of processors. We prove the convergence properties of our algorithm when the subset selection step is deterministic. Combined with techniques for effectively utilizing modern parallel computer architectures, we illustrate that a particular implementation of the proposed method based on L-BFGS updates converges more rapidly than a parallel gradient descent method when both are used to solve large-scale matrix factorization problems. This performance gain comes only at the expense of using memory that scales linearly with the total size of the optimization variables. We conclude that HAMSI may be a viable alternative in many large-scale problems where first-order methods based on variants of gradient descent are applicable.
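The paper's full algorithm is not reproduced on this page; as a rough illustration of the general idea the abstract describes, an incremental method that cycles over subsets of a partially separable objective and scales each step with L-BFGS-style curvature information, the following is a minimal sketch. The function names (`two_loop`, `incremental_lbfgs`), the step size, the memory size, and the choice of building curvature pairs from full-pass gradients are all illustrative assumptions, not HAMSI's actual specification.

```python
import numpy as np

def two_loop(grad, mem):
    """Standard L-BFGS two-loop recursion: returns an approximation of
    (inverse Hessian) @ grad from the stored (s, y) curvature pairs."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(mem):
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q -= a * y
    if mem:
        s, y = mem[-1]
        q *= (s @ y) / (y @ y)  # initial Hessian scaling gamma * I
    for (s, y), a in zip(mem, reversed(alphas)):
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return q

def incremental_lbfgs(subset_grads, x0, steps=50, eta=0.5, m=5):
    """Hypothetical incremental second-order loop: sweep over the subset
    gradients of a partially separable objective, preconditioning each
    step with a shared L-BFGS inverse-Hessian approximation."""
    x = x0.copy()
    mem = []                       # limited-memory (s, y) pairs
    x_prev, g_prev = None, None
    for _ in range(steps):
        for g_fn in subset_grads:  # one incremental pass over subsets
            x = x - eta * two_loop(g_fn(x), mem)
        g_full = sum(g_fn(x) for g_fn in subset_grads)
        if g_prev is not None:
            s, y = x - x_prev, g_full - g_prev
            if s @ y > 1e-10:      # keep only positive-curvature pairs
                mem.append((s, y))
                if len(mem) > m:
                    mem.pop(0)
        x_prev, g_prev = x.copy(), g_full
    return x
```

On a toy partially separable least-squares objective, e.g. two quadratic terms sharing the variables, the loop converges to the minimizer of the sum; the actual method in the paper additionally parallelizes the subset updates.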


Keywords: Large-scale unconstrained optimization · Second order information · Shared-memory parallel implementation · Balanced coloring · Balanced stratification · Matrix factorization



This work is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) Grant No. 113M492.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey
  2. Department of Industrial Engineering, Istanbul Bilgi University, Istanbul, Turkey
  3. Econometric Institute, Erasmus University Rotterdam, Rotterdam, The Netherlands
  4. Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
  5. LTCI, Télécom ParisTech, Université Paris-Saclay, Paris, France
