Knowledge and Information Systems, Volume 41, Issue 3, pp 793–819

Parallel matrix factorization for recommender systems

  • Hsiang-Fu Yu
  • Cho-Jui Hsieh
  • Si Si
  • Inderjit S. Dhillon
Regular Paper


Matrix factorization with missing values has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating least squares (ALS) and stochastic gradient descent (SGD) are two popular approaches to compute matrix factorization, and there has been a recent flurry of activity to parallelize these algorithms. However, due to its cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. SGD, on the other hand, performs efficient updates but usually suffers from slow convergence that is sensitive to its parameters. Coordinate descent, a classical optimization approach, has been applied to many other large-scale problems, but its application to matrix factorization for recommender systems has not been thoroughly explored. In this paper, we show that coordinate descent-based methods have a more efficient update rule than ALS and converge faster and more stably than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset containing 14.6 billion ratings, on a distributed memory cluster with 64 processors, to reach the desired test RMSE, CCD++ is 49 times faster than SGD and 20 times faster than ALS. When the number of processors is increased to 256, CCD++ takes only 16 seconds and is still 40 times faster than SGD and 20 times faster than ALS.
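The rank-one update scheme described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; it is a simplified, dense, single-threaded reconstruction of the CCD++ idea under assumed notation: factors W (m×k) and H (n×k), a 0/1 mask of observed ratings, L2 regularization weight `lam`, and closed-form single-variable updates applied to one rank-one factor at a time while maintaining a residual matrix.

```python
import numpy as np

def ccdpp(R, mask, k=2, lam=0.1, outer_iters=10, inner_iters=3):
    """CCD++-style rank-one coordinate descent for matrix factorization
    with missing values (a didactic sketch, not the paper's code).

    R    : m x n ratings matrix (values at unobserved entries ignored)
    mask : m x n array, 1.0 where a rating is observed, 0.0 otherwise
    """
    m, n = R.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((m, k)) * 0.1
    H = rng.standard_normal((n, k)) * 0.1
    E = (R - W @ H.T) * mask  # residual on the observed entries only
    for _ in range(outer_iters):
        for t in range(k):  # update rank-one factors one by one
            u, v = W[:, t].copy(), H[:, t].copy()
            # add factor t's contribution back into the residual
            Rhat = E + np.outer(u, v) * mask
            for _ in range(inner_iters):
                # closed-form coordinate updates: for each row i,
                # u_i = sum_j Rhat_ij v_j / (lam + sum_j v_j^2),
                # sums taken over observed entries j of row i
                u = (Rhat @ v) / (lam + mask @ (v * v))
                v = (Rhat.T @ u) / (lam + mask.T @ (u * u))
            # subtract the refreshed rank-one factor from the residual
            E = Rhat - np.outer(u, v) * mask
            W[:, t], H[:, t] = u, v
    return W, H
```

Because each inner update has a closed form, there is no learning rate to tune, which is the convergence-stability advantage over SGD claimed above; the paper's parallel variants distribute these row- and column-wise updates across cores or machines.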


Keywords: Recommender systems · Missing value estimation · Matrix factorization · Low-rank approximation · Parallelization · Distributed computing



This research was supported by NSF Grants CCF-0916309 and CCF-1117055 and by DOD Army Grant W911NF-10-1-0529. We also thank the Texas Advanced Computing Center (TACC) for providing the computing resources required to conduct the experiments in this work.



Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon — Department of Computer Science, The University of Texas at Austin, Austin, USA
