Knowledge and Information Systems, Volume 42, Issue 3, pp 493–523

Shared-memory and shared-nothing stochastic gradient descent algorithms for matrix completion

  • Faraz Makari
  • Christina Teflioudi
  • Rainer Gemulla
  • Peter Haas
  • Yannis Sismanis
Regular Paper

Abstract

We provide parallel algorithms for large-scale matrix completion on problems with millions of rows, millions of columns, and billions of revealed entries. We focus on in-memory algorithms that run either in a shared-memory environment on a powerful compute node or in a shared-nothing environment on a small cluster of commodity nodes; even very large problems can be handled effectively in these settings. Our ASGD, DSGD-MR, DSGD++, and CSGD algorithms are novel variants of the popular stochastic gradient descent (SGD) algorithm, with the latter three algorithms based on a new “stratified SGD” approach. All of the algorithms are cache-friendly and exploit thread-level parallelism, in-memory processing, and asynchronous communication. We investigate the performance of both new and existing algorithms via a theoretical complexity analysis and a set of large-scale experiments. The results show that CSGD is more scalable, and up to 60% faster, than the best-performing alternative method in the shared-memory setting. DSGD++ is superior in terms of overall runtime, memory consumption, and scalability in the shared-nothing setting. For example, DSGD++ can solve a difficult matrix completion problem on a high-variance matrix with 10M rows, 1M columns, and 10B revealed entries in around 40 min on 16 compute nodes. In general, algorithms based on SGD appear to perform better than algorithms based on alternating minimization, such as the PALS and DALS alternating least-squares algorithms.
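
The algorithms above parallelize SGD for matrix completion, i.e., fitting low-rank factors U and W to the revealed entries of a sparse matrix. As a rough illustration only, the following minimal sketch shows the underlying sequential SGD update, assuming a plain squared loss with L2 regularization and a fixed step size; the paper's exact loss, step-size schedule, and stratification scheme are not reproduced here.

```python
import numpy as np

def sgd_matrix_completion(rows, cols, vals, m, n, rank=10, reg=0.05,
                          step=0.01, epochs=20, seed=0):
    """Sequential SGD sketch for matrix completion (illustrative only).

    Approximately minimizes, over the revealed entries (i, j, v),
        sum (v - U[i] . W[j])**2 + reg * (||U[i]||**2 + ||W[j]||**2)
    by visiting entries in random order and updating the two factor rows
    touched by each entry.
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(m, rank))   # row (e.g., user) factors
    W = rng.normal(scale=0.1, size=(n, rank))   # column (e.g., item) factors
    order = np.arange(len(vals))
    for _ in range(epochs):
        rng.shuffle(order)
        for k in order:
            i, j, v = rows[k], cols[k], vals[k]
            err = v - U[i] @ W[j]
            ui = U[i].copy()                     # keep old U[i] for the W update
            U[i] += step * (err * W[j] - reg * U[i])
            W[j] += step * (err * ui - reg * W[j])
    return U, W
```

The stratified SGD variants discussed in the paper partition the revealed entries into blocks so that threads or nodes can run updates of this kind concurrently on disjoint sets of factor rows; the sketch above shows only the single-threaded baseline.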

Keywords

Parallel and distributed matrix completion · Low-rank matrix factorization · Stochastic gradient descent · Recommender systems


Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Faraz Makari (1)
  • Christina Teflioudi (1)
  • Rainer Gemulla (1)
  • Peter Haas (2)
  • Yannis Sismanis (3)

  1. Max Planck Institute for Computer Science, Saarbrücken, Germany
  2. IBM Almaden Research Center, San Jose, USA
  3. Google, Mountain View, USA
