
Machine Learning, Volume 107, Issue 8–10, pp 1457–1475

A distributed Frank–Wolfe framework for learning low-rank matrices with the trace norm

  • Wenjie Zheng
  • Aurélien Bellet
  • Patrick Gallinari
Article
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2018 Journal Track

Abstract

We consider the problem of learning a high-dimensional but low-rank matrix from a large-scale dataset distributed over several machines, where low-rankness is enforced by a convex trace norm constraint. We propose DFW-Trace, a distributed Frank–Wolfe algorithm which leverages the low-rank structure of its updates to achieve efficiency in time, memory and communication usage. The step at the heart of DFW-Trace is solved approximately using a distributed version of the power method. We provide a theoretical analysis of the convergence of DFW-Trace, showing that we can ensure sublinear convergence in expectation to an optimal solution with few power iterations per epoch. We implement DFW-Trace in the Apache Spark distributed programming framework and validate the usefulness of our approach on synthetic and real data, including the ImageNet dataset with high-dimensional features extracted from a deep neural network.
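The update at the heart of the method can be illustrated with a minimal single-machine sketch. The paper's DFW-Trace distributes the power-method step across workers in Spark; the sketch below keeps everything local, and the function names and the toy least-squares objective are illustrative assumptions, not the authors' code. It shows why Frank–Wolfe suits trace-norm constraints: the linear minimization oracle over the trace-norm ball is a rank-1 matrix obtained from the top singular pair of the gradient, so each epoch adds at most one to the rank of the iterate.

```python
import numpy as np

def top_singular_pair(G, n_iters=50):
    """Approximate the leading singular pair of G by power iteration.
    (DFW-Trace runs this step in a distributed fashion; here it is local.)"""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    return u, v

def frank_wolfe_trace(grad_fn, shape, radius, n_epochs=500):
    """Frank-Wolfe for min f(W) subject to ||W||_tr <= radius.
    The linear oracle over the trace-norm ball is the rank-1 matrix
    -radius * u v^T, with (u, v) the top singular pair of grad f(W)."""
    W = np.zeros(shape)
    for t in range(n_epochs):
        u, v = top_singular_pair(grad_fn(W))
        S = -radius * np.outer(u, v)   # rank-1 extreme point of the ball
        gamma = 2.0 / (t + 2)          # standard Frank-Wolfe step size
        W = (1 - gamma) * W + gamma * S
    return W

# Toy illustration: recover a rank-2 matrix under a squared loss.
rng = np.random.default_rng(42)
M = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
radius = np.linalg.svd(M, compute_uv=False).sum()  # trace norm of M
grad = lambda W: W - M      # gradient of 0.5 * ||W - M||_F^2
W_hat = frank_wolfe_trace(grad, M.shape, radius)
print(np.linalg.norm(W_hat - M) / np.linalg.norm(M))
```

Because the iterate is a convex combination of rank-1 atoms, it can be stored and communicated in factored form, which is the source of the time, memory, and communication savings the abstract refers to.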

Keywords

Frank–Wolfe algorithm · Low-rank learning · Trace norm · Distributed optimization · Multi-task learning · Multinomial logistic regression

Notes

Acknowledgements

This work was partially supported by ANR Pamela (Grant ANR-16-CE23-0016-01) and by a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015–2020. The first author would like to thank Ludovic Denoyer, Hubert Naacke, Mohamed-Amine Baazizi, and the engineers of LIP6 for their help during the deployment of the cluster.


Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. UMR 7606, LIP6, Sorbonne Universités, UPMC Univ Paris 06, Paris, France
  2. INRIA, Paris, France
