# A distributed Frank–Wolfe framework for learning low-rank matrices with the trace norm

## Abstract

We consider the problem of learning a high-dimensional but low-rank matrix from a large-scale dataset distributed over several machines, where low-rankness is enforced by a convex trace norm constraint. We propose DFW-Trace, a distributed Frank–Wolfe algorithm which leverages the low-rank structure of its updates to achieve efficiency in time, memory and communication usage. The step at the heart of DFW-Trace is solved approximately using a distributed version of the power method. We provide a theoretical analysis of the convergence of DFW-Trace, showing that we can ensure sublinear convergence in expectation to an optimal solution with few power iterations per epoch. We implement DFW-Trace in the Apache Spark distributed programming framework and validate the usefulness of our approach on synthetic and real data, including the ImageNet dataset with high-dimensional features extracted from a deep neural network.
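The core update described in the abstract, i.e. a Frank–Wolfe step over the trace-norm ball whose linear subproblem is solved by power iterations on the gradient, can be sketched on a single machine as follows. This is a hypothetical illustration only: the function names (`power_method`, `frank_wolfe_trace`), the fixed iteration counts, and the standard `2/(t+2)` step size are assumptions, and the paper's actual DFW-Trace distributes the matrix–vector products of the power method across workers (e.g. in Apache Spark) rather than running them locally.

```python
import numpy as np

def power_method(G, n_iters=50, rng=None):
    """Approximate the top singular pair (u, sigma, v) of G by power iterations.
    In DFW-Trace the products G @ v and G.T @ u are the distributed steps;
    here they run on a single machine for illustration."""
    rng = np.random.default_rng(rng)
    d, m = G.shape
    v = rng.standard_normal(m)
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    sigma = u @ G @ v  # top singular value estimate
    return u, sigma, v

def frank_wolfe_trace(grad_fn, shape, tau, n_epochs=100, power_iters=20):
    """Frank-Wolfe over the trace-norm ball {W : ||W||_* <= tau}.
    The linear minimization oracle over this ball is the rank-one matrix
    S = -tau * u v^T, with (u, v) the top singular vectors of the gradient."""
    W = np.zeros(shape)
    for t in range(n_epochs):
        G = grad_fn(W)
        u, _, v = power_method(G, n_iters=power_iters, rng=t)
        S = -tau * np.outer(u, v)        # rank-one FW vertex
        gamma = 2.0 / (t + 2.0)          # standard FW step size
        W = (1 - gamma) * W + gamma * S  # iterate stays rank <= t + 1
    return W
```

Because each update adds a single rank-one term, the iterate after `t` epochs has rank at most `t + 1`, which is the low-rank structure the algorithm exploits for time, memory, and communication savings.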

## Keywords

Frank–Wolfe algorithm · Low-rank learning · Trace norm · Distributed optimization · Multi-task learning · Multinomial logistic regression

## Notes

### Acknowledgements

This work was partially supported by ANR Pamela (Grant ANR-16-CE23-0016-01) and by a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015–2020. The first author would like to thank Ludovic Denoyer, Hubert Naacke, Mohamed-Amine Baazizi, and the engineers of LIP6 for their help during the deployment of the cluster.
