Abstract
In this paper, we focus on accelerating the doubly stochastic gradient descent method for computing the CANDECOMP/PARAFAC (CP) decomposition of tensors. This optimization problem has N blocks, where N is the order of the tensor. Under the doubly stochastic framework, each block subproblem is solved by the vanilla stochastic gradient method. However, the convergence analysis requires the variance to converge to zero, which is hard to verify in practice and may not hold in some implementations. In this paper, we propose to accelerate the stochastic gradient method by momentum acceleration and a variance reduction technique, and denote the resulting method by DS-MVR. Theoretically, the convergence of DS-MVR only requires the variance to be bounded. Under mild conditions, we show that DS-MVR converges to a stochastic \(\varepsilon \)-stationary solution in \(\tilde{\mathcal {O}}(N^{3/2}\varepsilon ^{-3})\) iterations with varying stepsizes and in \(\mathcal {O}(N^{3/2}\varepsilon ^{-3})\) iterations with constant stepsizes, respectively. Numerical experiments on four real-world datasets show that the proposed algorithm achieves better results than the baselines.
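To make the algorithmic idea concrete, the following is a minimal sketch of a doubly stochastic block update combined with a momentum-based variance-reduced (STORM-style) gradient estimator for a third-order CP model. It is an illustration under our own assumptions, not the paper's DS-MVR implementation: the row-sampling scheme, stepsize, momentum parameter `beta`, and all function names are hypothetical.

```python
import numpy as np

def stochastic_block_grad(T, factors, n, rows):
    """Stochastic gradient of 0.5*||T - [[A_1, A_2, A_3]]||_F^2 w.r.t. factor n,
    estimated from a sampled subset of rows of the mode-n unfolding (3-way case)."""
    R = factors[n].shape[1]
    others = [F for i, F in enumerate(factors) if i != n]
    # Khatri-Rao product of the two remaining factor matrices.
    KR = np.einsum('jr,kr->jkr', *others).reshape(-1, R)
    Tn = np.moveaxis(T, n, 0).reshape(T.shape[n], -1)          # mode-n unfolding
    G = np.zeros_like(factors[n])
    scale = T.shape[n] / len(rows)                              # rescale so the estimator is unbiased
    G[rows] = scale * (factors[n][rows] @ KR.T - Tn[rows]) @ KR
    return G

def ds_mvr_sketch(T, R, iters=2000, step=1e-2, beta=0.1, batch=32, seed=0):
    """Doubly stochastic updates: one random block (mode) and one random row batch
    per iteration, combined with a recursive variance-reduced gradient estimator."""
    rng = np.random.default_rng(seed)
    factors = [0.1 * rng.standard_normal((dim, R)) for dim in T.shape]
    prev = [F.copy() for F in factors]          # previous iterates, per block
    d = [np.zeros_like(F) for F in factors]     # gradient estimators, per block
    first = [True] * len(factors)
    for _ in range(iters):
        n = int(rng.integers(len(factors)))                     # sample a block
        rows = rng.choice(T.shape[n], size=min(batch, T.shape[n]), replace=False)
        g_new = stochastic_block_grad(T, factors, n, rows)
        if first[n]:
            d[n] = g_new                                        # plain stochastic gradient on first visit
            first[n] = False
        else:
            g_old = stochastic_block_grad(T, prev, n, rows)     # same sample, previous iterate
            d[n] = g_new + (1.0 - beta) * (d[n] - g_old)        # momentum + variance reduction
        prev[n] = factors[n].copy()
        factors[n] = factors[n] - step * d[n]
    return factors
```

With a small synthetic tensor, e.g. `T = np.einsum('ir,jr,kr->ijk', A, B, C)` for random factor matrices, the sketch drives the residual down; the algorithm analyzed in the paper differs in its sampling strategy, stepsize schedule, and the exact form of the estimator.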
Acknowledgements
The authors would like to thank the editor and anonymous reviewers for their insightful comments and constructive suggestions that improved the quality of our paper.
Additional information
Communicated by Lam M. Nguyen.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research is supported by the National Natural Science Foundation of China (NSFC) under grants 12131004, 11926358, and 12126608, and by the Fundamental Research Funds for the Central Universities (Grant No. YWF-22-T-204).
Appendix A: Several Lemmas for Theoretical Analysis in Algorithm 1
Lemma A.1
Under Assumption 4.3, we have
where \(\nabla _{A_{\xi ^{k}}}f^k=\nabla _{A_{\xi ^{k}}}f\left( A_{1}^{k},\dots ,A_{N}^{k}\right) \) and \(\nabla _{A_{\xi ^{k}}}f^{k-1} = \nabla _{A_{\xi ^{k}}}f\left( A_{1}^{k-1},\dots ,A_{N}^{k-1}\right) \).
Proof
We have the following inequality
where the last inequality follows from Young’s inequality. From Assumption 4.3, we have that
where the third equality follows from Assumption 4.3. \(\square \)
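For completeness, the form of Young's inequality typically invoked in such variance bounds is the standard fact (stated here for illustration; the exact constant \(\beta \) used in the proof above is not reproduced)
\[
\Vert a+b\Vert ^{2}\;\le \;(1+\beta )\Vert a\Vert ^{2}+\Bigl (1+\tfrac{1}{\beta }\Bigr )\Vert b\Vert ^{2},\qquad \forall \,a,b,\ \beta >0,
\]
which follows from expanding \(\Vert a+b\Vert ^{2}\) and bounding the cross term by \(2\langle a,b\rangle \le \beta \Vert a\Vert ^{2}+\beta ^{-1}\Vert b\Vert ^{2}\).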
Lemma A.2
Suppose the assumption in Theorem 4.1 holds. Then, we have
Proof
We first observe that
Further, we have
where the last inequality follows from the identity \(a^{3}-b^{3}=(a-b)(a^{2}+ab+b^{2})\), valid for any \(a,b\in \mathbb {R}\). From the above equation, we have
where the last inequality follows from the fact that \(\sum _{k=1}^{+\infty }\frac{1}{k^{3/2}}\approx 2.612\) is finite.
Combining (43) and (45), we obtain (42). This completes the proof.
\(\square \)
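As a brief aside on the numerical constant used in the last step of the proof, the finiteness of the series follows from a standard integral comparison,
\[
\sum _{k=1}^{+\infty }\frac{1}{k^{3/2}}\;\le \;1+\int _{1}^{+\infty }x^{-3/2}\,dx\;=\;1+2\;=\;3,
\]
and its exact value is \(\zeta (3/2)\approx 2.612\), which is the constant quoted above.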
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, Q., Cui, C. & Han, D. Accelerated Doubly Stochastic Gradient Descent for Tensor CP Decomposition. J Optim Theory Appl 197, 665–704 (2023). https://doi.org/10.1007/s10957-023-02193-5
Keywords
- Tensor CANDECOMP/PARAFAC decomposition
- Doubly stochastic gradient descent
- Nonconvex optimization
- Variance reduction
- Momentum acceleration