
Regularized sample average approximation for high-dimensional stochastic optimization under low-rankness

Published in: Journal of Global Optimization

Abstract

This paper concerns a high-dimensional stochastic programming (SP) problem of minimizing a function of expected cost with a matrix argument. For this problem, one of the most widely applied solution paradigms is the sample average approximation (SAA), which uses the average cost over sampled scenarios as a surrogate for the expected cost. Traditional SAA theories require the sample size to grow rapidly as the problem dimensionality increases. Indeed, for a problem of optimizing over a p-by-p matrix, the sample complexity of the SAA is \({\widetilde{O}}(1)\cdot \frac{p^2}{\epsilon ^2}\cdot {polylog}(\frac{1}{\epsilon })\) to achieve an \(\epsilon \)-suboptimality gap, for some poly-logarithmic function \({polylog}(\,\cdot \,)\) and some quantity \({\widetilde{O}}(1)\) independent of the dimensionality p and the sample size n. In contrast, this paper considers a regularized SAA (RSAA) with a low-rankness-inducing penalty. We demonstrate that, when the optimal solution to the SP is of low rank, the sample complexity of RSAA is \({\widetilde{O}}(1)\cdot \frac{p}{\epsilon ^3}\cdot {polylog}(p,\,\frac{1}{\epsilon })\), which is almost linear in p and thus indicates a substantially lower dependence on dimensionality. Therefore, RSAA can be more advantageous than SAA, especially for larger-scale and higher-dimensional problems. Due to the close correspondence between stochastic programming and statistical learning, our results also indicate that high-dimensional low-rank matrix recovery is generally possible beyond a linear model, even when the common assumption of restricted strong convexity is completely absent.

References

  1. Ariyawansa, K.A., Zhu, Y.: Stochastic semidefinite programming: a new paradigm for stochastic optimization. 4OR 4(3), 239–253 (2006)

  2. Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 101(473), 138–156 (2006)

  3. Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)

  4. Candès, E.J., Plan, Y.: Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inf. Theory 57(4), 2342–2359 (2011). https://doi.org/10.1109/TIT.2011.2111771

  5. Clémençon, S., Lugosi, G., Vayatis, N.: Ranking and empirical minimization of \(U\)-statistics. Ann. Stat. 36(2), 844–874 (2008)

  6. Elsener, A., van de Geer, S.: Robust low-rank matrix estimation. Ann. Stat. 46(6B), 3481–3509 (2018)

  7. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001). https://doi.org/10.1198/016214501753382273

  8. Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx (2013)

  9. Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control (a tribute to M. Vidyasagar), Lecture Notes in Control and Information Sciences, pp. 95–110. Springer (2008). http://stanford.edu/~boyd/graph_dcp.html

  10. Jain, P., Tewari, A., Kar, P.: On iterative hard thresholding methods for high-dimensional M-estimation. In: Advances in Neural Information Processing Systems, pp. 685–693 (2014)

  11. Koltchinskii, V.: Rademacher complexities and bounding the excess risk in active learning. J. Mach. Learn. Res. 11, 2457–2485 (2010)

  12. Liu, H., Lee, H.Y., Huo, Z.: Linearly constrained high-dimensional learning. Working paper (2019)

  13. Liu, H., Ye, Y.: High-dimensional learning under approximate sparsity: towards a unified framework for nonsmooth learning and regularized neural networks. arXiv:1903.00616 (2019)

  14. Liu, H., Yao, T., Li, R., Ye, Y.: Folded concave penalized sparse linear regression: sparsity, statistical performance, and algorithmic theory for local solutions. Math. Program. 166(1), 207–240 (2017). https://doi.org/10.1007/s10107-017-1114-y

  15. Liu, H., Wang, X., Yao, T., Li, R., Ye, Y.: Sample average approximation with sparsity-inducing penalty for high-dimensional stochastic programming. Math. Program. (2018). https://doi.org/10.1007/s10107-018-1278-0

  16. Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems, pp. 451–459 (2011)

  17. Negahban, S., Wainwright, M.J.: Restricted strong convexity and weighted matrix completion: optimal bounds with noise. J. Mach. Learn. Res. 13, 1665–1697 (2012)

  18. Rohde, A., Tsybakov, A.B.: Estimation of high-dimensional low-rank matrices. Ann. Stat. 39(2), 887–930 (2011)

  19. Ruszczyński, A., Shapiro, A.: Stochastic programming models. In: Stochastic Programming, Handbooks in Operations Research and Management Science, vol. 10, pp. 1–64. Elsevier (2003). https://doi.org/10.1016/S0927-0507(03)10001-1

  20. Ruszczyński, A., Shapiro, A.: Optimality and duality in stochastic programming. In: Stochastic Programming, Handbooks in Operations Research and Management Science, vol. 10, pp. 65–139. Elsevier (2003). https://doi.org/10.1016/S0927-0507(03)10002-3

  21. Shapiro, A., Xu, H.: Uniform laws of large numbers for set-valued mappings and subdifferentials of random functions. J. Math. Anal. Appl. 325(2), 1390–1399 (2007). https://doi.org/10.1016/j.jmaa.2006.02.078

  22. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia (2014)

  23. Tanner, J., Wei, K.: Normalized iterative hard thresholding for matrix completion. SIAM J. Sci. Comput. 35(5), S104–S125 (2013)

  24. Vandenberghe, L., Boyd, S.: Applications of semidefinite programming. Appl. Numer. Math. 29(3), 283–299 (1999)

  25. Vandenberghe, L., Boyd, S.: Semidefinite programming. SIAM Rev. 38(1), 49–95 (1996)

  26. Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010). https://doi.org/10.1214/09-AOS729


Acknowledgements

The authors would like to thank the anonymous reviewers and editors for their constructive comments that have helped improve this paper. This research is partially supported by NSF grant CMMI-2016571.

Author information

Correspondence to Hung Yi Lee.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Technical proofs

1.1 Proof of results on sample complexity

1.1.1 General ideas

The general idea of our proof centers on one question: how to show that an \(\hbox {S}^3\)ONC solution has low rank. Once this question is answered, the desired results follow almost immediately by analyzing the \(\epsilon \)-net for all the low-rank subspaces. Such an analysis is available in Lemma 3.1 of [4] and is restated (with minor modifications) in Lemma 28 herein. Proposition 22 then establishes a bound between the average cost and the expected cost, uniformly over all the low-rank subspaces.

To bound the rank of an \(\hbox {S}^3\)ONC solution, we utilize a unique property of the MCP function, which ensures that any \(\hbox {S}^3\)ONC solution \({\mathbf {X}}^{RSAA}\) must obey a thresholding rule: every singular value must be either 0 or at least \(a\lambda \), where a and \(\lambda \) are hyper-parameters of the penalty. Proposition 21 formalizes this thresholding rule.

By this rule and the definition of the MCP, each nonzero singular value of the \(\hbox {S}^3\)ONC solution \({\mathbf {X}}^{RSAA}\) incurs a penalty of exactly \(\frac{a\lambda ^2}{2}\), so the total penalty incurred by the MCP-based low-rankness-inducing regularization becomes \(\sum _{j=1}^pP_\lambda (\sigma _j({\mathbf {X}}^{RSAA}))={\mathbf{rk}({\mathbf {X}}^{RSAA})}\cdot \frac{a\lambda ^2}{2}\). Now, consider those \(\hbox {S}^3\)ONC points whose suboptimality gaps in minimizing the RSAA are smaller than a user-specified quantity \(\Gamma \). These solutions must satisfy

$$\begin{aligned} {\mathcal {F}}_{n,\lambda }({\mathbf {X}}^{RSAA},{\mathbf {Z}}_1^n)={\mathcal {F}}_{n}({\mathbf {X}}^{RSAA},{\mathbf {Z}}_1^n)+{\mathbf{rk}({\mathbf {X}}^{RSAA})}\cdot \frac{a\lambda ^2}{2}\le {\mathcal {F}}_{n,\lambda }({\mathbf {X}}^{*},{\mathbf {Z}}_1^n)+\Gamma . \end{aligned}$$

From this inequality, we may observe that the rank of \(\mathbf{X}^{RSAA}\) must be bounded from above, and it is easy to see that this upper bound is a function of \(\Gamma \). This function is explicated by Proposition 23. The desired result in Theorem 7 then follows immediately by combining Propositions 22 and 23. Finally, the value of \(\Gamma \) can be well contained and explicated by proper (and tractable) initializations, as shown in Corollaries 10 and 12. The proofs of these corollaries are based on Lemma 27, which shows that \({\mathbf {X}}_\lambda ^{\ell _1}\) yields a small value of \(\Gamma \).
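To make the thresholding identity concrete, the following minimal sketch (ours, not from the paper; the spectrum and the values of \(a\) and \(\lambda \) are hypothetical) evaluates the MCP \(P_\lambda (t)=\int _0^{t}\frac{[a\lambda -\theta ]_+}{a}\,d\theta \) on the singular values of a matrix and verifies that any spectrum obeying the thresholding rule incurs the total penalty \(\mathbf{rk}(\cdot )\cdot \frac{a\lambda ^2}{2}\).

```python
import numpy as np

def mcp(t, lam, a):
    # MCP penalty P_lambda(t) = int_0^t max(a*lam - theta, 0)/a dtheta,
    # i.e., lam*t - t^2/(2a) on [0, a*lam] and the constant a*lam^2/2 thereafter.
    t = np.abs(t)
    return np.where(t <= a * lam, lam * t - t**2 / (2 * a), a * lam**2 / 2)

def total_penalty(X, lam, a):
    # Low-rankness-inducing regularizer: sum of the MCP over singular values.
    return mcp(np.linalg.svd(X, compute_uv=False), lam, a).sum()

lam, a = 0.5, 2.0
sigma = np.array([3.0, 1.7, a * lam, 0.0, 0.0])  # hypothetical spectrum: 0 or >= a*lam
X = np.diag(sigma)
assert np.isclose(total_penalty(X, lam, a),
                  np.count_nonzero(sigma) * a * lam**2 / 2)
```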

1.1.2 Proof of Theorem 7

Proof

This proof substantially generalizes the argument of Proposition 1 in [13] from handling sparsity to handling low-rankness. Meanwhile, much more flexible choices of the penalty parameter \(\lambda \) are enabled. We follow the same set of notations as in Proposition 24 in defining \(\widetilde{p}_u\), \(\epsilon \), and \( \Delta _1(\epsilon ):=\ln \left( \frac{18 p R\cdot (K_C+{\mathcal {C}}_\mu )}{\epsilon }\right) \). Furthermore, we will let \(\epsilon :={\frac{1}{n^{1/3}}}\) and \(\widetilde{\Delta }:=\ln \left( 18 \cdot R\cdot (K_C+{\mathcal {C}}_\mu )\right) \). Then \(\Delta _1(\epsilon )=\ln \left( \frac{18\cdot (K_C+\mathcal C_\mu )\cdot p\cdot R}{\epsilon }\right) =\ln (n^{1/3}p)+\widetilde{\Delta }>0\) and \(\lambda =\sqrt{\frac{8\cdot s^{-\rho }\cdot K(2p+1)^{2/3}\cdot \Delta _1(\epsilon )}{c\cdot a\cdot n^{2/3}}}=\sqrt{\frac{8\cdot s^{-\rho }\cdot K \cdot (2p+1)^{2/3}}{c\cdot a\cdot n^{2/3}}[\ln (n^{1/3} p)+\widetilde{\Delta }]}\). We denote by O(1) universal constants that may differ at each occurrence.

To show the desired results, it suffices to simplify the results in Proposition 24. We will first derive an explicit form for \({\widetilde{p}}_u\). To that end, we let \({P_X}:= { {\widetilde{p}}_u}\) and \(T_1:=2P_\lambda (a\lambda )-\frac{8K\cdot (2p+1) }{cn}\Delta _1(\epsilon )\). We then solve the following inequality, which is equivalent to (41) of Proposition 24, for a feasible \(P_X\),

$$\begin{aligned} \frac{T_1}{2}\cdot P_X-\frac{2 K }{\sqrt{n}}\sqrt{\frac{2P_X\cdot (2p+1)\Delta _1(\epsilon )}{c}}>\Gamma +2\epsilon +sP_\lambda (a\lambda ), \end{aligned}$$
(22)

for the same \(c\in (0,\, 0.5]\) in (8). Solving the above inequality in terms of \(P_X\), we have \( \sqrt{P_X}> \frac{2 K}{T_1\sqrt{n}}\sqrt{\frac{2(2p+1)\cdot \Delta _1(\epsilon )}{c}}+ \frac{\sqrt{\frac{2(2 K )^2\cdot (2p+1)\cdot \Delta _1(\epsilon )}{cn}+2T_1[\Gamma +2\epsilon +sP_\lambda (a\lambda )]}}{T_1}.\) To find a feasible \(P_X\), we may as well let \( {P_X}> \frac{32K^2\cdot (2p+1)\cdot \Delta _1(\epsilon )}{cT_1^2\cdot n}+8T_1^{-1}[\Gamma +2\epsilon +sP_\lambda (a\lambda )].\) For \(\lambda =\sqrt{\frac{8K\cdot s^{-\rho } \cdot \Delta _1(\epsilon )\cdot (2p+1)^{2/3}}{c\cdot a\cdot n^{2/3}}}=\sqrt{\frac{8K\cdot s^{-\rho }\cdot (2p+1)^{2/3}}{c\cdot a\cdot n^{2/3}}[\ln (n^{1/3}p)+{\widetilde{\Delta }}]}\) with \(\widetilde{\Delta }:=\ln \left( 18 \cdot R\cdot (K_C+{\mathcal {C}}_\mu )\right) \), we have \(P_\lambda (a\lambda )=\frac{a\lambda ^2}{2}=\frac{4K \cdot s^{-\rho }\cdot (2p+1)^{2/3}}{c\cdot n^{2/3}}\cdot \Delta _1(\epsilon )\). Furthermore, \(2P_\lambda (a\lambda )=\frac{8K \cdot s^{-\rho }\cdot (2p+1)^{2/3}\cdot \Delta _1(\epsilon )}{c\cdot n^{2/3}}> \frac{4 \cdot s^{-\rho } K\cdot \Delta _1(\epsilon )\cdot (2p+1)^{2/3}}{c\cdot n^{2/3}}+\frac{8K \cdot (2p+1)}{nc}\Delta _1(\epsilon )\) as per our assumption (i.e., (11) implies that \(n^{1/3}> 2 s^{\rho }\)). Therefore, \(T_1=2P_\lambda (a\lambda )- \frac{8K\cdot (2p+1)}{nc}\Delta _1(\epsilon )>\frac{4 K \cdot s^{-\rho }\cdot \Delta _1(\epsilon )\cdot (2p+1)^{2/3}}{c\cdot n^{2/3}}\). Hence, if we recall \(\epsilon =n^{-1/3}\), to satisfy (22), it suffices to let \(P_X\) be any integer that satisfies \( P_X \ge \frac{2cn^{1/3} s^{2\rho }}{\Delta _1(n^{-\frac{1}{3}})\cdot (2p+1)^{1/3}}+\frac{2cn^{2/3}s^{\rho }}{K\cdot \Delta _1(n^{-\frac{1}{3}})\cdot (2p+1)^{2/3}}\cdot \left[ \Gamma +\frac{2}{n^{1/3}}+sP_\lambda (a\lambda )\right] ,\) which is satisfied by letting \(P_X\ge {\widetilde{p}}_u\) with

$$\begin{aligned} {\widetilde{p}}_u :=&\left\lceil \frac{2cn^{1/3}s^{2\rho }}{\Delta _1(n^{-\frac{1}{3}})\cdot (2p+1)^{1/3}}+\frac{2cn^{2/3}s^{\rho }}{K\cdot \Delta _1(n^{-\frac{1}{3}})\cdot (2p+1)^{2/3}}\cdot \left( \Gamma +\frac{2}{n^{1/3}}\right) +8s\right\rceil . \end{aligned}$$
(23)

In the meantime, verifiably, \({\widetilde{p}}_u>s\). Since the above is sufficient to ensure (22), we know that (41) in Proposition 24 holds for any \({\widetilde{p}}:\, {\widetilde{p}}_u\le {\widetilde{p}}\le p\). Due to Proposition 24, with probability at least \( P^*:= \, 1-6\exp \left( -{\widetilde{p}}_u\cdot (2p+1)\cdot \Delta _1(n^{-\frac{1}{3}})\right) -2 (p+1) \exp (-{\widetilde{c}}n) \ge 1-6\exp (-2c\cdot (2p+1)^{2/3} \cdot n^{1/3})-2 (p+1) \exp (-{\widetilde{c}}n)\), it holds that

$$\begin{aligned} {\mathbb {F}}({\mathbf {X}}^{RSAA})-{\mathbb {F}}({{\mathbf {X}}}^*)\le & {} s\cdot P_\lambda (a\lambda )+ {\frac{2K}{\sqrt{n}}}\sqrt{\frac{2 {\widetilde{p}}_u(2p+1)}{c}\Delta _1(n^{-\frac{1}{3}})} \nonumber \\&+\frac{4K }{n}\frac{ {\widetilde{p}}_u(2p+1)}{c}\Delta _1(n^{-\frac{1}{3}})+2\epsilon +\Gamma , \end{aligned}$$
(24)

in which \({\widetilde{p}}_u\) is as per (23).

The following simplifies the formula while seeking to preserve the rates in n and p. Firstly, we have

$$\begin{aligned}&\sqrt{\frac{2 {\widetilde{p}}_u\cdot (2p+1)}{cn}\Delta _1(n^{-\frac{1}{3}})} \end{aligned}$$
(25)
$$\begin{aligned}&\quad \le \sqrt{\frac{4\cdot (2p+1)s^{2\rho }}{cn\cdot (2p+1)^{1/3}}\Delta _1(n^{-\frac{1}{3}}) \cdot \frac{cn^{1/3}}{\Delta _1(n^{-\frac{1}{3}})}+\frac{4cn^{2/3}(2p+1)s^{\rho }}{K(2p+1)^{2/3} \Delta _1(n^{-\frac{1}{3}})}\left( \Gamma +\frac{2}{n^{1/3}}\right) \cdot \frac{\Delta _1(n^{-\frac{1}{3}})}{cn}} \nonumber \\&\qquad +\sqrt{\frac{2}{cn}\Delta _1(n^{-\frac{1}{3}})\cdot \left( 8s+1\right) \cdot (2p+1)} \nonumber \\&\quad \le \sqrt{\frac{4(2p+1)^{2/3}s^{2\rho }}{n^{2/3}}+\frac{4s^{\rho }\cdot (\Gamma +\frac{2}{n^{1/3}})\cdot (2p+1)^{1/3}}{K n^{1/3}}}+ \sqrt{\frac{2}{nc}\Delta _1(n^{-\frac{1}{3}})\cdot \left( 8s+1\right) \cdot (2p+1)}, \end{aligned}$$
(26)

which is due to \(\sqrt{x+y}\le \sqrt{x}+\sqrt{y}\) for any \(x,\, y\ge 0\) and the relations that \(0<a<{\mathcal {U}}_L^{-1}\le 1\), \(0<c\le 0.5\), \( K\ge 1\), and \(\Delta _1(n^{-\frac{1}{3}})\ge \ln 36\).

Similar to the above, we obtain

$$\begin{aligned}&{\frac{3 {\widetilde{p}}_u\cdot (2p+1)}{cn}\Delta _1(n^{-\frac{1}{3}})} \nonumber \\&\quad \le {\frac{4\cdot (2p+1)^{2/3}s^{2\rho }}{n^{2/3}}}+ {\frac{2}{nc}\Delta _1(n^{-\frac{1}{3}})\left( 8s+1\right) \cdot (2p+1)}\nonumber \\&\qquad + {\frac{4\cdot s^{\rho }\cdot (\Gamma +\frac{2}{n^{1/3}})}{K\cdot n^{1/3}}}\cdot (2p+1)^{1/3}. \end{aligned}$$
(27)

By (11) and \(\Delta _1(n^{-\frac{1}{3}})=\ln (n^{1/3}p)+{\widetilde{\Delta }}\), we have \( \frac{4(2p+1)^{2/3}s^{2\rho }}{n^{2/3}}+\frac{4(\Gamma +\frac{2}{n^{1/3}})\cdot (2p+1)^{1/3}s^{\rho }}{K n^{1/3}}\le O(1)\) and \( {\frac{2}{nc}\Delta _1(n^{-\frac{1}{3}})\left[ 8s+1\right] \cdot (2p+1)}\le O(1)\). Therefore, it holds that \( {\frac{2 \widetilde{p}_u}{cn}\Delta _1(n^{-\frac{1}{3}})(2p+1)} \le O(1)\cdot \sqrt{\frac{(2p+1)^{2/3}s^{2\rho }}{n^{2/3}}+\frac{(\Gamma +\frac{2}{n^{1/3}})\cdot (2p+1)^{1/3}\cdot s^{\rho }}{K n^{1/3}}} +O(1)\cdot \sqrt{{\frac{\Delta _1(n^{-\frac{1}{3}})}{nc}\cdot \left( 8s+1\right) \cdot (2p+1)}}\). Combining the above with (26) and (27), the inequality in (24) can be simplified into \( {\mathbb {F}}(\mathbf{X}^{RSAA})-{\mathbb {F}}({{\mathbf {X}}^*}) \le O(1)s^{1-\rho }\cdot \frac{K \cdot \Delta _1(n^{-\frac{1}{3}})\cdot p^{2/3}}{c\cdot n^{2/3}}+ O(1)\cdot K\cdot \sqrt{\frac{p^{2/3}s^{2\rho }}{n^{2/3}}+\frac{(\Gamma +\frac{2}{n^{1/3}})\cdot p^{1/3}s^{\rho }}{K n^{1/3}}} +O(1)\cdot K \sqrt{\frac{s p}{nc}\Delta _1(n^{-\frac{1}{3}})} +\frac{2}{n^{1/3}}+\Gamma \). Together with \(\Delta _1(n^{-\frac{1}{3}})\ge \ln 2\), \(K \ge 1\), and \(0<c\le 0.5\), the above becomes

$$\begin{aligned}&{\mathbb {F}}({\mathbf {X}}^{RSAA})-{\mathbb {F}}({{\mathbf {X}}^*}) \nonumber \\&\quad \le O(1)\cdot \left( \frac{s^{1-\rho }\cdot \Delta _1(n^{-1/3}) \cdot p^{2/3}}{n^{2/3}}+\frac{p^{1/3}\cdot s^{\rho }}{n^{1/3}}+\sqrt{\frac{s\cdot p\cdot \Delta _1(n^{-1/3})}{n}}\right) \cdot K \nonumber \\&\qquad +O(1)\cdot \sqrt{\frac{K \cdot s^{\rho }\cdot p^{1/3}\cdot \Gamma }{n^{1/3}}}+\Gamma , \end{aligned}$$
(28)

which then shows Theorem 7 since \(\Delta _1(n^{-\frac{1}{3}}):=\ln \left( 18n^{1/3} (K_C+\mathcal C_\mu ) \cdot p\cdot R\right) \). \(\square \)
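As a quick back-of-the-envelope check (ours, not part of the proof), the rate terms in (28) are consistent with the \({\widetilde{O}}(1)\cdot \frac{p}{\epsilon ^3}\) sample complexity claimed in the abstract: dropping constants, logarithms, and the rank \(s\), the bound behaves like \((p/n)^{2/3}+(p/n)^{1/3}+\sqrt{p/n}\), and setting \(n=p/\epsilon ^3\) keeps every term at \(O(\epsilon )\).

```python
import numpy as np

def gap(n, p):
    # Dominant rate terms of (28), ignoring constants, logs, and the rank s.
    return (p / n) ** (2 / 3) + (p / n) ** (1 / 3) + np.sqrt(p / n)

for p in [10, 100, 1000]:
    for eps in [0.1, 0.01]:
        n = p / eps**3                 # the near-linear-in-p sample size
        # (p/n)^{1/3} = eps dominates; the other terms are eps^2 and eps^{3/2}
        print(f"p={p}, eps={eps}: gap={gap(n, p):.4f}")
```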

1.1.3 Proof of Corollary 10

Proof

Lemma 27 implies that \({\mathcal {F}}_{n,\lambda }({\mathbf {X}}^{RSAA},\,{\mathbf{Z}}_1^n)\le {\mathcal {F}}_{n,\lambda }({{\mathbf {X}}}^*,\,{\mathbf{Z}}_1^n)+\lambda \Vert {\mathbf {X}}^*\Vert _*\) almost surely. Below we invoke the results from Theorem 7 with \(\Gamma =\lambda \Vert {\mathbf {X}}^*\Vert _*\) and the assumptions that \(\rho =0\) and \(\lambda =\lambda (0)\). Note that it is assumed that

$$\begin{aligned} n> C_2\cdot p\cdot {\mathcal {U}}_L\cdot {[}\ln (np)+{\widetilde{\Delta }}]\cdot s^{3/2}R^{3/2}> O(1)\cdot p\cdot a^{-1}\cdot [\ln (np)+{\widetilde{\Delta }}]\cdot s^{3/2}R^{3/2}, \end{aligned}$$
(29)

and \(\frac{\Gamma }{K }\le \frac{\lambda \Vert \mathbf{X}^*\Vert _*}{K}\le \frac{ \Vert {\mathbf {X}}^*\Vert _*\cdot \sqrt{\frac{8K \cdot (2p+1)^{2/3}}{c\cdot a\cdot n^{2/3}}[\ln (n^{1/3}p)+{\widetilde{\Delta }}]}}{K}\) (as well as \(K \ge 1\)). In view of (29), it then holds under Assumption 1 that \(\frac{\Gamma }{K} \le R s\cdot \sqrt{\frac{8(2p+1)^{2/3}}{cK\cdot a\cdot n^{2/3}}[\ln (n^{1/3}p)+{\widetilde{\Delta }}]}\le O(1)\cdot \sqrt{\frac{Rs}{a^{1/3} }[\ln (n^{1/3}p)+{\widetilde{\Delta }}]^{1/3}}\). Therefore, \(\left( \frac{\Gamma }{K}\right) ^{3}\le \left( O(1)\cdot \sqrt{\frac{Rs}{a^{1/3} }[\ln (n^{1/3}p)+\widetilde{\Delta }]^{1/3}}\right) ^{3} \le O(1)\cdot R^{3/2}s^{3/2}\sqrt{a^{-1}\cdot [\ln (n^{1/3}p)+{\widetilde{\Delta }}]},\) for some universal constants O(1). Furthermore, since \(a<\mathcal U_L^{-1}\le 1\), it holds that, if n satisfies (13) for some universal constant \(C_2\), then \( n > O(1)\cdot p\cdot a^{-1 }\cdot [\ln (n^{1/3}p)+\widetilde{\Delta }]\cdot s^{3/2}R^{3/2} \ge O(1)\cdot p\cdot R^{3/2}s^{3/2}\sqrt{a^{-1}\cdot [\ln (n^{1/3}p)+\widetilde{\Delta }]}+O(1)\cdot p+ C_1\cdot s\cdot p\cdot \left( \ln (n^{1/3}p)+{\widetilde{\Delta }}\right) \ge C_1\cdot \left[ \left( \frac{\Gamma }{K}\right) ^{3}p+p+s\cdot p\cdot \left( \ln (n^{1/3}p)+{\widetilde{\Delta }}\right) \right] .\) Therefore, condition (11) in Theorem 7 is met and thus (12) in Theorem 7 implies that

$$\begin{aligned} {\mathbb {F}}({\mathbf {X}}^{RSAA})-{\mathbb {F}}({{\mathbf {X}}^*})\le & {} O(1)\cdot K\cdot \left( \frac{sp^{2/3}\Delta _1(n^{-1/3})}{n^{2/3}}+\sqrt{\frac{sp\Delta _1(n^{-\frac{1}{3}})}{n}}+\frac{p^{1/3}}{n^{1/3}}\right) \\&+O(1)\cdot \sqrt{\frac{Kp^{1/3}(\lambda \Vert {\mathbf {X}}^*\Vert _*)}{n^{1/3}}}+\lambda \Vert {\mathbf {X}}^*\Vert _*, \end{aligned}$$

with probability at least \( 1-2 (2p+1) \exp (-\widetilde{c}n)-6\exp \left( -2cn^{1/3}\cdot (2p+1)^{2/3} \right) \). Note that \(a<1\), \(K \ge 1\), \(p\ge 1\), \(\left[ \ln (n^{1/3}p)+\widetilde{\Delta }\right] \ge 1\) and \(\sqrt{\frac{sp\Delta _1(n^{-\frac{1}{3}})}{n}}\le \frac{s(2p+1)^{1/3}\cdot \sqrt{\Delta _1(n^{-\frac{1}{3}})}}{n^{1/3}}\) (due to (13) again). Hence, \(\mathbb F({\mathbf {X}}^{RSAA})-{\mathbb {F}}({{\mathbf {X}}^*}) \le O(1)\cdot K\cdot \left[ \frac{sp^{2/3}\cdot \left( \ln (np)+\widetilde{\Delta }\right) }{n^{2/3}}+\frac{p^{1/3}}{n^{1/3}}\right] +O(1)\cdot \frac{sRK\cdot (2p+1)^{1/3}}{\min \left\{ a^{1/2}n^{1/3},\,a^{1/4}n^{1/3}\right\} } \left[ \ln (n^{1/3}p)+{\widetilde{\Delta }}\right] ^{1/2}\), which shows Part (ii) by further noticing that \(a=\frac{1}{2{\mathcal {U}}_L}\) and \({\mathcal {U}}_L\ge 1\). \(\square \)

1.1.4 Proof of Corollary 12

Proof

The proof follows almost the same argument as in Sect. A.1.3 for proving Corollary 10, except that the choices of user-specified parameters are different. Again, Lemma 27 implies that \(\mathcal F_{n,\lambda }({\mathbf {X}}^{RSAA},\,{{\mathbf {Z}}}_1^n)\le \mathcal F_{n,\lambda }({{\mathbf {X}}}^*,\,{{\mathbf {Z}}}_1^n)+\lambda \Vert {\mathbf {X}}^*\Vert _*\) almost surely. As in Part (ii), below we invoke the results from Theorem 7 with \(\Gamma =\lambda \Vert {\mathbf {X}}^*\Vert _*\) and the assumptions that \(\rho =2/3\) and \(\lambda =\lambda (\frac{2}{3})\). Note that it is assumed that

$$\begin{aligned} n> C_3\cdot p\cdot {\mathcal {U}}_L\cdot [\ln (np)+{\widetilde{\Delta }}]\cdot s^{2}R^{3/2}> O(1)\cdot p\cdot a^{-1}\cdot [\ln (np)+{\widetilde{\Delta }}]\cdot s^{2}R^{3/2}, \end{aligned}$$
(30)

and \(\frac{\Gamma }{K }\le \frac{\lambda \Vert \mathbf{X}^*\Vert _*}{K}\le \frac{ \Vert {\mathbf {X}}^*\Vert _*\cdot \sqrt{\frac{8K \cdot (2p+1)^{2/3}\cdot s^{-2/3}}{c\cdot a\cdot n^{2/3}}[\ln (n^{1/3}p)+{\widetilde{\Delta }}]}}{K}\) (as well as \(K \ge 1\)). In view of (30), it then holds under Assumption 1 that \(\frac{\Gamma }{K} \le R s\cdot \sqrt{\frac{8(2p+1)^{2/3} s^{-2/3}}{cK\cdot a\cdot n^{2/3}}[\ln (n^{1/3}p)+{\widetilde{\Delta }}]}\le O(1)\cdot \sqrt{\frac{R}{a^{1/3} }[\ln (n^{1/3}p)+{\widetilde{\Delta }}]^{1/3}}\). Therefore, \(\left( \frac{\Gamma }{K}\right) ^{3}\le \left( O(1)\cdot \sqrt{\frac{R}{a^{1/3} }[\ln (n^{1/3}p)+\widetilde{\Delta }]^{1/3}}\right) ^{3} \le O(1)\cdot R^{3/2} \sqrt{a^{-1}\cdot [\ln (n^{1/3}p)+{\widetilde{\Delta }}]},\) for some universal constants O(1). Furthermore, since \(a<{\mathcal {U}}_L^{-1}\le 1\), it holds that, if n satisfies (17), then \( n > O(1)\cdot p\cdot a^{-1 }\cdot [\ln (n^{1/3}p)+\widetilde{\Delta }]\cdot s^{2}R^{3/2} \ge O(1)\cdot p\cdot R^{3/2}s^{2}\sqrt{a^{-1}\cdot [\ln (n^{1/3}p)+\widetilde{\Delta }]}+O(1)\cdot s^2\cdot p+ C_1s\cdot p \cdot \left( \ln (n^{1/3}p)+{\widetilde{\Delta }}\right) \ \ge C_1\cdot \left[ s^2\left( \frac{\Gamma }{K}\right) ^{3}p+s^2\cdot p+s\cdot p\cdot \left( \ln (n^{1/3}p)+{\widetilde{\Delta }}\right) \right] .\) Therefore, (11) in Theorem 7 is met and thus (12) in Theorem 7 implies that

$$\begin{aligned}&{\mathbb {F}}({\mathbf {X}}^{RSAA})-{\mathbb {F}}({{\mathbf {X}}^*})\\&\quad \le O(1)\cdot K\cdot \left( \frac{s^{1/3}p^{2/3}\Delta _1(n^{-1/3})}{n^{2/3}}+\sqrt{\frac{sp\Delta _1(n^{-\frac{1}{3}})}{n}}+\frac{p^{1/3}\cdot s^{2/3}}{n^{1/3}}\right) \\&\qquad +O(1)\cdot \sqrt{\frac{Kp^{1/3}\cdot s^{2/3}\cdot (\lambda \Vert {\mathbf {X}}^*\Vert _*)}{n^{1/3}}}+\lambda \Vert {\mathbf {X}}^*\Vert _*, \end{aligned}$$

with probability at least \( 1-2 (2p+1) \exp (-\widetilde{c}n)-6\exp \left( -2cn^{1/3}\cdot (2p+1)^{2/3} \right) \). Note that \(a<1\), \(K \ge 1\), \(p\ge s\ge 1\), \(\left[ \ln (n^{1/3}p)+\widetilde{\Delta }\right] \ge 1\) and \(\sqrt{\frac{sp\Delta _1(n^{-\frac{1}{3}})}{n}}\le \frac{(2p+1)^{1/3}\cdot \sqrt{\Delta _1(n^{-\frac{1}{3}})}}{n^{1/3}}\) (in view of (17) again). Hence, \({\mathbb {F}}({\mathbf {X}}^{RSAA})-{\mathbb {F}}({\mathbf{X}^*}) \le O(1)\cdot K\cdot \left[ \frac{s^{1/3}p^{2/3}\cdot \left( \ln (np)+{\widetilde{\Delta }}\right) }{n^{2/3}}+\frac{s^{2/3}\cdot p^{1/3}}{n^{1/3}}\right] +O(1)\cdot \frac{s^{2/3}RK\cdot (2p+1)^{1/3}}{\min \left\{ a^{1/2}n^{1/3},\,a^{1/4}n^{1/3}\right\} } \left[ \ln (n^{1/3}p)+{\widetilde{\Delta }}\right] ^{1/2}\), which shows Part (iii) by further noticing that \(a=\frac{1}{2{\mathcal {U}}_L}\). \(\square \)

1.1.5 Pillar results for sample complexity

Proposition 21

Suppose that \(a<{{\mathcal {U}}_L}^{-1}\). Assume that the \(\hbox {S}^3 ONC({\mathbf {Z}}_1^n)\) is satisfied almost surely at \({\mathbf {X}}^{RSAA}\in {\mathcal {S}}_p \). Then,

$$\begin{aligned} {\mathbb {P}}[\{\vert \sigma _j({\mathbf {X}}^{RSAA})\vert \notin (0,\,a\lambda ) \text { for all } j\}]=1. \end{aligned}$$

Proof

Since \({\mathbf {X}}^{RSAA}\) satisfies the \(\hbox {S}^3\)ONC\(({\mathbf {Z}}_1^n)\) almost surely, Eq. (9) implies that for any \(j\in \{1,...,p\}\), if \(\sigma _j({\mathbf {X}}^{RSAA})\in (0,\,a\lambda )\), then

$$\begin{aligned} 0\le&\,{\mathcal {U}}_L+\left[ \frac{\partial ^2 P_\lambda (\vert \sigma _j({\mathbf {X}})\vert )}{[\partial \sigma _j(\mathbf{X})]^2}\right] _{{\mathbf {X}}={\mathbf {X}}^{RSAA}} = \mathcal U_L-\frac{1}{a}. \end{aligned}$$
(31)

Further observe that \(\frac{\partial ^2 P_\lambda (t)}{\partial t^2}=-a^{-1}\) for \(t\in (0,\,a\lambda )\). Therefore, (31) contradicts the assumption that \(\mathcal U_L<\frac{1}{a}\). This contradiction implies that

$$\begin{aligned}&{\mathbb {P}}[\{{\mathbf {X}}^{RSAA}\text { satisfies the } \hbox {S}^3\text { ONC}({\mathbf {Z}}_1^n)\}\cap \{\vert \sigma _j({\mathbf {X}}^{RSAA})\vert \in (0,\,a\lambda )\}]=0\\&\quad \Longrightarrow 0\ge 1-{\mathbb {P}}[\{{\mathbf {X}}^{RSAA}\text { does not satisfy the } \hbox {S}^3\text { ONC }({\mathbf {Z}}_1^n)\}]-{\mathbb {P}}[\{\vert \sigma _j({\mathbf {X}}^{RSAA})\vert \notin (0,\,a\lambda )\}]. \end{aligned}$$

Since \({\mathbb {P}}[\{{\mathbf {X}}^{RSAA}\text { satisfies the } \hbox {S}^3\text { ONC }({\mathbf {Z}}_1^n)\}]=1\), it holds that \({\mathbb {P}}[\{\vert \sigma _j({\mathbf {X}}^{RSAA})\vert \notin (0,\,a\lambda )\}]=1\) for all \(j=1,...,p\), which immediately leads to the desired result. \(\square \)

Proposition 22

Suppose that Assumptions 3 and 4 hold. Let \(\epsilon \in (0,\, 1]\), \({\widetilde{p}}:\,{\widetilde{p}}> s\), \(\Delta _1(\epsilon ):=\ln \left( \frac{18\cdot {(K_C+\mathcal C_\mu )}\cdot p\cdot R}{\epsilon }\right) \), and \(\mathcal B_{{\widetilde{p}},R}:=\left\{ {\mathbf {X}}\in {\mathcal {S}}_p:\, \sigma _{\max }({\mathbf {X}})\le R,\, \mathbf{rk}({\mathbf {X}})\le {\widetilde{p}} \right\} .\) Then, for the same \(c\in (0,\,0.5]\) as in (8) and for some \({\widetilde{c}}>0\),

$$\begin{aligned} \max _{{\mathbf {X}}\in {\mathcal {B}}_{\widetilde{p},R}}\left| \frac{1}{n}\sum _{i=1}^n f({\mathbf {X}},Z_i)-\mathbb F({\mathbf {X}})\right| \le {\frac{K }{\sqrt{n}}}\sqrt{\frac{2 {\widetilde{p}}(2p+1)}{c}\Delta _1(\epsilon )} +\frac{K}{n}\cdot \frac{2 \widetilde{p}(2p+1)}{c}\Delta _1(\epsilon )+\epsilon \end{aligned}$$

with probability at least \(1-2\exp \left( - \widetilde{p}(2p+1)\Delta _1(\epsilon )\right) -2\exp (-{\widetilde{c}}n)\).

Proof

We follow an “\(\epsilon \)-net” argument similar to that of [22] to construct a net of discretization grids \({\mathcal {G}}(\epsilon ):=\{\widetilde{{\mathbf {X}}}^k\}\subseteq {\mathcal {B}}_{{\widetilde{p}},R} \) such that for any \({\mathbf {X}}\in {\mathcal {B}}_{{\widetilde{p}},R} \), there is \({\mathbf {X}}^k\in {\mathcal {G}}(\epsilon )\) satisfying \(\Vert {\mathbf {X}}^k-{\mathbf {X}}\Vert \le \frac{\epsilon }{2K_C+2{\mathcal {C}}_\mu }\) for any fixed \(\epsilon \in (0,\,1]\).

Invoking Lemma 28, for an arbitrary \({\mathbf {X}}\in {\mathcal {B}}_{{\widetilde{p}},R}\), to guarantee that there always exists \(\widetilde{{\mathbf {X}}}^k \in \mathcal G(\epsilon )\) such that \(\left\| {\mathbf {X}}-\widetilde{\mathbf{X}}^k\right\| \le \frac{\epsilon }{ (2K_C+2{\mathcal {C}}_\mu )}\), it suffices for the number of grid points to be no more than \(\left( \frac{18R\sqrt{{\widetilde{p}}}\cdot (K_C+\mathcal C_\mu )}{\epsilon }\right) ^{(2p+1){\widetilde{p}}}\). Now, we may observe

$$\begin{aligned}&{\mathbb {P}}\left[ \max _{{\mathbf {X}}^k\in \mathcal G(\epsilon )}\left| \frac{1}{n}\sum _{i=1}^n f(\mathbf{X}^k,Z_i)-{\mathbb {E}}\left[ \frac{1}{n}\sum _{i=1}^n f(\mathbf{X}^k,Z_i)\right] \right| \le K \sqrt{\frac{ t}{n}} +\frac{K t}{n} \right] \nonumber \\&\quad ={\mathbb {P}}\left[ \bigcap _{{\mathbf {X}}^k\in {\mathcal {G}}(\epsilon )}\left\{ \left| \frac{1}{n}\sum _{i=1}^n f({\mathbf {X}}^k,Z_i)-{\mathbb {E}}\left[ \frac{1}{n}\sum _{i=1}^n f({\mathbf {X}}^k,Z_i)\right] \right| \le K \sqrt{\frac{ t}{n}} +\frac{K t}{n} \right\} \right] \nonumber \\&\quad \ge 1-\sum _{{\mathbf {X}}^k\in {\mathcal {G}}(\epsilon )}{\mathbb {P}}\left[ \left| \frac{1}{n}\sum _{i=1}^n f({\mathbf {X}}^k,Z_i)-{\mathbb {E}}\left[ \frac{1}{n}\sum _{i=1}^n f({\mathbf {X}}^k,Z_i)\right] \right| > K \sqrt{\frac{ t}{n}} +\frac{K t}{n}\right] . \end{aligned}$$
(32)

Further invoking Eq. (8), for the same c as in (8), it holds that

$$\begin{aligned}&{\mathbb {P}}\left[ \max _{{\mathbf {X}}^k\in \mathcal G(\epsilon )}\left| \frac{1}{n}\sum _{i=1}^n f(\mathbf{X}^k,Z_i)-{\mathbb {E}}\left[ \frac{1}{n}\sum _{i=1}^n f(\mathbf{X}^k,Z_i)\right] \right| \le K \sqrt{\frac{ t}{n}} +\frac{K t}{n} \right] \\&\quad \ge 1-\vert {\mathcal {G}}(\epsilon )\vert \cdot 2\exp (-ct) \ge \,1-2\left( \frac{18R\sqrt{{\widetilde{p}}}\cdot (K_C+{\mathcal {C}}_\mu )}{\epsilon }\right) ^{(2p+1){\widetilde{p}}} \cdot \exp (-ct). \end{aligned}$$

Combined with Lemmas 25 and 26,

$$\begin{aligned}&\max _{\begin{array}{c} {{\mathbf {X}}}\in {\mathcal {B}}_{{\widetilde{p}}, R},\,{\mathbf {X}}^k\in {\mathcal {G}}(\epsilon ) \end{array}}\left\{ \left| \frac{1}{n}\sum _{i=1}^n f({\mathbf {X}},Z_i) -\frac{1}{n}\sum _{i=1}^n f({\mathbf {X}}^k,Z_i) \right| \right. \nonumber \\&\qquad \left. +\left| {\mathbb {E}}\left[ \frac{1}{n}\sum _{i=1}^n f({\mathbf {X}},Z_i)\right] -{\mathbb {E}}\left[ \frac{1}{n}\sum _{i=1}^n f({\mathbf {X}}^k,Z_i) \right] \right| \right\} \nonumber \\&\quad \le 2(K_C+{\mathcal {C}}_{\mu })\cdot \frac{\epsilon }{2K_C+2{\mathcal {C}}_\mu }=\epsilon , \end{aligned}$$
(33)

with probability at least \(1-2\exp (-{\widetilde{c}}\cdot n)\) for some problem-independent \({\widetilde{c}}>0\). Observe that for any \({\mathbf {X}}\in {\mathcal {B}}_{{\widetilde{p}},R}\) and \({\mathbf {X}}^k\in \mathcal G(\epsilon )\), it holds that \(\left| {\mathcal {F}}_n(\mathbf{X},{\mathbf {Z}}_1^n)-\mathbb E\left[ {\mathcal {F}}_n({\mathbf {X}},{\mathbf {Z}}_1^n)\right] \right| \le \left| {\mathcal {F}}_n({\mathbf {X}}^k,{\mathbf {Z}}_1^n)-\mathbb E\left[ {\mathcal {F}}_n({\mathbf {X}}^k,\mathbf{Z}_1^n)\right] \right| + \left| {\mathcal {F}}_n({\mathbf {X}},{\mathbf {Z}}_1^n) -{\mathcal {F}}_n({\mathbf {X}}^k,{\mathbf {Z}}_1^n) \right| +\left| \mathbb E\left[ {\mathcal {F}}_n({\mathbf {X}},{\mathbf {Z}}_1^n)\right] -\mathbb E\left[ {\mathcal {F}}_n({\mathbf {X}}^k,{\mathbf {Z}}_1^n) \right] \right| .\) Therefore, with probability at least \(1-2\exp (-{\widetilde{c}}\cdot n)\),

$$\begin{aligned} \max _{\begin{array}{c} {{\mathbf {X}}}\in {\mathcal {B}}_{{\widetilde{p}}, R},\,{\mathbf {X}}^k\in {\mathcal {G}}(\epsilon ) \end{array}} \left\{ \left| {\mathcal {F}}_n({\mathbf {X}},\mathbf{Z}_1^n)-{\mathbb {E}}\left[ {\mathcal {F}}_n({\mathbf {X}},\mathbf{Z}_1^n)\right] \right| -\left| {\mathcal {F}}_n(\mathbf{X}^k,{\mathbf {Z}}_1^n)-{\mathbb {E}}\left[ {\mathcal {F}}_n(\mathbf{X}^k,{\mathbf {Z}}_1^n)\right] \right| \right\} \le \epsilon .\nonumber \\ \end{aligned}$$
(34)

Further invoking (32), we now obtain that

$$\begin{aligned} \max _{\begin{array}{c} {{\mathbf {X}}}\in {\mathcal {B}}_{{\widetilde{p}}, R},\,{\mathbf {X}}^k\in {\mathcal {G}}(\epsilon ) \end{array}}\left| \mathcal F_n({\mathbf {X}},{\mathbf {Z}}_1^n)-{\mathbb {F}}(\mathbf{X})\right| \le \epsilon + K \sqrt{\frac{ t}{n}} +\frac{K t}{n}, \end{aligned}$$

with probability at least \(1-2\left( \frac{18R\sqrt{\widetilde{p}}\cdot (K_C+\mathcal C_\mu )}{\epsilon }\right) ^{(2p+1){\widetilde{p}}} \cdot \exp (-ct)-2\exp (-{\widetilde{c}}\cdot n)\). Finally, we may let \(t:=\frac{2{\widetilde{p}}}{c}\cdot (2p+1)\cdot \Delta _1(\epsilon )\), where \(\Delta _1(\epsilon ):=\ln \left( \frac{18\cdot (K_C+\mathcal C_\mu )\cdot p\cdot R}{\epsilon }\right) \), and obtain the desired result. \(\square \)
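The union-bound mechanism above can be illustrated with a toy simulation (ours; the linear cost \(f({\mathbf {X}},Z)=\langle Z,{\mathbf {X}}\rangle \) and the random grid are hypothetical stand-ins for \({\mathcal {G}}(\epsilon )\)): the maximal deviation between the average cost and the expected cost over a finite grid of \(m\) points decays roughly like \(\sqrt{\ln m/n}\).

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 5, 200
# A hypothetical finite grid standing in for the eps-net G(eps).
grid = [G / np.linalg.norm(G) for G in rng.standard_normal((m, p, p))]

def max_deviation(n):
    Z = rng.standard_normal((n, p, p))    # E<Z_i, X> = 0 for every grid point X
    return max(abs(np.tensordot(Z, X, axes=2).mean()) for X in grid)

for n in [100, 1000, 10000]:
    print(n, max_deviation(n))            # shrinks roughly like sqrt(ln(m)/n)
```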

Proposition 23

Suppose that Assumptions 1 through 3 hold, the solution \(\mathbf{X}^{RSAA}\in {\mathcal {S}}_p:\,\sigma _{\max }({\mathbf {X}}^{RSAA})\le R\) satisfies \(\hbox {S}^3\)ONC\(({\mathbf {Z}}_1^n)\) almost surely,

$$\begin{aligned}&{\mathcal {F}}_{n,\lambda }({\mathbf {X}}^{RSAA},{\mathbf {Z}}_1^n)\le {\mathcal {F}}_{n,\lambda }({{\mathbf {X}}}^*,\mathbf{Z}_1^n)+\Gamma ,~~w.p.1. \end{aligned}$$
(35)

where \(\Gamma \ge 0\), \(\epsilon \in (0,\, 1]\), \( \Delta _1(\epsilon ):=\ln \left( \frac{18\cdot {(K_C+\mathcal C_\mu )}\cdot p\cdot R}{\epsilon }\right) \). For a positive integer \({\widetilde{p}}_u:\,{\widetilde{p}}_u>s\), if

$$\begin{aligned} ({\widehat{p}}-s)\cdot P_\lambda (a\lambda ) > \frac{4K}{cn}\Delta _1(\epsilon )\cdot {\widehat{p}}\cdot (2p+1) +{\frac{2K}{\sqrt{n}}}\sqrt{\frac{2{\widehat{p}}\cdot (2p+1)}{c}\Delta _1(\epsilon )}+\Gamma +2\epsilon , \end{aligned}$$
(36)

for all \({\widehat{p}}:\, {\widetilde{p}}_u\le {\widehat{p}}\le p\), then \( {\mathbb {P}}[\mathbf{rk}({\mathbf {X}}^{RSAA})\le \widetilde{p}_u-1]\ge \,1-2 p \exp (-{\widetilde{c}}n)-4\exp \left( - \widetilde{p}_u(2p+1)\Delta _1(\epsilon )\right) \) for the same c in (8) and some \({\widetilde{c}}>0\).

Proof

This proof generalizes Proposition EC.3 of [13] from bounding the sparsity of an \(\hbox {S}^3\)ONC solution to bounding its rank. Although the argument is similar, the details are quite different and so is the resulting statement. Define \({\mathcal {B}}_{R}:=\{{\mathbf {X}}\in \mathcal S_p:\,\sigma _{\max }({\mathbf {X}})\le R\}\). Define a few events:

$$\begin{aligned} {\mathcal {E}}_{1}:=&\,\left\{ (\widetilde{{\mathbf {X}}},\,\widetilde{{\mathbf {Z}}}_1^n)\in {\mathcal {B}}_{R}\times {\mathcal {W}}^n:\,{\mathcal {F}}_{n,\lambda }(\widetilde{{\mathbf {X}}},\,\widetilde{{\mathbf {Z}}}_1^n)\le {\mathcal {F}}_{n,\lambda }({{\mathbf {X}}}^*,\,\widetilde{{\mathbf {Z}}}_1^n)+\Gamma \right\} ,\\ {\mathcal {E}}_2:=&\,\{\widetilde{{\mathbf {X}}}\in {\mathcal {B}}_{R}:\, \vert \sigma _j(\widetilde{{\mathbf {X}}})\vert \notin (0,\,a\lambda ) \text { for all } j\}, \\ {\mathcal {E}}_{3,{\widehat{p}}}:=&\,\left\{ \widetilde{\mathbf{X}}\in {\mathcal {B}}_{R}:\,\mathbf{rk}(\widetilde{{\mathbf {X}}})=\widehat{p}\right\} , \end{aligned}$$

where \(c\) in \({\mathcal {E}}_{5,{\widehat{p}}}\) (defined below) is the same universal constant as in (8), and \({\widehat{p}}:\, {\widetilde{p}}_u\le {\widehat{p}}\le p\) (thus \({\widehat{p}}>s\) by the assumption that \({\widetilde{p}}_u>s\)). For any \((\widetilde{{\mathbf {X}}},\widetilde{\mathbf{Z}}_1^n)\in \{(\widetilde{{\mathbf {X}}},\widetilde{{\mathbf {Z}}}_1^n)\in {\mathcal {E}}_{1}\}\cap \{\widetilde{{\mathbf {X}}}\in {\mathcal {E}}_2\cap {\mathcal {E}}_{3,{\widehat{p}}}\}\), where \(\widetilde{{\mathbf {Z}}}_1^n=(\widetilde{Z}_1,...,{\widetilde{Z}}_n)\), since \(\widetilde{{\mathbf {X}}}\in {\mathcal {E}}_{3,{\widehat{p}}}\cap \mathcal E_2\) means that \(\widetilde{{\mathbf {X}}}\) has \({\widehat{p}}\)-many non-zero singular values, none of which lies in the interval \((0,\,a\lambda )\), it holds that

$$\begin{aligned} {\mathcal {F}}_n(\widetilde{{\mathbf {X}}},\widetilde{{\mathbf {Z}}}_1^n) +{\widehat{p}}P_\lambda (a\lambda )\le \frac{1}{n} \mathcal F_n({{\mathbf {X}}^*},\widetilde{{\mathbf {Z}}}_1^n) +s P_\lambda (a\lambda )+\Gamma , \end{aligned}$$
(37)

Notice that \({\mathbf {X}}^*\in {\mathcal {B}}_R:\,\mathbf{rk}(\mathbf{X}^*)=s<{\widehat{p}}\) by Assumption 1. We may obtain that, for all \(\widetilde{{\mathbf {X}}} \in \mathcal E_{3,{\widehat{p}}}\),

$$\begin{aligned}&\frac{1}{n}\sum _{i=1}^n f({{\mathbf {X}}}^*,{\widetilde{Z}}_i) -\frac{1}{n}\sum _{i=1}^nf(\widetilde{{\mathbf {X}}},{\widetilde{Z}}_i) \nonumber \\&\quad = \left[ \frac{1}{n}\sum _{i=1}^n f({{\mathbf {X}}}^*,{\widetilde{Z}}_i) -{\mathbb {F}}({{\mathbf {X}}^*}) \right] + \left[ {\mathbb {F}} (\widetilde{{\mathbf {X}}}) -\frac{1}{n}\sum _{i=1}^nf(\widetilde{{\mathbf {X}}},{\widetilde{Z}}_i) \right] + \left[ {\mathbb {F}} ({{\mathbf {X}}^*} ) -\mathbb F(\widetilde{{\mathbf {X}}}) \right] \nonumber \\&\quad \le 2\max _{{\mathbf {X}}\in {\mathcal {E}}_{3,{\widehat{p}}}} \left| \frac{1}{n}\sum _{i=1}^n f({\mathbf {X}},{\widetilde{Z}}_i)-{\mathbb {F}}({\mathbf {X}} ) \right| +{\mathbb {F}}({{\mathbf {X}}^*}) -{\mathbb {F}}(\widetilde{{\mathbf {X}}})\nonumber \\&\quad \le \, 2\max _{{\mathbf {X}}\in {\mathcal {E}}_{3,{\widehat{p}}}} \,\,\left| \frac{1}{n}\sum _{i=1}^n f({\mathbf {X}},\widetilde{Z}_i)-{\mathbb {F}}({{\mathbf {X}}}) \right| , \end{aligned}$$
(38)

where the last inequality is due to \({\mathbb {F}}({{\mathbf {X}}^*}) \le {\mathbb {F}}({{\mathbf {X}}})\) for all \(\mathbf{X}\in {\mathcal {S}}_p\) by the definition of \({\mathbf {X}}^*\). Define that

$$\begin{aligned} {\mathcal {E}}_{4}:=&\,\left\{ (\widetilde{\mathbf{X}},\,\widetilde{{\mathbf {Z}}}_1^n)\in {\mathcal {B}}_{R}\times \mathcal W^n:\,\widetilde{{\mathbf {X}}} \text { satisfies } \hbox {S}^3\text { ONC }(\widetilde{{\mathbf {Z}}}_1^n)\right\} \\ {\mathcal {E}}_{5,{\widehat{p}}}:=&\,\Bigg \{\widetilde{\mathbf{Z}}_1^n\in {\mathcal {W}}^n:\,\max _{\mathbf{X}\in {\mathcal {B}}_{R}:\,\mathbf{rk}({\mathbf {X}})\le {\widehat{p}}}\left| {\mathcal {F}}_n({\mathbf {X}},\widetilde{\mathbf{Z}}_1^n)-{\mathbb {F}} ({\mathbf {X}})\right| \le {\frac{K }{\sqrt{n}}}\sqrt{\frac{2 {\widehat{p}}(2p+1)}{c}\Delta _1(\epsilon )} \\&\qquad +\frac{K}{n}\cdot \frac{2 {\widehat{p}}(2p+1)}{c}\Delta _1(\epsilon )+\epsilon \Bigg \}, \end{aligned}$$

Now let us examine the following set:

$$\begin{aligned} \Lambda =&\{(\widetilde{{\mathbf {X}}},\widetilde{{\mathbf {Z}}}_1^n):\,(\widetilde{{\mathbf {X}}},\widetilde{{\mathbf {Z}}}_1^n)\in {\mathcal {E}}_{1}\cap {\mathcal {E}}_{4}\}\cap \{(\widetilde{{\mathbf {X}}},\widetilde{{\mathbf {Z}}}_1^n):\,\widetilde{{\mathbf {X}}} \in {\mathcal {E}}_{3,{\widehat{p}}}\cap {\mathcal {E}}_2\}\cap \{(\widetilde{{\mathbf {X}}},\widetilde{{\mathbf {Z}}}_1^n):\,\widetilde{{\mathbf {Z}}}_{1}^n\in {\mathcal {E}}_{5,{\widehat{p}}}\}. \end{aligned}$$

Combined with (37) and (38), \(\Lambda \ne \emptyset \Longrightarrow (\widehat{p}-s)\cdot P_\lambda (a\lambda )\le {\frac{2K }{\sqrt{n}}}\sqrt{\frac{2 {\widehat{p}}(2p+1)}{c}\Delta _1(\epsilon )} +\frac{2K}{n}\cdot \frac{2 \widehat{p}(2p+1)}{c}\Delta _1(\epsilon )+2\epsilon +\Gamma \), which contradicts (36) for all \({\widehat{p}}:\, {\widetilde{p}}_u\le {\widehat{p}}\le p\). Now we recall the definition of \({\mathbf {X}}^{RSAA}\in {\mathcal {B}}_{R}\), which is a solution that satisfies the \(\hbox {S}^3\)ONC\((\mathbf{Z}_1^n)\), w.p.1., and \({\mathcal {F}}_{n,\lambda }(\mathbf{X}^{RSAA},\,\widetilde{{\mathbf {Z}}}_1^n)\le \mathcal F_{n,\lambda }({{\mathbf {X}}}^*,\,\widetilde{{\mathbf {Z}}}_1^n)+\Gamma \), w.p.1. Invoking Proposition 21, we have \(\mathbb P\left[ ({\mathbf {X}}^{RSAA},{{\mathbf {Z}}}_1^n)\in {\mathcal {E}}_{1}\cap {\mathcal {E}}_{4},\, {\mathbf {X}}^{RSAA} \in {\mathcal {E}}_2\right] =1\). Hence,

$$\begin{aligned} 0&={\mathbb {P}}\left[ \Lambda \right] \ge \,1-{\mathbb {P}}\left[ {\mathbf {X}}^{RSAA} \notin {\mathcal {E}}_{3,{\widehat{p}}} \right] -{\mathbb {P}}\left[ {\mathbf {Z}}_1^n \notin {\mathcal {E}}_{5,{\widehat{p}}}\right] \\&\quad -\left\{ 1-{\mathbb {P}}\left[ ({\mathbf {X}}^{RSAA},{\mathbf{Z}}_1^n)\in {\mathcal {E}}_{1}\cap {\mathcal {E}}_{4},\, \mathbf{X}^{RSAA} \in {\mathcal {E}}_2 \right] \right\} , \end{aligned}$$

for all \({\widehat{p}}:\, {\widetilde{p}}_u\le {\widehat{p}}\le p\). The above then implies that \({\mathbb {P}}\left[ {\mathbf {Z}}_1^n \notin {\mathcal {E}}_{5,{\widehat{p}}} \right] \ge {\mathbb {P}}\left[ {\mathbf {X}}^{RSAA} \in {\mathcal {E}}_{3,{\widehat{p}}} \right] \) for all \({\widehat{p}}:\, {\widetilde{p}}_u\le {\widehat{p}}\le p\). Therefore, \({\mathbb {P}}[\mathbf{rk}(\mathbf{X}^{RSAA})= {\widehat{p}}]\le 1-{\mathbb {P}}\left[ {\mathbf {Z}}_1^n \in \mathcal E_{5,{\widehat{p}}}\right] \) for all \({\widehat{p}}:\, {\widetilde{p}}_u\le {\widehat{p}}\le p\). Together with Proposition 22, we have that

$$\begin{aligned} {\mathbb {P}}[\mathbf{rk}({\mathbf {X}}^{RSAA})\le {\widetilde{p}}_u-1]&={\mathbb {P}}[\mathbf{rk}(\mathbf{X}^{RSAA})\notin \{{\widetilde{p}}_u,\, {\widetilde{p}}_u+1,..., p\}] \nonumber \\&=1-{\mathbb {P}}\left[ \bigcup _{{\widehat{p}}={\widetilde{p}}_u}^{p}\{{\mathbf{rk}({\mathbf {X}}^{RSAA})}={\widehat{p}}\}\right] \nonumber \\&\ge 1- \sum _{{\widehat{p}}={\widetilde{p}}_u}^{p} {\mathbb {P}}[\mathbf{rk}({\mathbf {X}}^{RSAA})={\widehat{p}}]\nonumber \\&\ge 1-\sum _{{\widehat{p}}={\widetilde{p}}_u}^{p}\left( 1-\mathbb P\left[ {\mathbf {Z}}_1^n \in {\mathcal {E}}_{5,{\widehat{p}}}\right] \right) \nonumber \\&\ge 1-2({p}-{\widetilde{p}}_u+1)\exp (-{\widetilde{c}}n)\nonumber \\&\quad -\sum _{{\widehat{p}}={\widetilde{p}}_u}^{p} 2\exp \left( - {\widehat{p}}(2p+1)\cdot \Delta _1(\epsilon )\right) . \end{aligned}$$
(39)

where \({\widetilde{c}}>0\) is some universal constant. Observing that \(\Delta _1(\epsilon )=\ln \left( \frac{18\cdot (K_C+\mathcal C_\mu )\cdot p\cdot R}{\epsilon }\right) >1\) and that the sum in (39) is a geometric series, we have

$$\begin{aligned} {\mathbb {P}}[\mathbf{rk}({\mathbf {X}}^{RSAA})\le {\widetilde{p}}_u-1]\ge&1-\frac{2\exp \left( - {\widetilde{p}}_u (2p+1)\Delta _1(\epsilon )\right) }{1- \exp \left( -(2p+1)\Delta _1(\epsilon )\right) }-2 {p} \exp (-\widetilde{c}n). \end{aligned}$$
(40)

Further noting that \(\frac{2\exp \left( - {\widetilde{p}}_u (2p+1)\Delta _1(\epsilon )\right) }{1- \exp \left( -(2p+1)\Delta _1(\epsilon )\right) }\le 4\exp \left( - {\widetilde{p}}_u(2p+1)\Delta _1(\epsilon )\right) \), we then have the desired result. \(\square \)

Proposition 24

Let \(\Delta _1(\epsilon ):=\ln \left( \frac{18\cdot (K_C+\mathcal C_\mu )\cdot p\cdot R}{\epsilon }\right) .\) Assume that (i) the solution \({\mathbf {X}}^{RSAA}\) satisfies \(\hbox {S}^3\)ONC\(({\mathbf {Z}}_1^n)\) almost surely; (ii) \({\mathcal {F}}_{n,\lambda }({\mathbf {X}}^{RSAA},\,{\mathbf{Z}}_1^n)\le {\mathcal {F}}_{n,\lambda }({{{\mathbf {X}}}}^*,\,{\mathbf{Z}}_1^n)+\Gamma \) with probability one; and (iii) for some integer \({\widetilde{p}}_u:\, {\widetilde{p}}_u>s\), it holds that

$$\begin{aligned} {\widehat{p}}> s+\frac{4K\cdot {\widehat{p}}\cdot (2p+1)}{cn\cdot P_\lambda (a\lambda ) }\Delta _1(\epsilon ) +{\frac{2K}{\sqrt{n}\cdot P_\lambda (a\lambda ) }}\sqrt{\frac{2{\widehat{p}}\cdot (2p+1)}{c}\Delta _1(\epsilon )}+\frac{\Gamma +2\epsilon }{P_\lambda (a\lambda )}, \end{aligned}$$
(41)

for all \({\widehat{p}}:\, {\widetilde{p}}_u\le {\widehat{p}}\le p\), any \(\Gamma \ge 0\), and any \(\epsilon \in (0,\,1]\). It then holds that

$$\begin{aligned}&{\mathbb {F}}({\mathbf {X}}^{RSAA})-{\mathbb {F}}({{{\mathbf {X}}}}^*) \nonumber \\&\quad \le \frac{4K\cdot {\widetilde{p}}_u\cdot (2p+1)}{cn}\Delta _1(\epsilon ) +{\frac{2K}{\sqrt{n}}}\sqrt{\frac{2{\widetilde{p}}_u\cdot (2p+1)}{c}\Delta _1(\epsilon )}+\Gamma +2\epsilon +s P_\lambda (a\lambda ), \end{aligned}$$
(42)

with probability at least \( P^*:= \,1-2 (p+1) \exp (-{\widetilde{c}}n) -6\exp \left( - \widetilde{p}_u(2p+1)\Delta _1(\epsilon )\right) \) for some universal constant \({\widetilde{c}}>0\).

Proof

We first observe that \(\Delta _1(\epsilon ):=\ln \left( \frac{18 \cdot (K_C+{\mathcal {C}}_\mu )\cdot p\cdot R}{\epsilon }\right) \ge \ln 36\) because \(p\ge 1\), \(K_C,\,{\mathcal {C}}_\mu ,\,R\ge 1\), and \(0<\epsilon \le 1\). By assumption,

$$\begin{aligned}{\mathcal {F}}_{n,\lambda }(\mathbf{X}^{RSAA},\,{{\mathbf {Z}}}_1^n)\le {\mathcal {F}}_{n,\lambda }({{\mathbf {X}}}^*,\,{\mathbf{Z}}_1^n)+\Gamma , \end{aligned}$$

w.p.1. This, together with \(P_\lambda (t)\ge 0\) for all \(t\ge 0\) and \(\mathbf{rk}({\mathbf {X}}^*) =s\), yields that \(\frac{1}{n}\sum _{i=1}^n f({\mathbf {X}}^{RSAA},Z_i) \le \frac{1}{n} \sum _{i=1}^n f({{\mathbf {X}}^*},Z_i) +s P_\lambda (a\lambda )+\Gamma , \,a.s.\) Furthermore, conditioning on the events that (a) \(\mathbf{rk}({\mathbf {X}}^{RSAA})\le {\widetilde{p}}_u\), (b) \( \max _{\mathbf{X}\in \mathcal B_{{\widetilde{p}}_u,R}}\left| \frac{1}{n}\sum _{i=1}^n f(\mathbf{X},Z_i)-{\mathbb {E}}\left[ \frac{1}{n}\sum _{i=1}^n f(\mathbf{X},Z_i)\right] \right| \le {\frac{K}{\sqrt{n}}}\sqrt{\frac{{\widetilde{p}}_u\cdot (2p+1)}{c}\Delta _1(\epsilon )} +\frac{K}{n}\frac{{\widetilde{p}}_u\cdot (2p+1)}{c}\Delta _1(\epsilon )+\epsilon ,\) we obtain that \( \mathbb F({\mathbf {X}}^{RSAA})-{\mathbb {F}}({{\mathbf {X}}^*}) \le s\cdot P_\lambda (a\lambda )+ {\frac{2K }{\sqrt{n}}}\sqrt{\frac{2 \widetilde{p}_u\cdot (2p+1)}{c}\Delta _1(\epsilon )} +\frac{4K}{n}\frac{{\widetilde{p}}_u\cdot (2p+1)}{c}\Delta _1(\epsilon )+2\epsilon +\Gamma \), a.s. Further invoking Propositions 22 and 23, we have that both events hold simultaneously with probability at least \(P^*\), which verifiably implies the claimed results. \(\square \)

1.1.6 Useful Lemmata

Lemma 25

Under Assumption 4, for some universal constant \(c>0\), it holds with probability at least \(1-2\exp (-c\cdot n)\) that

$$\begin{aligned}&\max _{\begin{array}{c} {{\mathbf {X}}}_1,\,{{\mathbf {X}}}_2\in {\mathcal {S}}_p \\ \cap \{{{\mathbf {X}}}:\,\sigma _{\max }({{\mathbf {X}}})\le R,\\ \,\Vert {\mathbf {X}}_1-{\mathbf {X}}_2\Vert \le \tau \} \end{array}}\{\vert \mathcal F_n({{\mathbf {X}}}_1, \mathbf{Z}_1^n)-{\mathcal {F}}_n({{\mathbf {X}}}_2, {\mathbf {Z}}_1^n)\vert \}\le \left( 2 K_C+{\mathcal {C}}_\mu \right) \cdot \tau , \end{aligned}$$

for any given \(\tau \ge 0\).

Proof

This proof follows a closely similar lemma in [22]. A similar proof has also been provided in [13], but some subtle differences in the problem context exist and thus we redo the proof here. By Assumption 4, for some \(c>0\),

$$\begin{aligned} {\mathbb {P}}\left( \left| \sum _{i=1}^n \frac{1}{n}\left\{ \mathcal C(Z_i)-{\mathbb {E}}[{\mathcal {C}}(Z_i)] \right\} \right| >K_C\left( \frac{t}{n}+\sqrt{\frac{t}{n}}\right) \right) \le 2\exp \left( -c t\right) ,\qquad \forall t\ge 0. \end{aligned}$$

If we let \(t:=n\) and observe that \({\mathbb {E}}[{\mathcal {C}}(Z_i)]\le {\mathcal {C}}_\mu \), we immediately have that

$$\begin{aligned} {\mathbb {P}}\left( \sum _{i=1}^n\frac{{\mathcal {C}}(Z_i)}{n}\le 2 K_C+{\mathcal {C}}_\mu \right) \ge 1-2\exp \left( -c n\right) . \end{aligned}$$
(43)

If we invoke Assumption 4 again given the event that \(\left\{ \sum _{i=1}^n\frac{{\mathcal {C}}(Z_i)}{n}\le 2 K_C+{\mathcal {C}}_\mu \right\} \), we have that for any \({\mathbf{X}}_1,{{\mathbf {X}}}_2\in {\mathcal {S}}_p\),

$$\begin{aligned}&\max _{\begin{array}{c} {{\mathbf {X}}}_1,\,{{\mathbf {X}}}_2\in {\mathcal {S}}_p \\ \cap \{{{\mathbf {X}}}:\,\sigma _{\max }({{\mathbf {X}}})\le R,\\ \,\Vert {\mathbf {X}}_1-{\mathbf {X}}_2\Vert \le \tau \} \end{array}}\left| \frac{1}{n}\sum _{i=1}^n f({{\mathbf {X}}}_1,\, Z_i)-\frac{1}{n}\sum _{i=1}^n f({{\mathbf {X}}}_2,\, Z_i)\right| \\&\quad \le \,\max _{\begin{array}{c} {{\mathbf {X}}}_1,\,{\mathbf{X}}_2\in {\mathcal {S}}_p \\ \cap \{{{\mathbf {X}}}:\,\sigma _{\max }({{\mathbf {X}}})\le R,\\ \,\Vert {\mathbf {X}}_1-{\mathbf {X}}_2\Vert \le \tau \} \end{array}} \frac{1}{n}\sum _{i=1}^n\vert f({{\mathbf {X}}}_1,\, Z_i)- f({\mathbf{X}}_2,\, Z_i)\vert \\&\quad \le \,\max _{\begin{array}{c} {{\mathbf {X}}}_1,\,{{\mathbf {X}}}_2\in {\mathcal {S}}_p \\ \cap \{{{\mathbf {X}}}:\,\sigma _{\max }({{\mathbf {X}}})\le R,\\ \,\Vert {\mathbf {X}}_1-{\mathbf {X}}_2\Vert \le \tau \} \end{array}}\frac{1}{n}\sum _{i=1}^n {\mathcal {C}}(Z_i)\Vert {\mathbf{X}}_1-{{\mathbf {X}}}_2\Vert \le (2K_C+ {\mathcal {C}}_\mu )\cdot \tau . \end{aligned}$$

We have the desired result by combining the above with (43). \(\square \)

Lemma 26

Under Assumption 4, for all

$$\begin{aligned} {{\mathbf {X}}}_1,\,{{\mathbf {X}}}_2\in \mathcal S_p:\,\,\max \{\sigma _{\max }({{\mathbf {X}}}_1),\,\sigma _{\max }({\mathbf{X}}_2)\}\le R, \end{aligned}$$

it holds that

$$\begin{aligned} \left| {\mathbb {E}}[ {\mathcal {F}}_n({{\mathbf {X}}}_1, \mathbf{Z}_1^n)]-{\mathbb {E}}[{\mathcal {F}}_n({{\mathbf {X}}}_2, \mathbf{Z}_1^n)]\right| \le {\mathcal {C}}_\mu \cdot \Vert {\mathbf{X}}_1-{{\mathbf {X}}}_2\Vert . \end{aligned}$$
(44)

Proof

This proof follows a closely similar lemma in [22]. Again, a similar proof has been provided in [13], but some subtle differences make it necessary to repeat the argument here. As per Assumption 4, it holds that

$$\begin{aligned} {\mathbb {E}}\left[ \vert {\mathcal {F}}_n({{\mathbf {X}}}_1, \mathbf{Z}_1^n)-{\mathcal {F}}_n({{\mathbf {X}}}_2, \mathbf{Z}_1^n)\vert \right] \le {\mathbb {E}}\left[ \sum _{i=1}^n\frac{\mathcal C(Z_i)}{n}\Vert {{\mathbf {X}}}_1-{{\mathbf {X}}}_2\Vert \right] . \end{aligned}$$

Due to the convexity of the function \(\vert \cdot \vert \), it therefore holds that

$$\begin{aligned} \left| {\mathbb {E}}\left[ {\mathcal {F}}_n({{\mathbf {X}}}_1, {\mathbf {Z}}_1^n)\right] -{\mathbb {E}}\left[ {\mathcal {F}}_n({\mathbf{X}}_2, {\mathbf {Z}}_1^n)\right] \right|&\le {\mathbb {E}}\left[ \sum _{i=1}^n\frac{{\mathcal {C}}(Z_i)}{n}\Vert {{\mathbf {X}}}_1-{\mathbf{X}}_2\Vert \right] \\&= {\mathbb {E}}\left[ \sum _{i=1}^n\frac{{\mathcal {C}}(Z_i)}{n}\right] \cdot \Vert {{\mathbf {X}}}_1-{{\mathbf {X}}}_2\Vert . \end{aligned}$$

Invoking Assumption 4 again, it holds that \({\mathbb {E}}\left[ \sum _{i=1}^n\frac{{\mathcal {C}}(Z_i)}{n}\right] = \frac{\sum _{i=1}^n{\mathbb {E}}[{\mathcal {C}}(Z_i)]}{n}\le \mathcal C_\mu \), since \({\mathbb {E}}[{\mathcal {C}}(Z_i)]\le {\mathcal {C}}_\mu \) for all \(i=1,...,n\); this immediately leads to the desired result. \(\square \)

Lemma 27

Denote \( {{\mathbf {X}}}^{\ell _1}_{\lambda }\in \underset{{\mathbf{X}}\in {\mathcal {S}}_p}{\arg \,\min }\, {\mathcal {F}}_{n}({\mathbf {X}},\, {\mathbf {Z}}_1^n)+\lambda \left\| {\mathbf {X}} \right\| _{*}\). Then it holds that \({\mathcal {F}}_{n,\lambda }( {\mathbf{X}}^{\ell _1}_{\lambda },\,{\mathbf {Z}}_1^n)\le \mathcal F_{n,\lambda }({{\mathbf {X}}}^*,\, {\mathbf {Z}}_1^n)+\lambda \Vert {{\mathbf {X}}}^* \Vert _*\).

Proof

This proof generalizes a similar one in [13] from a sparsity-inducing penalty to a low-rankness-inducing penalty; that is, from \(\ell _1\) regularization to nuclear-norm-based regularization. We first invoke the definition of \(P_\lambda \) to obtain

$$\begin{aligned} 0\le P_\lambda (t)=\int _0^{t}\frac{[a\lambda -\theta ]_+}{a}d\theta \le \int _0^{t}\frac{a\lambda }{a}d\theta =\lambda \cdot t. \end{aligned}$$
(45)

for all \(t\ge 0\). Secondly, by the definition of \( {\mathbf{X}}^{\ell _1}_{\lambda }\),

$$\begin{aligned} {\mathcal {F}}_{n}( {{\mathbf {X}}}^{\ell _1}_{\lambda },\, \mathbf{Z}_1^n)+\lambda \Vert {{\mathbf {X}}}^{\ell _1}_{\lambda } \Vert _*\le {\mathcal {F}}_{n}({{\mathbf {X}}}^{*},\, {\mathbf {Z}}_1^n)+\lambda \Vert {{\mathbf {X}}}^{*} \Vert _*. \end{aligned}$$
(46)

Combining (45) and (46), it holds that

$$\begin{aligned} {\mathcal {F}}_{n}( {{\mathbf {X}}}^{\ell _1}_{\lambda },\, \mathbf{Z}_1^n)+\sum _{j=1}^pP_\lambda \left( \vert \sigma _j( {\mathbf{X}}^{\ell _1}_{\lambda })\vert \right)&\le {\mathcal {F}}_{n}( {{\mathbf {X}}}^{\ell _1}_{\lambda },\, \mathbf{Z}_1^n)+\sum _{j=1}^p\lambda \cdot \vert \sigma _j( {\mathbf{X}}^{\ell _1}_{\lambda })\vert \\&\le {\mathcal {F}}_{n}({{\mathbf {X}}}^{*},\, {\mathbf {Z}}_1^n)+ \sum _{j=1}^pP_\lambda \left( \vert \sigma _j({{\mathbf {X}}}^*)\vert \right) +\lambda \Vert {{\mathbf {X}}}^{*} \Vert _*, \end{aligned}$$

as desired. \(\square \)
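For concreteness, the initializer \({\mathbf {X}}^{\ell _1}_{\lambda }\) of Lemma 27 can be computed by any off-the-shelf convex solver. The sketch below uses cvxpy with a hypothetical least-squares empirical cost standing in for \({\mathcal {F}}_n\) (the paper's references point to CVX in Matlab [8, 9]; the data model and parameter values here are illustrative assumptions only).

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
p, n, lam = 10, 50, 0.1
A = rng.standard_normal((n, p, p))       # hypothetical scenario data Z_1^n
y = rng.standard_normal(n)

X = cp.Variable((p, p), symmetric=True)  # X ranges over the symmetric matrices S_p
resid = cp.hstack([cp.trace(A[i] @ X) - y[i] for i in range(n)])
F_n = cp.sum_squares(resid) / n          # stand-in for the average cost F_n(X, Z_1^n)
prob = cp.Problem(cp.Minimize(F_n + lam * cp.normNuc(X)))
prob.solve()
X_l1 = X.value                           # the nuclear-norm initializer of Lemma 27
```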

Lemma 28

Let \(S_{r,R}:=\{X\in \Re ^{p\times p}:\,\mathbf{rk}(X)\le r,\,\sigma _{\max }(X)\le R \}\). Then, in terms of the Frobenius norm, there exists an \(\epsilon \)-net \({\bar{S}}_r\) for \(S_{r,R}\) obeying \(\vert {\bar{S}}_r\vert \le \left( \frac{9\sqrt{r}R}{\epsilon }\right) ^{(2p+1)r}\).

Proof

The proof is closely similar to that of Lemma 3.1 in [4], with minor differences. We still present it here for completeness and to ensure that the minor differences do not create a gap in our theory. Denote by \(X:=U\Sigma V^\top \) the singular value decomposition (SVD) of a matrix in \(S_{r,R}\). Let D be the set of rank-r diagonal matrices with nonnegative diagonal entries and spectral norm no more than R; thus any matrix in D has Frobenius norm no more than \(\sqrt{r}\cdot R\). We take \({\bar{D}}\) to be an \(\frac{\epsilon }{3}\)-net (in terms of the Frobenius norm) for D with \(\vert {\bar{D}}\vert \le \left( \frac{9\sqrt{r}R}{\epsilon }\right) ^r\).

Let \(O_{p,r}:=\{U\in \Re ^{p\times r}:\,U^\top U=I\}\). For the convenience of analysis on \(O_{p,r}\), we may as well consider \({\widehat{Q}}_{p,r}:=\{ X\in \Re ^{p\times r}:\,\Vert X\Vert _{1,2}\le 1\}\) and \(\Vert X\Vert _{1,2}=\max _j\Vert X_j\Vert \), where \(X_j\) denotes the jth column of X. Verifiably, \( O_{p,r}\subset {\widehat{Q}}_{p,r}\). We may create an \(\frac{\epsilon }{3\sqrt{r}R}\)-net for \({\widehat{Q}}_{p,r}\), denoted by \({\bar{O}}_{p,r}\), which satisfies that \(\vert \bar{O}_{p,r}\vert \le (9\sqrt{r}R/\epsilon )^{pr}\).

For any \(X\in S_{r,R}\), one may decompose X and obtain \(X=U\Sigma V^\top \). There exists \({\bar{X}}={\bar{U}}{\bar{\Sigma }}\bar{V}^\top \in {\bar{S}}_{r}\) with \({\bar{U}},\,{\bar{V}}\in \bar{O}_{p,r}\) and \({\bar{\Sigma }}\in {\bar{D}}\) such that \(\Vert U-\bar{U}\Vert _{1,2}\le \epsilon /(3\sqrt{r}R)\), \(\Vert V-\bar{V}\Vert _{1,2}\le \epsilon /(3\sqrt{r}R)\), and \(\Vert \Sigma -{\bar{\Sigma }}\Vert _F\le \epsilon /3\). This gives \(\Vert X-\bar{X}\Vert _F=\Vert U\Sigma V^\top -{\bar{U}}{\bar{\Sigma }}\bar{V}^\top \Vert _F=\Vert U\Sigma V^\top -{\bar{U}}\Sigma V^\top +\bar{U}\Sigma V^\top -{\bar{U}}{\bar{\Sigma }} V^\top +{\bar{U}}{\bar{\Sigma }} V^\top -{\bar{U}}{\bar{\Sigma }}{\bar{V}}^\top \Vert _F\le \Vert (U-\bar{U})\Sigma V^\top \Vert _F+\Vert {\bar{U}}(\Sigma -{\bar{\Sigma }}) V^\top \Vert _F+\Vert {\bar{U}}{\bar{\Sigma }}(V-{\bar{V}})^\top \Vert _F\). Since V has orthonormal columns, \(\Vert (U-{\bar{U}})\Sigma V^\top \Vert _F=\Vert (U-{\bar{U}})\Sigma \Vert _F=\sqrt{\sum _{1\le j\le r}[\sigma _j(X)]^2\cdot \Vert {\bar{U}}_j- U_j\Vert _2^2}\le \sqrt{ \Vert \Sigma \Vert _F^2\cdot \Vert U-{\bar{U}}\Vert _{1,2}^2}\le \epsilon /3\), where \(U_j\) is the jth column of U. By a symmetric argument, we may also obtain that \(\Vert {\bar{U}}{\bar{\Sigma }}(V-{\bar{V}})^\top \Vert _F\le \epsilon /3\). For the second term, we notice that \(\Vert {\bar{U}}(\Sigma -{\bar{\Sigma }}) V^\top \Vert _F=\Vert \Sigma -{\bar{\Sigma }}\Vert _F\le \epsilon /3\). Combining the above provides the desired result. \(\square \)
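As a sanity check (illustrative arithmetic only, not from the paper), the metric entropy implied by Lemma 28 is \(\ln \vert {\bar{S}}_r\vert \le (2p+1)r\ln (9\sqrt{r}R/\epsilon )\), i.e., linear in \(p\) for a fixed rank \(r\); this linearity is precisely what drives the almost-linear-in-\(p\) sample complexity.

```python
import numpy as np

def log_covering_bound(p, r, R, eps):
    # ln |bar S_r| <= (2p+1) * r * ln(9 * sqrt(r) * R / eps), per Lemma 28.
    return (2 * p + 1) * r * np.log(9 * np.sqrt(r) * R / eps)

for p in [100, 1000, 10000]:
    print(p, log_covering_bound(p, r=5, R=1.0, eps=0.01))  # grows linearly in p
```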

1.2 Proof of results concerning the computation of an \(\hbox {S}^3\)ONC solution

Proof of Theorem 18

To derive the closed-form solution, note that, without loss of generality, we may reduce (20) to the following problem: for a fixed \({\mathbf {Y}}\in {\mathcal {S}}_p\), solve the minimization problem

$$\begin{aligned} \min _{{\mathbf {X}}\in {\mathcal {S}}_p^+}G(\mathbf{X}):=\frac{L}{2}\Vert {\mathbf {X}}-{\mathbf {Y}}\Vert _F^2+\sum _{j=1}^p P_\lambda (\sigma _j({\mathbf {X}})). \end{aligned}$$
(47)

In fact, when \({\mathbf {Y}}:={\mathbf {X}}^0-\frac{1}{L}\nabla f({\mathbf {X}}^0)\), we recover the exact form of (20).

We will divide the rest of the proof of the closed-form solution into three steps.

Step 1. This step shows a useful property: \({\widehat{\sigma }}_j\notin (0,\,a\lambda )\) for all \(j=1,...,p\), where \({\widehat{\sigma }}_j\) is the optimal solution to (53).

Observe that the optimal solution must satisfy the second-order necessary conditions wherever the second-order derivative exists. In particular, when \({\widehat{\sigma }}_j\in (0,a\lambda )\), the second-order necessary condition reads \(\left[ \frac{\partial ^2\left[ \frac{L}{2}\left( \sigma _j(\mathbf{Y})-\sigma _j\right) ^2+P_\lambda (\sigma _j)\right] }{\partial \sigma _j^2}\right] _{\sigma _j={\widehat{\sigma }}_j}=L-\frac{1}{a}\ge 0\). However, the last inequality contradicts the assumption that \(\frac{1}{a}>L\). This contradiction indicates that

$$\begin{aligned} {\widehat{\sigma }}_j\notin (0,\,a\cdot \lambda ), \end{aligned}$$
(48)

as desired in this step.
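The exclusion (48) is easy to verify numerically. The sketch below is a standalone illustration under the assumption that \(P_\lambda \) is the MCP-type penalty consistent with the facts used in this proof (quadratic with curvature \(-1/a\) on \([0,a\lambda ]\) and constant \(a\lambda ^2/2\) afterwards); it grid-searches the one-dimensional objective of (53) over a range of values of \(\sigma _j({\mathbf {Y}})\) and confirms that the minimizer never falls in \((0,a\lambda )\) when \(1/a>L\). The function names are ours, not the paper's.

```python
import numpy as np

def P(t, lam, a):
    # Folded concave penalty consistent with the proof: quadratic on [0, a*lam]
    # with second derivative -1/a, constant a*lam^2/2 for t >= a*lam.
    t = np.asarray(t, dtype=float)
    return np.where(t <= a * lam, lam * t - t**2 / (2.0 * a), a * lam**2 / 2.0)

def g(s, y, L, lam, a):
    # One-dimensional objective of (53).
    return 0.5 * L * (y - s)**2 + P(s, lam, a)

L, a, lam = 0.5, 1.0, 1.0  # satisfies 1/a > L
grid = np.linspace(0.0, 5.0, 200001)
for y in np.linspace(0.0, 3.0, 31):
    s_hat = grid[np.argmin(g(grid, y, L, lam, a))]
    # (48): the minimizer avoids the open interval (0, a*lam)
    assert not (1e-6 < s_hat < a * lam - 1e-6), (y, s_hat)
print("minimizers avoid (0, a*lam) for all tested y")
```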

Step 2. We next derive a sequence of equivalent reformulations of (47), and hence of (20). These reformulations will eventually show that (20) is equivalent to solving a sequence of one-dimensional problems.

Because \({\mathbf {Y}}\) is symmetric, it admits an eigendecomposition \({\mathbf {Y}}=Q\Sigma _{{\mathbf {Y}}}Q^{-1}\) with Q orthogonal. By the orthogonal invariance of the Frobenius norm, (47) is reformulated into

$$\begin{aligned} \min _{{\mathbf {X}}\in {\mathcal {S}}_p^+}\frac{L}{2}\Vert \Sigma _{{\mathbf {Y}}}-Q^{-1}{\mathbf {X}} Q \Vert _F^2+\sum _{j=1}^{p}P_\lambda (\sigma _j({\mathbf {X}})). \end{aligned}$$
(49)

In view of the results from Step 1 and the fact that \(P_\lambda (t)=\frac{a\lambda ^2}{2}\) for all \(t\ge a\lambda \), the reformulation in (49) is equivalent to

$$\begin{aligned} \min _{{\mathbf {X}}\in {\mathcal {S}}_p^+}\frac{L}{2}\Vert \Sigma _{{\mathbf {Y}}}-Q^{-1}{\mathbf {X}} Q \Vert _F^2+\sum _{j=1}^p\frac{a\lambda ^2}{2}\cdot {\mathbb {I}}( \sigma _j({\mathbf {X}})\ne 0), \end{aligned}$$
(50)

where \( {\mathbb {I}}( \sigma _j({\mathbf {X}})\ne 0)\) is the indicator function that takes the value 1 if \(\sigma _j({\mathbf {X}})\ne 0\) and 0 otherwise.

The optimal solution \(\widehat{{\mathbf {X}}}\) is an element of \({\mathcal {S}}_p^+\) (that is, \(\widehat{{\mathbf {X}}}\) is symmetric and positive semidefinite) and thus admits an eigendecomposition. We may verify that \(\widehat{{\mathbf {X}}}=Q \Lambda _{\widehat{{\mathbf {X}}}} Q^{-1}\) for some diagonal \(\Lambda _{\widehat{{\mathbf {X}}}}\); that is, the same Q is shared by both decompositions. Indeed, for any feasible solution \({\mathbf {X}}\), we may write \(Q^{-1}{\mathbf {X}} Q= \Lambda _{{{\mathbf {X}}}} +H\), where \(\Lambda _{{{\mathbf {X}}}}\) is diagonal and H is hollow (i.e., its diagonal entries are all zero), so that \(\Vert \Sigma _{{\mathbf {Y}}}-Q^{-1}{\mathbf {X}} Q\Vert _F^2=\Vert \Sigma _{{\mathbf {Y}}}-\Lambda _{{{\mathbf {X}}}}\Vert _F^2+\Vert H\Vert ^2_F\), since the cross terms vanish on the diagonal. If \(\Vert H\Vert _F\ne 0\), one may always construct a feasible solution with a strictly smaller objective value for (50). Therefore, (47) is equivalently rewritten as

$$\begin{aligned}&\min _{\Sigma _{{{\mathbf {X}}}}}\Bigg \{\frac{L}{2}\Vert \Sigma _{{\mathbf {Y}}}-\Sigma _{{\mathbf{X}}}\Vert _F^2+\sum _{j=1}^{p}P_\lambda (\sigma _j(\Sigma _{{\mathbf {X}}})):\,\nonumber \\&\quad \quad \Sigma _{{{\mathbf {X}}}} \text { is a positive semidefinite and diagonal matrix}\Bigg \}. \end{aligned}$$
(51)

If \(\Sigma _{\widehat{{\mathbf {X}}}}\) is the optimal solution to (51), then the optimal solution to (47) is recovered as \(Q\Sigma _{\widehat{{\mathbf {X}}}}Q^{-1}\). Since both \(\Sigma _{{\mathbf {Y}}}\) and \(\Sigma _{\widehat{{\mathbf {X}}}}\) are diagonal, we may further reformulate the problem as follows:

$$\begin{aligned} \min _{(\sigma _j)}\left\{ \frac{L}{2}\sum _{j=1}^p\left( \sigma _j(\mathbf{Y})-\sigma _j\right) ^2+\sum _{j=1}^{p}P_\lambda (\sigma _j):\, \sigma _j\ge 0,\,\forall j=1,...,p\right\} . \end{aligned}$$
(52)

Let \({\widehat{\sigma }}_j\), \(j=1,...,p\), denote the optimal solutions. Then the optimal solution to (47) is recovered as \(Q\,\text {diag}(\{{\widehat{\sigma }}_j:\,j=1,...,p\})Q^{-1}\). Furthermore, (52) decouples into a sequence of one-dimensional optimization problems: for each \(j=1,...,p\),

$$\begin{aligned} \min \left\{ g(\sigma _j):=\frac{L}{2}\left( \sigma _j(\mathbf{Y})-\sigma _j\right) ^2+P_\lambda (\sigma _j):\, \sigma _j\ge 0\right\} . \end{aligned}$$
(53)

This completes Step 2 of our proof. The one-dimensional formulation (53) will be essential to the rest of the proof.
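The key step in the reduction above is the orthogonal decomposition \(\Vert \Sigma _{{\mathbf {Y}}}-Q^{-1}{\mathbf {X}} Q\Vert _F^2=\Vert \Sigma _{{\mathbf {Y}}}-\Lambda _{{\mathbf {X}}}\Vert _F^2+\Vert H\Vert _F^2\), which holds because the diagonal part and the hollow part are orthogonal under the Frobenius inner product. A minimal numerical illustration of this identity (our own, with illustrative variable names) follows.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6

# Symmetric Y and its eigendecomposition Y = Q Sigma_Y Q^T with Q orthogonal.
Y = rng.standard_normal((p, p)); Y = (Y + Y.T) / 2.0
evals, Q = np.linalg.eigh(Y)
Sigma_Y = np.diag(evals)

# An arbitrary feasible X (symmetric PSD), expressed in the eigenbasis of Y.
A = rng.standard_normal((p, p))
X = A @ A.T
M = Q.T @ X @ Q
Lambda_X = np.diag(np.diag(M))  # diagonal part
H = M - Lambda_X                # hollow part: zero diagonal

lhs = np.linalg.norm(Sigma_Y - M, 'fro')**2
rhs = np.linalg.norm(Sigma_Y - Lambda_X, 'fro')**2 + np.linalg.norm(H, 'fro')**2
print(abs(lhs - rhs))  # ~1e-12: the cross terms vanish
```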

Step 3. We now prove the correctness of the claimed closed-form solution by considering three different cases.

  • Case 3.1. If \(\sigma _j({\mathbf {Y}})<a\lambda \), we show that \({\widehat{\sigma }}_j=0.\) Suppose otherwise, i.e., \({\widehat{\sigma }}_j>0\). Then (48) implies that \({\widehat{\sigma }}_j\ge a\lambda \), and (recalling g defined in (53)) \(g({\widehat{\sigma }}_j)-g(0)=\frac{L}{2}(\sigma _j({\mathbf {Y}})-{\widehat{\sigma }}_j)^2+\frac{a\lambda ^2}{2}-\frac{L}{2} [\sigma _j({\mathbf {Y}})]^2=\frac{L}{2}{\widehat{\sigma }}_j^2-L\sigma _j({\mathbf {Y}})\cdot {\widehat{\sigma }}_j+\frac{a\lambda ^2}{2}>\frac{L}{2}{\widehat{\sigma }}_j^2-La\lambda \cdot {\widehat{\sigma }}_j+\frac{a^2\lambda ^2\cdot L}{2}=\frac{L}{2}({\widehat{\sigma }}_j-a\lambda )^2\ge 0\), where the strict inequality uses \(\sigma _j({\mathbf {Y}})<a\lambda \) together with \({\widehat{\sigma }}_j>0\), as well as \(\frac{a\lambda ^2}{2}>\frac{a^2\lambda ^2 L}{2}\), which follows from \(aL<1\). Hence \(g({\widehat{\sigma }}_j)>g(0)\), contradicting the optimality of \({\widehat{\sigma }}_j\). The only remaining choice is \({\widehat{\sigma }}_j=0\).

  • Case 3.2. If \(a\lambda \le \sigma _j({\mathbf {Y}})<\sqrt{a/L}\,\lambda \) (note that \(\sqrt{a/L}\,\lambda >a\lambda \) because \(aL<1\)), we show that \({\widehat{\sigma }}_j=0\). By (48), either \({\widehat{\sigma }}_j=0\) or \({\widehat{\sigma }}_j\ge a\lambda \). On \([a\lambda ,\,\infty )\) the penalty is constant, \(P_\lambda (\sigma _j)=\frac{a\lambda ^2}{2}\), so there \(g(\sigma _j)=\frac{L}{2}(\sigma _j({\mathbf {Y}})-\sigma _j)^2+\frac{a\lambda ^2}{2}\), which is minimized at \(\sigma _j=\sigma _j({\mathbf {Y}})\) (feasible since \(\sigma _j({\mathbf {Y}})\ge a\lambda \)) with minimal value \(\frac{a\lambda ^2}{2}\). Comparing this value with \(g(0)=\frac{L}{2}[\sigma _j({\mathbf {Y}})]^2\) yields \(g(\sigma _j({\mathbf {Y}}))-g(0)=\frac{a\lambda ^2}{2}-\frac{L}{2}[\sigma _j({\mathbf {Y}})]^2>\frac{a\lambda ^2}{2}-\frac{L}{2}\cdot \frac{a}{L}\lambda ^2=0\), where the strict inequality is due to \(\sigma _j({\mathbf {Y}})<\sqrt{a/L}\,\lambda \). Therefore 0 is a strictly better solution than any \(\sigma _j\ge a\lambda \), and we conclude that \({\widehat{\sigma }}_j=0\).

  • Case 3.3. If \(\sigma _j({\mathbf {Y}})\ge \sqrt{a/L}\,\lambda \), we show that \({\widehat{\sigma }}_j=\sigma _j({\mathbf {Y}})\). To that end, we first verify that \(\sigma _j({\mathbf {Y}})\) is no worse a solution than 0. This can be seen by observing that \(g(\sigma _j({\mathbf {Y}}))-g(0)=\frac{a\lambda ^2}{2}-\frac{L}{2}[\sigma _j({\mathbf {Y}})]^2\le \frac{a\lambda ^2}{2}-\frac{L}{2}\cdot \frac{a}{L}\lambda ^2=0\), where the inequality is due to \(\sigma _j({\mathbf {Y}})\ge \sqrt{a/L}\,\lambda \). Furthermore, if \({\widehat{\sigma }}_j\ne 0\), then (48) forces \({\widehat{\sigma }}_j\ge a\lambda \), in which case (53) is equivalent to \(\min \left\{ \frac{L}{2}\left( \sigma _j(\mathbf{Y})-\sigma _j\right) ^2+\frac{a\lambda ^2}{2}:\, \sigma _j\ge a\lambda \right\} \), which can be further rewritten as \(\min \left\{ \frac{L}{2}\left( \sigma _j(\mathbf{Y})-\sigma _j\right) ^2:\, \sigma _j\ge a\lambda \right\} \) by removing a constant term; its optimal solution is \(\sigma _j({\mathbf {Y}})\) because \(\sigma _j({\mathbf {Y}})\ge a\lambda \). Combining the two observations gives \({\widehat{\sigma }}_j=\sigma _j({\mathbf {Y}})\).

This then completes the proof of the closed-form solution.

To show the satisfaction of the \(\hbox {S}^3\)ONC, which is the second part of the theorem, we observe that the closed-form solution obeys \(\sigma _j(\widehat{{\mathbf {X}}})\notin (0,\,a\lambda )\) for all j. By definition, it is therefore an \(\hbox {S}^3\)ONC solution. \(\square \)
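Put together, Steps 1–3 amount to a spectral hard-thresholding update. The sketch below is a minimal implementation of this case analysis, assuming the MCP-type penalty used above; the names (`mcp_penalty`, `prox_step`) and the cross-check are ours, not the paper's, and the eigenvalue threshold \(\sqrt{a/L}\,\lambda \) is the one implied by Cases 3.1–3.3.

```python
import numpy as np

def mcp_penalty(t, lam, a):
    # Quadratic on [0, a*lam], constant a*lam^2/2 afterwards.
    t = np.asarray(t, dtype=float)
    return np.where(t <= a * lam, lam * t - t**2 / (2.0 * a), a * lam**2 / 2.0)

def prox_step(X0, grad, L, lam, a):
    """One closed-form update for (20): with Y = X0 - grad/L, minimize
    L/2 ||X - Y||_F^2 + sum_j P_lam(sigma_j(X)) over symmetric PSD X by
    hard-thresholding the eigenvalues of Y at sqrt(a/L)*lam (Cases 3.1-3.3)."""
    Y = X0 - grad / L
    evals, Q = np.linalg.eigh((Y + Y.T) / 2.0)
    kept = np.where(evals >= lam * np.sqrt(a / L), evals, 0.0)
    return Q @ np.diag(kept) @ Q.T

# Cross-check the eigenvalue rule against a brute-force grid search on (53).
L, lam, a = 0.5, 1.0, 1.0  # satisfies 1/a > L
grid = np.linspace(0.0, 10.0, 400001)
for y in [-1.0, 0.3, 1.2, lam * np.sqrt(a / L) + 0.1, 3.0]:
    obj = 0.5 * L * (y - grid)**2 + mcp_penalty(grid, lam, a)
    s_grid = grid[np.argmin(obj)]
    s_rule = y if y >= lam * np.sqrt(a / L) else 0.0
    assert abs(s_grid - s_rule) < 1e-3, (y, s_grid, s_rule)
print("closed form matches grid search on all test values")
```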
