Robust non-parametric regression via incoherent subspace projections

Abstract

This paper establishes alternating projections onto incoherent low-rank subspaces (APIS) as a unifying principle for designing robust regression algorithms that offer consistent model recovery even when a significant fraction of the training points is corrupted by an adaptive adversary. APIS offers the first algorithm for robust non-parametric (kernel) regression with an explicit breakdown point that works for general PSD kernels under minimal assumptions. APIS also offers, as straightforward corollaries, robust algorithms for a much wider variety of well-studied settings, including robust linear regression, robust sparse recovery, and robust Fourier transforms. Algorithms offered by APIS enjoy formal guarantees that are frequently sharper than (especially in non-parametric settings), or competitive with, existing results in these settings. They are also straightforward to implement and outperform existing algorithms in several experimental settings.


Notes

  1. https://dlmf.nist.gov/5.6.

References

  1. Agarwal, A., Negahban, S. N., & Wainwright, M. J. (2012). Fast global convergence of gradient methods for high-dimensional statistical recovery. Annals of Statistics, 40(5), 2452–2482.


  2. Bafna, M., Murtagh, J., & Vyas, N. (2018). Thwarting adversarial examples: an \(L_0\)-robust sparse Fourier transform. In Proceedings of the 32nd annual conference on neural information processing systems (NIPS).

  3. Baraniuk, R. G., Cevher, V., Duarte, M. F., & Hegde, C. (2010). Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4), 1982–2001. https://doi.org/10.1109/TIT.2010.2040894


  4. Barchiesi, D., & Plumbley, M. D. (2013). Learning incoherent subspaces for classification via supervised iterative projections and rotations. In IEEE international workshop on machine learning for signal processing (MLSP) (pp. 1–6). IEEE.

  5. Barchiesi, D., & Plumbley, M. D. (2015). Learning incoherent subspaces: classification via incoherent dictionary learning. Journal of Signal Processing Systems, 79(2), 189–199.


  6. Bhatia, K., Jain, P., & Kar, P. (2015). Robust regression via hard thresholding. In Proceedings of the 29th annual conference on neural information processing systems (NIPS).

  7. Bouwmans, T., Javed, S., Zhang, H., Lin, Z., & Otazo, R. (2018). On the applications of robust PCA in image and video processing. Proceedings of the IEEE, 106(8), 1427–1457. https://doi.org/10.1109/JPROC.2018.2853589


  8. Candes, E. J., & Wakin, M. B. (2008). An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2), 21–30.


  9. Chen, X., & De, A. (2020). Reconstruction under outliers for Fourier-sparse functions. In Proceedings of the ACM-SIAM symposium on discrete algorithms (SODA).

  10. Chen, Y. (2015). Incoherence-optimal matrix completion. IEEE Transactions on Information Theory, 61(5), 2909–2923.


  11. Chen, Y., Bhojanapalli, S., Sanghavi, S., & Ward, R. (2014). Coherent matrix completion. In Proceedings of the 31st international conference on machine learning (ICML).

  12. Cizek, P., & Sadikoglu, S. (2020). Robust nonparametric regression: A review. WIREs Computational Statistics, 12(3), e1492.


  13. Coifman, R., Geshwind, F., & Meyer, Y. (2001). Noiselets. Applied and Computational Harmonic Analysis, 10(1), 27–44.


  14. Dabov, K., Foi, A., Katkovnik, V., & Egiazarian, K. (2007). Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8), 2080–2095. https://doi.org/10.1109/TIP.2007.901238


  15. Diakonikolas, I., Kamath, G., Kane, D., Li, J., Steinhardt, J., Stewart, A. (2019). Sever: A robust meta-algorithm for stochastic optimization. In 36th international conference on machine learning (ICML).

  16. Du, S. S., Wang, Y., Balakrishnan, S., Ravikumar, P., & Singh, A. (2018). Robust nonparametric regression under Huber's \(\epsilon\)-contamination model. arXiv:1805.10406 [math.ST].

  17. Fan, J., Hu, T. C., & Truong, Y. K. (1994). Robust non-parametric function estimation. Scandinavian Journal of Statistics, 21(4), 433–446.


  18. Fan, L., Zhang, F., Fan, H., & Zhang, C. (2019). Brief review of image denoising techniques. Visual Computing for Industry, Biomedicine, and Art, 2(1), 7.


  19. Fasshauer, G. E. (2011). Positive definite kernels: past, present and future. Dolomites Research Notes on Approximation, 4, 21–63.


  20. Foucart, S., & Rauhut, H. (2013). A mathematical introduction to compressive sensing. Birkhäuser: Applied and Numerical Harmonic Analysis.


  21. Getreuer, P. (2012). Rudin–Osher–Fatemi total variation denoising using split Bregman. Image Processing on Line, 2, 74–95.


  22. Gu, S., Zhang, L., Zuo, W., & Feng, X. (2014) Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2862–2869).

  23. Guruswami, V., & Zuckerman, D. (2016) Robust Fourier and polynomial curve fitting. In Proceedings of the 57th IEEE annual symposium on foundations of computer science (FOCS).

  24. Hegde, C., & Baraniuk, R. G. (2012). Signal recovery on incoherent manifolds. IEEE Transactions on Information Theory, 58(12), 7204–7214. https://doi.org/10.1109/TIT.2012.2210860


  25. Krahmer, F., & Ward, R. (2014). Stable and robust sampling strategies for compressive imaging. IEEE Transactions on Image Processing, 23(2), 612–622.


  26. McCoy, M. B., & Tropp, J. A. (2014). Sharp recovery bounds for convex demixing, with applications. Foundations of Computational Mathematics, 14(3), 503–567.


  27. Micchelli, C. A., Xu, Y., & Zhang, H. (2006). Universal kernels. Journal of Machine Learning Research, 7, 2651–2667.


  28. Minh, H. Q., Niyogi, P., & Yao, Y. (2006). Mercer's theorem, feature maps, and smoothing. In Proceedings of the international conference on computational learning theory (COLT).

  29. Mukhoty, B., Gopakumar, G., Jain, P., & Kar, P. (2019). Globally-convergent iteratively reweighted least squares for robust regression problems. In Proceedings of the 22nd international conference on artificial intelligence and statistics (AISTATS).

  30. Prasad, A., Suggala, A. S., Balakrishnan, S., & Ravikumar, P. (2018). Robust estimation via robust gradient estimation. arXiv:1802.06485 [stat.ML].

  31. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.

  32. Rosasco, L., Belkin, M., & Vito, E. D. (2010). On learning with integral operators. Journal of Machine Learning Research, 11, 905–934.


  33. Schnass, K., & Vandergheynst, P. (2010). Classification via incoherent subspaces. arXiv:1005.1471 [cs.CV].

  34. Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600–612.


  35. Zhang, H., Zhou, Y., & Liang, Y. (2015). Analysis of robust PCA via local incoherence. In Proceedings of the 29th annual conference on neural information processing systems (NIPS).

  36. Zhang, K., Zuo, W., Chen, Y., Meng, D., & Zhang, L. (2017). Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7), 3142–3155.


  37. Zhou, Y., Zhang, H., & Liang, Y. (2016). On compressive orthonormal sensing. In 54th annual allerton conference on communication, control, and computing (Allerton).


Acknowledgements

The authors thank the anonymous reviewers for comments that helped improve the presentation of the paper. B.M. thanks the Research-I Foundation for financial support. The work of S.D. was partially supported by the DST-SERB grant ECR/2017/000374. P.K. thanks Microsoft Research India and Tower Research for research grants.

Author information

Corresponding author

Correspondence to Purushottam Kar.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editors: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann.

Appendices


A A generic recovery guarantee for APIS: a proof of Theorem 1

In this section, we will prove Theorem 1. We present the proof in two parts: Lemma 1 presents the main proof ideas for the special case of \(P = 1\) (the so-called known signal support case (Chen and De 2020)), where the union \(\mathscr {A}\) consists of a single subspace A. Recall that we denote by P (resp. Q) the number of subspaces in the union \(\mathscr {A}= \bigcup _{i=1}^P A_i\) in which the uncorrupted signal \({\mathbf{a}}^*\) resides (resp. the union \(\mathscr {B}= \bigcup _{j=1}^Q B_j\) in which the corruption \({\mathbf{b}}^*\) resides). We also recall that the known signal support case already captures linear regression and low-rank kernel ridge regression. Note, however, that even in the known signal support case we may still have \(Q > 1\), i.e. \(\mathscr {B}\) may still be a general non-trivial union of subspaces, e.g. the set of k-sparse vectors, which has \(Q = \left( {\begin{array}{c}n\\ k\end{array}}\right)\). Lemma 2 then extends the proof to the general case where both \(P, Q \ge 1\). We reproduce the APIS algorithm below for ease of reading.

[Algorithm 2: APIS, reproduced from the main text; the algorithm listing does not render here.]
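In place of the listing, the following is a minimal Python sketch of the APIS iteration analysed below: starting from \(\mathbf{a}^0 = \mathbf{0}\), it alternates \(\mathbf{b}^{t+1} = \varPi _\mathscr {B}(\mathbf{y}- \mathbf{a}^t)\) and \(\mathbf{a}^{t+1} = \varPi _\mathscr {A}(\mathbf{y}- \mathbf{b}^{t+1})\). The function names (apis, proj_A, proj_B, hard_threshold) and the default iteration count are ours and purely illustrative; exact projection oracles for the unions \(\mathscr {A}\) and \(\mathscr {B}\) must be supplied for each instantiation, e.g. hard thresholding for k-sparse corruptions.

```python
import numpy as np

def apis(y, proj_A, proj_B, n_iters=100):
    """Minimal sketch of the APIS iteration.
    proj_A(z) must return the (exact) projection of z onto the union of
    subspaces containing the clean signal; proj_B(z) does the same for the
    union containing the corruption."""
    a = np.zeros_like(y)          # APIS initializes a^0 = 0
    b = np.zeros_like(y)
    for _ in range(n_iters):
        b = proj_B(y - a)         # b^{t+1} = Pi_B(y - a^t)
        a = proj_A(y - b)         # a^{t+1} = Pi_A(y - b^{t+1})
    return a, b

def hard_threshold(z, k):
    """Projection onto the union of k-sparse subspaces: keep the k entries of
    largest magnitude and zero out the rest (a common choice for proj_B)."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out
```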

A.1 Convergence analysis for \(P = 1\) i.e. \(\mathscr {A}= A\) but still \(Q \ge 1\)

We now present the proof in the case of known signal support.

Lemma 1

Suppose we obtain data as described in Eq. (1) where the two unions \(\mathscr {A}, \mathscr {B}\) are \(\mu\)-incoherent with \(\mu < \frac{1}{3}\) and, in addition, the union \(\mathscr {A}\) contains a single subspace (the “known signal support” model). Then, for any \(\epsilon > 0\), within \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right)\) iterations, APIS offers \(\left\| \mathbf{a}^T - {\mathbf{a}}^* \right\| _2 \le \epsilon\).

Proof

To simplify notation, we denote \(\mathbf{a}^t {=}{:} \mathbf{a}, \mathbf{b}^t {=}{:} \mathbf{b}, \mathbf{a}^{t+1} {=}{:} {\mathbf{a}^+}, \mathbf{b}^{t+1} {=}{:} {\mathbf{b}^+}, B^{t+1} {=}{:} {B^+}\) (please refer to Algorithm 2 for notation). Let \(\mathfrak {Q}{:}{=} {B^+}\cap {B}^*\) denote the meet of the two subspaces, and let \(\mathfrak {P}{:}{=} {B^+}\cap ({B}^*)^\perp\) and \(\mathfrak {R}{:}{=} {B}^*\cap ({B^+})^\perp\) denote the symmetric difference subspaces (recall that \(A \ni {\mathbf{a}}^*\) and \({B}^*\ni {\mathbf{b}}^*\)).

Denote \(\mathbf{p}= \varPi _A({\mathbf{b}}^*- \mathbf{b})\) and \({\mathbf{p}^+}= \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+})\). In this case we have \({\mathbf{a}^+}= \varPi _A(\mathbf{y}- {\mathbf{b}^+}) = {\mathbf{a}}^*+ \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+})\) (since \({\mathbf{a}}^*\in A\) and orthonormal projections are idempotent) and thus, \(\left\| {\mathbf{a}^+}- {\mathbf{a}}^* \right\| _2 = \left\| \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+}) \right\| _2 = \left\| {\mathbf{p}^+} \right\| _2\). We will show below that \(\left\| {\mathbf{p}^+} \right\| _2 \le 3\mu \cdot \left\| \mathbf{p} \right\| _2\) which will establish, if \(\mu < \frac{1}{3}\), a linear rate of convergence since we will have \(\left\| {\mathbf{a}^+}- {\mathbf{a}}^* \right\| _2 = \left\| {\mathbf{p}^+} \right\| _2 \le 3\mu \cdot \left\| \mathbf{p} \right\| _2 = 3\mu \cdot \left\| \mathbf{a}- {\mathbf{a}}^* \right\| _2\).

We have

$$\begin{aligned} {\mathbf{b}^+}= \varPi _{{B^+}}({\mathbf{a}}^*+ {\mathbf{b}}^*- \mathbf{a}) = \varPi _{{B^+}}({\mathbf{b}}^*- \varPi _A({\mathbf{b}}^*- \mathbf{b})) = \varPi _{{B^+}}({\mathbf{b}}^*- \mathbf{p}), \end{aligned}$$

and thus \({\mathbf{b}}^*- {\mathbf{b}^+}= {\mathbf{b}}^*- \varPi _{{B^+}}({\mathbf{b}}^*- \mathbf{p}) = \varPi _\mathfrak {R}({\mathbf{b}}^*) + \varPi _{{B^+}}(\mathbf{p})\). This gives us, by an application of the triangle inequality,

$$\begin{aligned} \left\| {\mathbf{p}^+} \right\| _2 = \left\| \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+}) \right\| _2 \le \left\| \varPi _A(\varPi _\mathfrak {R}({\mathbf{b}}^*)) \right\| _2 + \left\| \varPi _A(\varPi _{B^+}(\mathbf{p})) \right\| _2 \end{aligned}$$

Now, the projection step assures us that projecting onto \({B^+}\) was the best option out of all the subspaces in \(\mathscr {B}\) and thus, if we denote \(\mathbf{z}= {\mathbf{b}}^*- \mathbf{p}\), then we have, for any subspace \(B \in \mathscr {B}\),

$$\begin{aligned} \left\| \varPi _{B^+}(\mathbf{z}) - \mathbf{z} \right\| _2^2 \le \left\| \varPi _B(\mathbf{z}) - \mathbf{z} \right\| _2^2. \end{aligned}$$

Now, \(\mathbf{z}- \varPi _B(\mathbf{z}) = \varPi _B^\perp (\mathbf{z})\). Using this, in particular, we have, setting \(B = {B}^*\)

$$\begin{aligned} \left\| \varPi _{B^+}^\perp (\mathbf{z}) \right\| _2^2 \le \left\| \varPi _{B^*}^\perp (\mathbf{z}) \right\| _2^2 \end{aligned}$$

Canceling components in the subspace \(({B^+})^\perp \cap ({B}^*)^\perp\), as well as those in the subspace \(\mathfrak {Q}\) gives us

$$\begin{aligned} \left\| \varPi _\mathfrak {R}(\mathbf{z}) \right\| _2^2 \le \left\| \varPi _\mathfrak {P}(\mathbf{z}) \right\| _2^2 = \left\| \varPi _\mathfrak {P}(\mathbf{p}) \right\| _2^2 \end{aligned}$$

since \(\varPi _{B^*}^\perp ({\mathbf{b}}^*) = \mathbf{0}\). Now, \(\varPi _\mathfrak {R}(\mathbf{z}) = \varPi _\mathfrak {R}({\mathbf{b}}^*) - \varPi _\mathfrak {R}(\mathbf{p})\) since projections are linear operators. Applying the triangle inequality gives us \(\left\| \varPi _\mathfrak {R}(\mathbf{z}) \right\| _2 \ge \left\| \varPi _\mathfrak {R}({\mathbf{b}}^*) \right\| _2 - \left\| \varPi _\mathfrak {R}(\mathbf{p}) \right\| _2\). This gives us

$$\begin{aligned} \left\| \varPi _\mathfrak {R}({\mathbf{b}}^*) \right\| _2&\le \left\| \varPi _\mathfrak {P}(\mathbf{p}) \right\| _2 + \left\| \varPi _\mathfrak {R}(\mathbf{p}) \right\| _2\\&\le \left\| \varPi _{B^+}(\mathbf{p}) \right\| _2 + \left\| \varPi _{{B}^*}(\mathbf{p}) \right\| _2, \end{aligned}$$

where the second step follows since orthonormal projections are always non-expansive. Applying incoherence results now tells us that, since \(\mathbf{p}\in \mathscr {A}\), we have

$$\begin{aligned} \left\| \varPi _A(\varPi _\mathfrak {R}({\mathbf{b}}^*)) \right\| _2 \le \sqrt{\mu }\cdot \left\| \varPi _\mathfrak {R}({\mathbf{b}}^*) \right\| _2 \le \sqrt{\mu }(\left\| \varPi _{B^+}(\mathbf{p}) \right\| _2 + \left\| \varPi _{{B}^*}(\mathbf{p}) \right\| _2) \le 2\mu \cdot \left\| \mathbf{p} \right\| _2 \end{aligned}$$

This gives us, upon applying contraction due to incoherence,

$$\begin{aligned} \left\| {\mathbf{p}^+} \right\| _2&\le \left\| \varPi _A(\varPi _\mathfrak {R}({\mathbf{b}}^*)) \right\| _2 + \left\| \varPi _A(\varPi _{B^+}(\mathbf{p})) \right\| _2\\&\le 3\mu \cdot \left\| \mathbf{p} \right\| _2 \end{aligned}$$

Thus, in the known signal support case, APIS offers a linear rate of convergence whenever \(\mu < \frac{1}{3}\). Now, APIS initializes \(\mathbf{a}^0 = \mathbf{0}\) which means that initially, we have

$$\begin{aligned} \mathbf{p}^1 = \varPi _A({\mathbf{b}}^*- \mathbf{b}^1) = \varPi _A({\mathbf{b}}^*- \varPi _{B^1}({\mathbf{a}}^*+ {\mathbf{b}}^*)) \end{aligned}$$

and thus, \(\left\| \mathbf{p}^1 \right\| _2 \le \left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2\) since projections are always non-expansive. The linear rate of convergence implies that within \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right)\) iterations, we will have \(\left\| \mathbf{p}^T \right\| _2 \le \epsilon\). Using our earlier observation \(\left\| \mathbf{a}^T - {\mathbf{a}}^* \right\| _2 = \left\| \mathbf{p}^T \right\| _2\) then finishes the proof.
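For concreteness, the final step can be unrolled as follows (a routine calculation, using the contraction \(\left\| \mathbf{p}^{t+1} \right\| _2 \le 3\mu \cdot \left\| \mathbf{p}^{t} \right\| _2\) established above together with the bound on \(\left\| \mathbf{p}^1 \right\| _2\)):

$$\begin{aligned} \left\| \mathbf{a}^T - {\mathbf{a}}^* \right\| _2 = \left\| \mathbf{p}^T \right\| _2 \le (3\mu )^{T-1}\left\| \mathbf{p}^1 \right\| _2 \le (3\mu )^{T-1}\left( \left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2\right) \le \epsilon , \end{aligned}$$

which holds whenever \(T \ge 1 + \log \left( \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right) \Big / \log \frac{1}{3\mu }\). Since \(3\mu < 1\) is a constant, \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right)\) iterations indeed suffice.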

A.2 Convergence analysis for the general case, i.e. both \(P, Q \ge 1\)

We now present the proof in the general case.

Lemma 2

Suppose we obtain data as described in Eq. (1) where the two unions \(\mathscr {A}, \mathscr {B}\) are \(\mu\)-incoherent with \(\mu < \frac{1}{9}\) (we allow both \(P, Q > 1\) in this case). Then, for any \(\epsilon > 0\), within \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right)\) iterations, APIS offers \(\left\| \mathbf{a}^T - {\mathbf{a}}^* \right\| _2 \le \epsilon\).

Proof

As before, to simplify notation, we denote \(\mathbf{a}^t {=}{:} \mathbf{a}, \mathbf{b}^t {=}{:} \mathbf{b}, \mathbf{a}^{t+1} {=}{:} {\mathbf{a}^+}, \mathbf{b}^{t+1} {=}{:} {\mathbf{b}^+}, A^{t+1} {=}{:} {A^+}, B^{t+1} {=}{:} {B^+}\). Let \(\mathfrak {Q}{:}{=} {B^+}\cap {B}^*\) denote the meet of the two subspaces, as well as denote the symmetric difference subspaces \(\mathfrak {P}{:}{=} {B^+}\cap ({B}^*)^\perp\) and \(\mathfrak {R}= {B}^*\cap ({B^+})^\perp\) (recall that \({B}^*\ni {\mathbf{b}}^*\)). Also let \({\mathfrak {M}}{:}{=} {A^+}\cap {A}^*\) denote the meet of the two subspaces, as well as denote the symmetric difference subspaces \({\mathfrak {L}}{:}{=} {A^+}\cap ({A}^*)^\perp\) and \({\mathfrak {N}}= {A}^*\cap ({A^+})^\perp\) (recall that \({A}^*\ni {\mathbf{a}}^*\)). We also introduce the additional notation \(p {:}{=} \max _{A \in \mathscr {A}}\ \left\| \varPi _A({\mathbf{b}}^*- \mathbf{b}) \right\| _2, p^+ {:}{=} \max _{A \in \mathscr {A}}\ \left\| \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+}) \right\| _2\) as well as \(q {:}{=} \max _{B \in \mathscr {B}}\ \left\| \varPi _B({\mathbf{a}}^*- \mathbf{a}) \right\| _2, q^+ {:}{=} \max _{B \in \mathscr {B}}\ \left\| \varPi _B({\mathbf{a}}^*- {\mathbf{a}^+}) \right\| _2\).

Note that the update step gives us \({\mathbf{a}^+}= \varPi _{A^+}({\mathbf{a}}^*+ {\mathbf{b}}^*- {\mathbf{b}^+})\), so that

$$\begin{aligned} {\mathbf{a}^+}-{\mathbf{a}}^*= \varPi _{\mathfrak {N}}({\mathbf{a}}^*) + \varPi _{A^+}({\mathbf{b}^+}- {\mathbf{b}}^*), \end{aligned}$$

i.e.

$$\begin{aligned} \left\| {\mathbf{a}^+}-{\mathbf{a}}^* \right\| _2 \le \left\| \varPi _{\mathfrak {N}}({\mathbf{a}}^*) \right\| _2 + \left\| \varPi _{A^+}({\mathbf{b}^+}- {\mathbf{b}}^*) \right\| _2 \le \left\| \varPi _{\mathfrak {N}}({\mathbf{a}}^*) \right\| _2 + p^+, \end{aligned}$$

by applying the triangle inequality. A similar analysis of the projection step, as we did to analyze the special case for \(P = 1\), then gives us

$$\begin{aligned} \left\| \varPi _{\mathfrak {N}}({\mathbf{a}}^*) \right\| _2 \le \left\| \varPi _{A^+}({\mathbf{b}^+}- {\mathbf{b}}^*) \right\| _2 + \left\| \varPi _{{A}^*}({\mathbf{b}^+}- {\mathbf{b}}^*) \right\| _2 \le 2p^+, \end{aligned}$$

giving us

$$\begin{aligned} \left\| {\mathbf{a}^+}-{\mathbf{a}}^* \right\| _2 \le 3p^+. \end{aligned}$$

We now show that we have \(p^+ \le 9\mu \cdot p\), i.e. the quantity p decreases at a linear rate whenever \(\mu < \frac{1}{9}\). Since the update step gives us \({\mathbf{b}^+}= \varPi _{B^+}({\mathbf{b}}^*+ {\mathbf{a}}^*- \mathbf{a})\), an analysis similar to the one done for the special case \(P = 1\) gives us, for any \(A \in \mathscr {A}\),

$$\begin{aligned} \left\| \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+}) \right\| _2&\le \left\| \varPi _A(\varPi _\mathfrak {R}({\mathbf{b}}^*)) \right\| _2 + \left\| \varPi _A(\varPi _{B^+}(\mathbf{a}- {\mathbf{a}}^*)) \right\| _2\\&\le \sqrt{\mu }(\left\| \varPi _\mathfrak {R}({\mathbf{b}}^*) \right\| _2 + q). \end{aligned}$$

Going as before also gives us

$$\begin{aligned} \left\| \varPi _\mathfrak {R}({\mathbf{b}}^*) \right\| _2 \le \left\| \varPi _{B^+}(\mathbf{a}- {\mathbf{a}}^*) \right\| _2 + \left\| \varPi _{{B}^*}(\mathbf{a}- {\mathbf{a}}^*) \right\| _2 \le 2q, \end{aligned}$$

and thus, putting the results together gives us

$$\begin{aligned} p^+ = \max _{A \in \mathscr {A}}\ \left\| \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+}) \right\| _2 \le 3\sqrt{\mu }\cdot q. \end{aligned}$$

Since the updates w.r.t. \(\mathbf{a}\) and \(\mathbf{b}\) are completely symmetric, the same analysis applied to the update that produced \(\mathbf{a}\) from \(\mathbf{b}\) gives \(q \le 3\sqrt{\mu }\cdot p\) and consequently, \(p^+ \le 9\mu \cdot p\). Thus, APIS offers a linear rate of convergence in the general case whenever \(\mu < \frac{1}{9}\). A similar analysis as before confirms \(p^1 = \max _{A \in \mathscr {A}}\ \left\| \varPi _A({\mathbf{b}}^*- \mathbf{b}^1) \right\| _2 = \mathscr {O}\left( \left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2\right)\) and that within \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right)\) iterations, we would have \(p^T \le \frac{\epsilon }{3}\). Since we already saw above that \(\left\| \mathbf{a}^T - {\mathbf{a}}^* \right\| _2 \le 3p^T\), this confirms the upper bound on the number of iterations required.

B Robust linear regression using APIS

We recall that in this case, we have known signal support i.e. \(P = 1\) with \(\mathscr {A}= A\) being the row span of the covariate matrix \(X \in {\mathbb{R}}^{d \times n}\) and \(\mathscr {B}\) being the union of subspaces of k-sparse vectors.

Lemma 3

If the corruption vectors are (adaptive adversarial) k-sparse vectors and the covariates \(\mathbf{x}^i \in {\mathbb{R}}^d, i \in [n]\) are sampled i.i.d. from a standard Gaussian i.e. \(\mathbf{x}^i \sim \mathscr {N}(\mathbf{0}, I_d)\) and \(n = \varOmega \left( d\right)\), then with probability at least \(1 - \frac{1}{d^2}\), APIS offers exact recovery at a linear rate if \(k < \frac{n}{154}\). Moreover, the projection operation \(\varPi _\mathscr {A}(\cdot )\) can be performed in \(\mathscr {O}\left( nd\right)\) time in this case.

Proof

Let \(V \in {\mathbb{R}}^{n \times d}\) denote the matrix whose columns are the d right singular vectors of the covariate matrix \(X \in {\mathbb{R}}^{d \times n}\). Then the projection operator \(\varPi _A\) is given as \(\varPi _A(\mathbf{z}) = VV^\top \mathbf{z}\) where \(VV^\top = X^\top (XX^\top )^\dagger X\). Note that this can also be accomplished by simply solving a least squares problem, which can be done in \(\mathscr {O}\left( nd\right)\) time using (conjugate or stochastic) gradient descent techniques. This settles the time complexity of the projection operation \(\varPi _\mathscr {A}(\cdot )\). However, \(\varPi _A(\mathbf{z}) = VV^\top \mathbf{z}\) also gives us the following expression for the SU-incoherence constant.

$$\begin{aligned} \mu = \max _{\begin{array}{c} \mathbf{u},\mathbf{v}\in S^{n-1}\\ \left\| \mathbf{v} \right\| _0 \le k \end{array}}\left( \mathbf{v}^\top VV^\top \mathbf{u}\right) ^2 \le \max _{\begin{array}{c} \mathbf{v}\in S^{n-1}\\ \left\| \mathbf{v} \right\| _0 \le k \end{array}}\left\| VV^\top \mathbf{v} \right\| _2^2 = \max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\frac{\left\| X_S \right\| _2^2}{\lambda _{\min }(XX^\top )}. \end{aligned}$$

The above constants are readily available from prior works e.g. TORRENT (Bhatia et al. 2015) and are reproduced here (see Bhatia et al. 2015, Lemma 14 and Theorem 15). For any \(\delta \in (0,1)\), with probability at least \(1- \delta\), we have

  1. \(\lambda _{\min }(XX^\top ) \ge n - 3\sqrt{513dn + 178n\log \frac{2}{\delta }}\)

  2. \(\max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\left\| X_S \right\| ^2_2 \le k\left( 1+3e\sqrt{6\log \frac{en}{k}}\right) + 3\sqrt{513dk + 178k\log \frac{1}{\delta }}\)

To simplify the above bounds, we set \(\delta = \frac{1}{d^2}\) and notice that for large enough d, we have \(\log (2d^2) < \frac{d}{100}\) so that we get \(\lambda _{\min }(XX^\top ) \ge n - 3\sqrt{515dn}\) and \(\max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\left\| X_S \right\| ^2_2 \le k\left( 1+3e\sqrt{6\log \frac{en}{k}}\right) + 3\sqrt{515dk}\), each with confidence at least \(1 - \frac{1}{d^2}\). We also assume that n is large enough so that \(\sqrt{515dn}< \frac{n}{300}\) (\(n > 300^2\cdot 515\cdot d\) i.e. \(n = \varOmega \left( d\right)\) suffices to ensure this) so that we get \(\lambda _{\min }(XX^\top ) \ge \frac{99n}{100}\) and \(\max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\left\| X_S \right\| ^2_2 \le k\left( \frac{101}{100} + 3e\sqrt{6\log \frac{en}{k}}\right)\).

This gives us

$$\begin{aligned} \mu \le \left( \frac{100}{99}\right) \frac{k}{n}\left( \frac{101}{100} + 3e\sqrt{6\log \frac{en}{k}}\right) \end{aligned}.$$

Elementary calculations show that we have \(\mu < \frac{1}{3}\) whenever \(k \le \frac{n}{154}\). Since Theorem 1 assures a linear rate of convergence for APIS in the known support case whenever \(\mu < \frac{1}{3}\), this finishes the proof.
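To make this instantiation concrete, here is a minimal Python sketch (ours, not the authors' released code) of APIS specialized to robust linear regression: \(\varPi _\mathscr {A}\) is implemented as a least-squares solve onto the row span of X and \(\varPi _\mathscr {B}\) as hard thresholding onto k-sparse vectors. The variable names and the toy data at the end are purely illustrative.

```python
import numpy as np

def robust_linear_regression_apis(X, y, k, n_iters=200):
    """Minimal sketch of APIS for robust linear regression.
    X: d x n covariate matrix, y: length-n corrupted responses,
    k: assumed number of corrupted responses."""
    d, n = X.shape
    a = np.zeros(n)                      # estimate of the clean signal X^T w*
    b = np.zeros(n)                      # estimate of the sparse corruption
    w = np.zeros(d)
    for _ in range(n_iters):
        # Pi_B: hard threshold the residual onto the union of k-sparse subspaces
        r = y - a
        b = np.zeros(n)
        idx = np.argsort(np.abs(r))[-k:]
        b[idx] = r[idx]
        # Pi_A: project onto the row span of X via a least-squares solve
        w, *_ = np.linalg.lstsq(X.T, y - b, rcond=None)
        a = X.T @ w
    return w, a, b

# Illustrative usage with Gaussian covariates and k-sparse corruptions
rng = np.random.default_rng(0)
d, n, k = 20, 4000, 10                   # n = Omega(d) and k well below n/154
X = rng.standard_normal((d, n))
w_star = rng.standard_normal(d)
b_star = np.zeros(n)
b_star[rng.choice(n, k, replace=False)] = 10.0
y = X.T @ w_star + b_star
w_hat, _, _ = robust_linear_regression_apis(X, y, k)
print(np.linalg.norm(w_hat - w_star))    # should shrink towards 0 with iterations
```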

We note that similar breakdown points can be obtained even if the data covariates come from other nice distributions, for example sub-Gaussian distributions, which include all distributions with bounded support, arbitrary (non-standard) Gaussian distributions, mixtures of Gaussian distributions, and many more. The following result shows that APIS offers a linear rate of convergence even in this general setting; however, the breakdown point is less explicit due to the generality of the result.

Lemma 4

If the corruption vectors are (adaptive adversarial) k-sparse vectors and the covariates \(\mathbf{x}^i \in {\mathbb{R}}^d, i \in [n]\) are sampled i.i.d. from a sub-Gaussian distribution with sub-Gaussian norm R and covariance matrix \(\varSigma \in {\mathbb{R}}^{d \times d}\), and \(n = \varOmega \left( d\right)\), then with probability at least \(1 - \frac{1}{d^2}\), APIS offers exact recovery at a linear rate if \(k < \frac{n}{\mathscr {O}\left( 1\right) }\). The constants hidden in the \(\mathscr {O}\left( \cdot \right) , \varOmega \left( \cdot \right)\) notations used in this statement are either universal or depend only on the sub-Gaussian norm R of the distribution.

Proof

We note that the projection operation \(\varPi _\mathscr {A}(\cdot )\) can still be performed in \(\mathscr {O}\left( nd\right)\) time in this case (by solving a least squares problem). As before, we have

$$\begin{aligned} \mu = \max _{\begin{array}{c} \mathbf{u},\mathbf{v}\in S^{n-1}\\ \left\| \mathbf{v} \right\| _0 \le k \end{array}}\left( \mathbf{v}^\top VV^\top \mathbf{u}\right) ^2 \le \max _{\begin{array}{c} \mathbf{v}\in S^{n-1}\\ \left\| \mathbf{v} \right\| _0 \le k \end{array}}\left\| VV^\top \mathbf{v} \right\| _2^2 = \max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\frac{\left\| X_S \right\| _2^2}{\lambda _{\min }(XX^\top )}, \end{aligned}$$

where \(VV^\top = X^\top (XX^\top )^\dagger X\). For the case of sub-Gaussian distributions, the following relevant results are available (see Bhatia et al. 2015, Lemma 16 and Theorem 17). For any \(\delta \in (0,1)\), with probability at least \(1- \delta\), we have the following, where c, C are universal constants that depend only on the sub-Gaussian norm R of the distribution.

  1. \(\lambda _{\min }(XX^\top ) \ge n \cdot \lambda _{\min }(\varSigma ) - C\cdot \sqrt{dn} - \sqrt{\frac{n}{c}\log \frac{2}{\delta }}\)

  2. \(\max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\left\| X_S \right\| ^2_2 \le k\left( \lambda _{\max }(\varSigma ) + \sqrt{\frac{n}{ck}\log \frac{en}{k}}\right) + C\cdot \sqrt{kd} + \sqrt{\frac{n}{c}\log \frac{2}{\delta }}\)

As before, to simplify the above bounds, we set \(\delta = \frac{1}{d^2}\) and notice that for large enough d and \(n = \varOmega \left( d\right)\), we have \(\lambda _{\min }(XX^\top ) \ge \frac{99n}{100}\cdot \lambda _{\min }(\varSigma )\) and \(\max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\left\| X_S \right\| ^2_2 \le k\left( \lambda _{\max }(\varSigma )\cdot \frac{101}{100} + \sqrt{\frac{n}{ck}\log \frac{en}{k}}\right)\) which gives us

$$\begin{aligned} \mu \le \left( \frac{100}{99}\right) \frac{k}{n}\left( \frac{\lambda _{\max }(\varSigma )}{\lambda _{\min }(\varSigma )}\cdot \frac{101}{100} + \frac{1}{\lambda _{\min }(\varSigma )}\sqrt{\frac{n}{ck}\log \frac{en}{k}}\right) \end{aligned}.$$

Assuming w.l.o.g. \(\lambda _{\max }(\varSigma ) \ge 1\) and denoting \({\boldsymbol{\kappa }}{:}{=} \frac{\lambda _{\max }(\varSigma )}{\lambda _{\min }(\varSigma )}\) as the condition number of the covariance matrix \(\varSigma\) gives us

$$\begin{aligned} \mu \le \mathscr {O}\left( {\boldsymbol{\kappa }}\cdot \frac{k}{n}\left( 1 + \sqrt{\frac{n}{k}\log \frac{en}{k}}\right) \right) , \end{aligned}$$

which can be shown to assure \(\mu < \frac{1}{3}\) when \(k \le \mathscr {O}\left( \frac{n}{{\boldsymbol{\kappa }}}\right)\). Now notice that the above breakdown point depends on the condition number of the covariance matrix. This dependence is superfluous and can be removed, as we show below.

Notice that if we let \({\tilde{X}}= \varSigma ^{-\frac{1}{2}}X\) where X is the covariate matrix used by the algorithm and \(\varSigma\) is the covariance matrix of the distribution generating the covariates, then we have

$$\begin{aligned} VV^\top = X^\top (XX^\top )^\dagger X = {\tilde{X}}^\top ({\tilde{X}}{\tilde{X}}^\top )^\dagger {\tilde{X}}\end{aligned},$$

where \({\tilde{X}}\) is now a matrix of covariates assumed to be sampled from a (still) sub-Gaussian distribution but with identity covariance. This allows us to use the following improved upper bound on the incoherence constant

$$\begin{aligned} \mu = \max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\frac{\left\| {\tilde{X}}_S \right\| _2^2}{\lambda _{\min }({\tilde{X}}{\tilde{X}}^\top )}, \end{aligned}$$

as well as

  1. \(\lambda _{\min }({\tilde{X}}{\tilde{X}}^\top ) \ge n - C\cdot \sqrt{dn} - \sqrt{\frac{n}{c}\log \frac{2}{\delta }}\)

  2. \(\max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\left\| {\tilde{X}}_S \right\| ^2_2 \le k\left( 1+ \sqrt{\frac{n}{ck}\log \frac{en}{k}}\right) + C\cdot \sqrt{kd} + \sqrt{\frac{n}{c}\log \frac{2}{\delta }}\)

The above in turn give us

$$\begin{aligned} \mu \le \mathscr {O}\left( \frac{k}{n}\left( 1 + \sqrt{\frac{n}{k}\log \frac{en}{k}}\right) \right) , \end{aligned}$$

which can be shown to assure \(\mu < \frac{1}{3}\) when \(k < \frac{n}{\mathscr {O}\left( 1\right) }\) where the constants hidden in the \(\mathscr {O}\left( \cdot \right)\) notation are either universal or depend only on the sub-Gaussian norm R of the distribution. Note that the algorithm does not need to know \(\varSigma\) at all (either exactly or even approximately) for the above trick to work. The algorithm can continue to perform the \(\varPi _\mathscr {A}(\cdot )\) projections using \(VV^\top = X^\top (XX^\top )^\dagger X\) but the analysis uses the (equivalent) \(VV^\top = {\tilde{X}}^\top ({\tilde{X}}{\tilde{X}}^\top )^\dagger {\tilde{X}}\) instead.
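The claimed invariance is also easy to verify numerically; the following snippet (ours, purely illustrative) checks that the projector \(VV^\top\) computed from X coincides with the one computed from the whitened covariates \({\tilde{X}}= \varSigma ^{-\frac{1}{2}}X\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 40
L = rng.standard_normal((d, d))
Sigma = L @ L.T + np.eye(d)                       # a non-identity covariance
X = np.linalg.cholesky(Sigma) @ rng.standard_normal((d, n))

def row_span_projector(M):
    """Projector onto the row span of M, i.e. M^T (M M^T)^+ M."""
    return M.T @ np.linalg.pinv(M @ M.T) @ M

# Whitened covariates X_tilde = Sigma^{-1/2} X
evals, evecs = np.linalg.eigh(Sigma)
Sigma_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
X_tilde = Sigma_inv_half @ X

P_raw = row_span_projector(X)
P_whitened = row_span_projector(X_tilde)
print(np.max(np.abs(P_raw - P_whitened)))         # ~1e-12: the projectors coincide
```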

C Robust low-rank kernel regression using APIS

We recall that in this case, the uncorrupted signal satisfies \({\mathbf{a}}^*= G{{\boldsymbol{\alpha }}}^*\) where \(G \in {\mathbb{R}}^{n \times n}\) is the Gram matrix with \(G_{ij} = K(\mathbf{x}^i,\mathbf{x}^j)\) corresponding to a Mercer kernel \(K: {\mathbb{R}}^d \times {\mathbb{R}}^d \rightarrow {\mathbb{R}}\) such as the RBF kernel. Moreover, \({{\boldsymbol{\alpha }}}^*\) belongs to the span of some s eigenvectors of G, i.e. \({{\boldsymbol{\alpha }}}^*= V{\boldsymbol{\gamma }}^*\) where \(\left\| {\boldsymbol{\gamma }}^* \right\| _0 \le s\) and \(V = [\mathbf{v}^1,\ldots ,\mathbf{v}^r] \in {\mathbb{R}}^{n \times r}\) is the matrix of eigenvectors of G and r is the rank of G. As we will see, APIS offers the strongest guarantees when \({{\boldsymbol{\alpha }}}^*\in \text {span}(\mathbf{v}^1,\ldots ,\mathbf{v}^s)\), i.e., when \({{\boldsymbol{\alpha }}}^*\) lies in the span of the top s eigenvectors.

Thus, in this case, we have known signal support i.e. \(P = 1\) with \(\mathscr {A}= A\) being the span of the top s eigenvectors of G and \(\mathscr {B}\) being the union of subspaces of k-sparse vectors. Here we derive breakdown points for the case of kernel ridge regression. Lemma 5 presents this result for general Mercer kernels, whereas Lemma 6 will yield a specific breakdown point for the special case of the RBF kernel.

Lemma 5

If the corruption vectors are (adaptive adversarial) k-sparse vectors and the uncorrupted signal lies in the span of the top s eigenvectors of a Gram matrix G corresponding to a Mercer kernel, then APIS offers exact recovery at a linear rate if \(3\cdot \Lambda ^{\text {unif}}_k(G) < \lambda _s(G)\) where \(\lambda _s(G)\) is the \(s^\mathrm{th}\)-largest eigenvalue of G and for any \(k > 0\), \(\Lambda ^{\text {unif}}_k(G)\) denotes the largest eigenvalue of any principal \(k \times k\) sub-matrix of G. Moreover, the projection operation \(\varPi _\mathscr {A}(\cdot )\) can be performed in \(\mathscr {O}\left( ns\right)\) time in this case apart from a one-time cost of \(\mathscr {O}\left( n^2s\right)\).

Proof

Let \(\mathbf{v}^1,\ldots ,\mathbf{v}^s \in {\mathbb{R}}^n\) be the top-s eigenvectors of G i.e. \({\tilde{V}}= [\mathbf{v}^1,\ldots ,\mathbf{v}^s] \in {\mathbb{R}}^{n \times s}\). Also, let the diagonal matrix containing the corresponding top-s eigenvalues \(\lambda _1 \ge \lambda _2 \ge \ldots \ge \lambda _s > 0\) be denoted by \({\tilde{\varSigma }}= {{\,\mathrm{diag}\,}}(\lambda _1,\ldots ,\lambda _s) \in {\mathbb{R}}^{s\times s}\). The time complexity of the projection step is settled by noting that the projection operator \(\varPi _A\) is given as \(\varPi _A(\mathbf{z}) = {\tilde{V}}{\tilde{V}}^\top \mathbf{z}\). Calculating \({\tilde{V}}\) takes a one-time cost of \(\mathscr {O}\left( n^2s\right)\) whereas applying the projection operator requires two multiplications with an \(n \times s\) matrix which takes \(\mathscr {O}\left( ns\right)\) time.
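Before continuing with the incoherence analysis, the following is a minimal Python sketch (ours, with illustrative names) of APIS specialized to this setting: \(\varPi _A\) is multiplication by \({\tilde{V}}{\tilde{V}}^\top\) with \({\tilde{V}}\) the top-s eigenvectors of G, and \(\varPi _\mathscr {B}\) is hard thresholding onto k-sparse vectors.

```python
import numpy as np

def robust_kernel_regression_apis(G, y, s, k, n_iters=200):
    """Minimal sketch of APIS for robust low-rank kernel regression.
    G: n x n PSD Gram matrix, y: length-n corrupted responses,
    s: number of top eigenvectors assumed to span the clean signal,
    k: assumed number of corrupted responses."""
    # Full eigendecomposition for simplicity; only the top-s eigenvectors are
    # needed, which matches the one-time cost discussed in the text.
    evals, evecs = np.linalg.eigh(G)
    V_s = evecs[:, -s:]                       # top-s eigenvectors of G
    a = np.zeros_like(y)
    b = np.zeros_like(y)
    for _ in range(n_iters):
        # Pi_B: hard threshold the residual onto k-sparse vectors
        r = y - a
        b = np.zeros_like(y)
        idx = np.argsort(np.abs(r))[-k:]
        b[idx] = r[idx]
        # Pi_A: multiply by V_s V_s^T, i.e. two n x s multiplications per call
        a = V_s @ (V_s.T @ (y - b))
    return a, b
```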

Consider the matrix \({\tilde{X}}= {\tilde{\varSigma }}^{-\frac{1}{2}}{\tilde{V}}^\top G\). It is easy to see that

$$\begin{aligned} {\tilde{X}}^\top ({\tilde{X}}{\tilde{X}}^\top )^{-1}{\tilde{X}}= {\tilde{V}}{\tilde{V}}^\top \end{aligned}.$$

Notice the parallels between the above and a similar expression derived in the linear regression case in the proof of Lemma 3. An identical analysis then gives us

$$\begin{aligned} \mu = \max _{\begin{array}{c} \mathbf{u},\mathbf{v}\in S^{n-1}\\ \left\| \mathbf{v} \right\| _0 \le k \end{array}}\left( \mathbf{v}^\top {\tilde{V}}{\tilde{V}}^\top \mathbf{u}\right) ^2 \le \max _{\begin{array}{c} \mathbf{v}\in S^{n-1}\\ \left\| \mathbf{v} \right\| _0 \le k \end{array}}\left\| {\tilde{V}}{\tilde{V}}^\top \mathbf{v} \right\| _2^2 = \max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\frac{\left\| {\tilde{X}}_S \right\| _2^2}{\lambda _{\min }({\tilde{X}}{\tilde{X}}^\top )} \end{aligned}.$$

Now, clearly we have \(\lambda _{\min }({\tilde{X}}{\tilde{X}}^\top ) \ge \lambda _s\) which lower bounds the denominator in the last expression. To upper bound the numerator, notice that \({\tilde{X}}_S = [{\tilde{\mathbf{x}}}^i]_{i \in S} \in {\mathbb{R}}^{s \times k}\) where \({\tilde{\mathbf{x}}}^i = {\tilde{\varSigma }}^{-\frac{1}{2}}{\tilde{V}}^\top G_i\) and \(G_i\) is the \(i^\mathrm{th}\) column of the matrix G. Now consider \({\hat{\mathbf{x}}}^i = \varSigma ^{-\frac{1}{2}}V^\top G_i\) where \(\varSigma \in {\mathbb{R}}^{r \times r}\) is the diagonal matrix of all the eigenvalues of G, not just the top-s ones (assuming G is of rank r) and \(V \in {\mathbb{R}}^{n \times r}\) is the matrix of all the eigenvectors of G, and let \({\hat{X}}= [{\hat{\mathbf{x}}}^i]_{i = 1}^n \in {\mathbb{R}}^{r \times n}\) and, in particular, \({\hat{X}}_S = [{\hat{\mathbf{x}}}^i]_{i \in S} \in {\mathbb{R}}^{r \times k}\).

Since \({\tilde{X}}_S\) is a projection of \({\hat{X}}_S\) onto the top-s eigenvectors of G, we conclude that \(\left\| {\tilde{X}}_S \right\| _2^2 \le \left\| {\hat{X}}_S \right\| _2^2\). However, notice that \({\hat{X}}^\top {\hat{X}}= G\) and thus, \(\left\| {\hat{X}}_S \right\| _2^2\) is upper bounded by the largest eigenvalue of the principal sub-matrix \(G^S_S\). Thus, we have

$$\begin{aligned} \mu \le \max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\frac{\left\| {\tilde{X}}_S \right\| _2^2}{\lambda _{\min }({\tilde{X}}{\tilde{X}}^\top )} \le \frac{\Lambda ^{\text {unif}}_k(G)}{\lambda _s(G)} \end{aligned}.$$

This concludes the proof upon noting that we get \(\mu < \frac{1}{3}\) as desired by Theorem 1 for APIS to offer exact recovery at a linear rate whenever \(3\cdot \Lambda ^{\text {unif}}_k(G) < \lambda _s(G)\).

We note that the above proof does not anywhere use the fact that the signal has support only among the top-s eigenvectors of the Gram matrix. However, if we start considering other sets of s eigenvectors as possible supports, we run into adverse incoherence constants. Specifically, if the set of eigenvectors also contains the eigenvector corresponding to the smallest eigenvalue of G, then we would have

$$\begin{aligned} \mu \le \max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\frac{\left\| {\tilde{X}}_S \right\| _2^2}{\lambda _{\min }(G)} \end{aligned}.$$

Notice that the denominator now has \(\lambda _{\min }(G)\) instead of \(\lambda _s(G)\). Since the eigenvalues of Gram matrices w.r.t popular kernels such as RBF decay rapidly (see proof of Lemma 6 below), this would mean that \(\mu\) could take a very large value and it may be impossible to satisfy \(\mu < \frac{1}{9}\) no matter how small the value of k. That is why we restrict the support to the top-s eigenvectors. However, Appendix E.1 shows that APIS offers recovery even if signals are not totally represented by the top-s eigenvectors but merely well-approximated by them.

C.1 Breakdown point derivations for the RBF kernel

Our goal in this discussion will be to establish the following breakdown point result for robust kernel ridge regression settings.

Lemma 6

If the corruption vectors are (adaptive adversarial) k-sparse vectors, the uncorrupted signal lies in the span of the top s eigenvectors of a Gram matrix G corresponding to the RBF kernel \(\kappa (\mathbf{x},\mathbf{y}) = \exp \left( -\frac{\left\| \mathbf{x}-\mathbf{y} \right\| _2^2}{h^2}\right)\) with \(\mathbf{x}, \mathbf{y}\in {\mathbb{R}}^d\) for \(d > 1\) and h being the bandwidth parameter of the kernel, and the data covariates \(\mathbf{x}^1,\ldots ,\mathbf{x}^n \in {\mathbb{R}}^d\) are sampled from the uniform distribution over the unit sphere \(S^{d-1}\), then APIS offers exact recovery at a linear rate in the following settings. We note that these conditions are neither exhaustive nor necessary but merely some sufficient conditions under which recovery is guaranteed by APIS.

  1. Case 1: \(d = 2\), \(s \ge e\): if \(k \le \sqrt{n}, s^s \le n^{\frac{1}{5}}\) (i.e. \(s \le \mathscr {O}\left( \log n/\log \log n\right)\)), and \(h \in \left[ \sqrt{\frac{40}{\log n}}, \frac{1.13}{(20.4)^{\frac{2.5}{\log n}}}\right]\), then with probability at least \(1 - 4\exp (-n^{\frac{2}{5}})\), we have \(3\cdot \Lambda ^{\text {unif}}_k(G) < \lambda _s(G)\) i.e. \(\mu < \frac{1}{3}\) as guaranteed by Lemma 5.

  2. Case 2: \(d > 2\), \(s \ge e\): if \(k \le \sqrt{n}, \frac{(s+\frac{d}{2}-1)^{s+\frac{d}{2}-1}}{\exp (s- 1)\left( \frac{d}{2}\right) ^{\frac{d+1}{2}}} \le n^{\frac{1}{5}}\), and \(h \in \left[ \sqrt{\frac{40}{\log n}}, \left( \frac{n^{\frac{1}{20}}}{18.8}\right) ^{\frac{1}{2s}}\right]\), then with probability at least \(1 - 4\exp (-n^{\frac{2}{5}})\), we have \(3\cdot \Lambda ^{\text {unif}}_k(G) < \lambda _s(G)\) i.e. \(\mu < \frac{1}{3}\) as guaranteed by Lemma 5.

Note that in both cases, the range of bandwidths for which recovery is ensured expands with n. For example, in the \(d = 2\) case, in the limit \(n \rightarrow \infty\), the range expands to [0, 1.13] since the exponent \(\frac{2.5}{\log n} \rightarrow 0\) and hence \((20.4)^{\frac{2.5}{\log n}} \rightarrow 1\) as \(n \rightarrow \infty\).

Sample complexity. Before giving derivations for the above results, we put in a word about the sample complexity.

  1. Case 1: \(d = 2\), \(s \ge e\): \(n = \varOmega \left( 1\right)\) samples and \(s \le \mathscr {O}\left( \log n/\log \log n\right)\) clearly suffice in this case.

  2. Case 2: \(d > 2\), \(s \ge e\): we first simplify the expression \(\frac{(s+\frac{d}{2}-1)^{s+\frac{d}{2}-1}}{\exp (s- 1)\left( \frac{d}{2}\right) ^{\frac{d+1}{2}}}\) using simple inequalities such as \((x+y)^p \le 2^p(x^p + y^p)\) for any \(x, y \in {\mathbb{R}}_+, p \in \mathbb {N}\) to obtain the following inequality (using the shorthand \(D {:}{=} \frac{d}{2} - 1\) to avoid clutter)

    $$\begin{aligned} 2^D\left( D^s + \frac{s^s}{D^D} + \left( \frac{s}{D}\right) ^D\right) < n^{\frac{1}{5}} \end{aligned}$$

    Simple calculations show that \(n = (\varOmega \left( 1\right) )^d\) as well as \(s < \mathscr {O}\left( \log _d(n)\right)\) suffice to satisfy the above requirement.

Note that in both cases, we can tolerate up to \(k \le \sqrt{n}\) corruptions.

C.2 Some pre-calculations

Let the data points \(\{\mathbf{x}^i\}\) be sampled from the uniform distribution over \(S^{d-1}\) with \(d > 1\), and consider the RBF kernel \(\kappa (\mathbf{x}^i, \mathbf{x}^j)=\exp \left(- \frac{\left\| \mathbf{x}^i - \mathbf{x}^j \right\| ^2}{h^2}\right)\). Let \(\pi _r\) be the \(r^{th}\)-largest distinct eigenvalue, \(r\in \mathbb {N}\cup \{0\}\), of the integral transform operator corresponding to the kernel function \(\kappa\); then (Minh et al. 2006, Theorem 2) states

$$\begin{aligned} \pi _r= \exp \left( -\frac{2}{h^2}\right) h^{d-2}I_{r+\frac{d}{2}-1}\left( \frac{2}{h^2}\right) \Gamma \left( \frac{d}{2}\right) \end{aligned},$$
(3)

where, I denotes the modified Bessel function of the first kind. Here each \(\pi _r\) occurs with multiplicity \(\frac{(2r+d-2)(r+d-3)!}{r! (d-2)!}\). The eigenvalues also satisfy

$$\begin{aligned} \left( \frac{2e}{h^2}\right) ^r \frac{A_1}{(2r+d-2)^{r+\frac{d-1}{2}}}< \pi _r < \left( \frac{2e}{h^2}\right) ^r \frac{A_2}{(2r+d-2)^{r+\frac{d-1}{2}}} \end{aligned},$$
(4)

where \(A_1, A_2\), which are independent of r, are given as follows:

$$\begin{aligned} A_1= \frac{2^{\frac{d}{2} -1}}{\sqrt{\pi }}\exp \left( -\frac{2}{h^2} - \frac{1}{12}+ \frac{d}{2} -1\right) \Gamma \left( \frac{d}{2}\right) \\ A_2= \frac{2^{\frac{d}{2} -1}}{\sqrt{\pi }}\exp \left( -\frac{2}{h^2}+\frac{1}{h^4}+ \frac{d}{2} -1\right) \Gamma \left( \frac{d}{2}\right) \end{aligned},$$

with \(\Gamma\) denoting the Gamma function.
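For reference, the following snippet (ours, purely illustrative; the values \(d = 4\), \(h = 1\) are arbitrary example choices) evaluates Eq. (3) and the multiplicity formula using scipy, which makes the rapid decay of \(\pi _r\) with r easy to inspect.

```python
import numpy as np
from scipy.special import iv, gamma, factorial

def rbf_operator_eigenvalue(r, d, h):
    """pi_r from Eq. (3): the r-th largest distinct eigenvalue of the integral
    operator of the RBF kernel with bandwidth h, for data uniform on S^{d-1}."""
    return np.exp(-2.0 / h ** 2) * h ** (d - 2) * iv(r + d / 2 - 1, 2.0 / h ** 2) * gamma(d / 2)

def multiplicity(r, d):
    """Multiplicity of pi_r, (2r + d - 2)(r + d - 3)! / (r! (d - 2)!),
    as stated in the text (for d > 2)."""
    return (2 * r + d - 2) * factorial(r + d - 3) / (factorial(r) * factorial(d - 2))

# Example values: d = 4, h = 1.0 (illustrative only)
for r in range(6):
    print(r, rbf_operator_eigenvalue(r, d=4, h=1.0), multiplicity(r, d=4))
```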

Let \(\lambda _r^{(n)}\) be the \(r^{th}\)-largest eigenvalue of the \(n \times n\) Gram matrix \(G_{ij}=\kappa (\mathbf{x}^i,\mathbf{x}^j)\) over n data points. We have from Rosasco et al. (2010, Theorems 5 and 7) that for a normalized Mercer kernel with \(\kappa (\mathbf{x}^i, \mathbf{x}^i) \le 1\) (which the RBF kernel does satisfy), with probability \(1 - 2\exp (-\tau )\),

$$\begin{aligned} \left| {\lambda _r^{(n)} - n\pi _r} \right| \le 2 \sqrt{2n\tau }, \end{aligned}$$

Hence, for any given principal sub-matrix of size k,

$$\begin{aligned}&\Pr \left( \lambda _0^{(k)}> k\pi _0 + 2\sqrt{2k\tau _1}\right) \le 2e^{-\tau _1} \nonumber \\&\quad \implies \Pr \left( \bigcup _{\text {sub-matrices of size } k} \{\lambda _0^{(k)}> k\pi _0 + 2\sqrt{2k\tau _1}\}\right) \le {n \atopwithdelims ()k} 2e^{-\tau _1} \le \left( \frac{ne}{k}\right) ^{k} 2e^{-\tau _1}\nonumber \\&\quad \Longleftrightarrow \Pr \left( \Lambda _{k}^{unif}> k\pi _0 + 2\sqrt{2k\tau _1}\right) \le \left( \frac{ne}{k}\right) ^{k} 2e^{-\tau _1} {=}{:}\frac{\delta }{2} \nonumber \\&\quad \Longleftrightarrow \Pr \left( \Lambda _{k}^{unif} > k\pi _0 + 2\sqrt{2k\left( k\ln \left( \frac{ne}{k}\right) +\ln \frac{4}{\delta }\right) }\right) \le \frac{\delta }{2} \quad \text {putting, } \tau _1=k \ln \left( \frac{ne}{k}\right) + \ln \frac{4}{\delta }. \end{aligned}$$
(5)

Also we have,

$$\begin{aligned} \Pr \left( \lambda _s^{(n)} < n\pi _s - 2\sqrt{2n\ln \frac{4}{\delta }}\right) \le \frac{\delta }{2} \end{aligned}$$
(6)

Combining Eqs. (5) and (6):

$$\begin{aligned}&\Pr \left( \left[ 3\Lambda _{k}^{unif} > 3k\pi _0 + 6\sqrt{2k(k\ln \left( \frac{ne}{k}\right) +\ln \frac{4}{\delta })}\right] \cup \left[ \lambda _s^{(n)} < n\pi _s - 2\sqrt{2n\ln \frac{4}{\delta }}\right] \right) \le \delta \\&\quad \Longleftrightarrow \Pr \left( \left[ 3\Lambda _{k}^{unif} \le 3k\pi _0 + 6\sqrt{2k(k\ln \left( \frac{ne}{k}\right) +\ln \frac{4}{\delta })}\right] \cap \left[ n\pi _s - 2\sqrt{2n\ln \frac{4}{\delta }}\le \lambda _s^{(n)}\right] \right) \ge 1-\delta \\&\quad \Longleftrightarrow \Pr \left( 3\Lambda _{k}^{unif} \le \lambda _s^{(n)}\right) \ge 1-\delta \end{aligned}$$

whenever,

$$\begin{aligned}& 3k \pi _0 + 6 {\sqrt{2k\left(k\ln {\frac{ne}{k}}+\ln {\frac{4}{\delta }}\right)}} \le n\pi _s - 2{\sqrt{2n\ln {\frac{4}{\delta }}}} \nonumber \\ &\quad \Longleftrightarrow {\frac{3k}{n}}\left( \pi _0 + 2{\sqrt{2}}{\sqrt{\ln {\frac{ne}{k}}+{\frac{1}{k}}\ln {\frac{4}{\delta }}}}\right) + 2{\sqrt{{\frac{2}{n}}\ln {\frac{4}{\delta }}}}\le \pi _s \nonumber \\ &\quad \Longleftarrow \frac{3k}{n}\left( \pi _0 + 2{\sqrt{2}}{\sqrt{\ln {\frac{ne}{k}}}}\right) +2{\sqrt{2}}{\sqrt{{\frac{1}{n}}\ln {\frac{4}{\delta }}}}\left(3{\sqrt{{\frac{k}{n}}}}+1\right) \le \pi _s \quad {\text{using}}, \,{\sqrt{a+b}}\le {\sqrt{a}}+{\sqrt{b}} \end{aligned}$$
(7)

We break the remaining proof into the two cases \(d = 2\) and \(d > 2\) in the following two subsections.

C.3 Case 1: \(d = 2\), \(s \ge e\)

From Eq. (3) and using \(I_{0}(x)=\frac{1}{\pi }\int \limits _{0}^{\pi } \exp (x \cos (\theta ))d\theta \le \exp (x)\) we have,

$$\begin{aligned} \pi _0= \exp \left( -\frac{2}{h^2}\right) I_{0}\left( \frac{2}{h^2}\right) \le \exp \left( -\frac{2}{h^2}\right) \exp \left( \frac{2}{h^2}\right) \le 1 \end{aligned}$$

From Eq. (4) we have, for \(s\ge e\) and \(s^s\le n^{\epsilon _2}\)

$$\begin{aligned} \pi _s&\ge \left( \frac{2e}{h^2}\right) ^s \frac{\exp \left( -\frac{2}{h^2} - \frac{1}{12}\right) }{\sqrt{\pi }(2s)^{s+\frac{1}{2}}} = \frac{\exp \left( -\frac{2}{h^2}\right) }{h^{2s}}\frac{\exp (s- \frac{1}{12})}{\sqrt{2\pi } s^{s+\frac{1}{2}}}\\&\ge \frac{\exp \left( -\frac{2}{h^2}\right) }{h^{2s}}\frac{\exp \left( \frac{11}{12}\right) }{\sqrt{2\pi }}\frac{\exp (s- 1)}{s^{s+\frac{1}{2}}}\\&\ge \frac{\exp \left( -\frac{2}{h^2}\right) }{h^{2s}}\frac{\exp \left( \frac{11}{12}\right) }{\sqrt{2\pi }}\frac{1}{s^{s-\frac{1}{2}}}\quad \text {using, } \exp (s-1) \ge s\\&\ge \frac{\exp \left( -\frac{2}{h^2}\right) }{h^{2s}}\frac{1}{s^{s}} \quad \text {using, } s \ge e\\&\ge \frac{\exp \left( -\frac{2}{h^2}\right) }{h^{2s}}\frac{1}{n^{\epsilon _2}} \end{aligned}$$

From Eq. (7) we require:

$$\begin{aligned} \frac{1}{n^{\epsilon _2}}&\ge h^{2s}\exp \left( \frac{2}{h^2}\right) \left( \frac{3k}{n} + \frac{6\sqrt{2} k}{n} \sqrt{\ln \frac{ne}{k}}+2\sqrt{2} \sqrt{\frac{1}{n}\ln \frac{4}{\delta }}(3\sqrt{\frac{k}{n}}+1)\right) \\ \end{aligned}$$

Let \(k \le n^{\epsilon _3}\). To satisfy the above requirement we break it into the following cases:

$$\begin{aligned} {\frac{1}{(9+8\sqrt{2}) n^{\epsilon _2}}}&\ge h^{2s}\exp \left( \frac{2}{h^2}\right) {\frac{3k}{n}}\nonumber \\ \Longleftarrow n^{1-\epsilon _2 - \epsilon _3}&\ge (9+8{\sqrt{2}}) h^{2s}\exp \left({\frac{2}{h^2}}\right) \end{aligned}$$
(8)

Since \(\frac{k}{n}\sqrt{\ln \frac{ne}{k}} \le \left( \frac{k}{n}\right)^{\frac{3}{5}} \le n^{\frac{3(\epsilon _3-1)}{5}}\) for \(0\le \frac{k}{n} \le 0.5\), we have

$$\begin{aligned}&{\frac{6{\sqrt{2}}}{(9+8{\sqrt{2}})}}{\frac{1}{n^{\epsilon _2}}} \ge 6{\sqrt{2}}{\frac{k}{n}}{\sqrt{\ln {\frac{ne}{k}}}} h^{2s}\exp \left({\frac{2}{h^2}}\right) \nonumber \\&\quad \Longleftarrow n^{{\frac{3}{5}}(1-\epsilon _3) -\epsilon _2} \ge (9+8{\sqrt{2}}) h^{2s}\exp \left({\frac{2}{h^2}}\right) \end{aligned}$$
(9)

Assume \(\frac{1}{n^{\epsilon _4}}\sqrt{\ln \frac{4}{\delta }}= 1\) so that \(\delta = 4\exp (-n^{2\epsilon _4})\), with \(0< \epsilon _4 < \frac{1}{2}\).

$$\begin{aligned}& {\frac{6+2{\sqrt{2}}}{(9+8{\sqrt{2}})}}\frac{1}{n^{\epsilon _2}} \ge 2{\sqrt{2}}h^{2s}\exp \left({\frac{2}{h^2}}\right) {\sqrt{{\frac{1}{n}}\ln {\frac{4}{\delta }}}}\left(3\left({\frac{k}{n}}\right)^{0.5}+1\right)\nonumber \\&\quad \Longleftarrow {\frac{6+2{\sqrt{2}}}{(9+8{\sqrt{2}})}} {\frac{1}{n^{\epsilon _2}}} \ge h^{2s}\exp \left({\frac{2}{h^2}}\right) {\sqrt{{\frac{1}{n}}\ln {\frac{4}{\delta }}}}(6+2\sqrt{2})\quad {\text{since}},\, {\frac{1}{2}} \ge {\frac{k}{n}}\nonumber \\&\quad \Longleftrightarrow n^{{\frac{1}{2}}-\epsilon _4-\epsilon _2} \ge (9+8{\sqrt{2}}) \,h^{2s}\exp \left({\frac{2}{h^2}}\right) \,{\text{using}},\, {\frac{1}{n^{\epsilon _4}}}{\sqrt{\ln {\frac{4}{\delta}} }}= 1 \end{aligned}$$
(10)

We now summarize the last three conditions required in order to satisfy Eq. (7):

  • Breakdown point: we set \(\epsilon _3=\frac{1}{2}\), so as to obtain \(\frac{k}{n}\le n^{\epsilon _3-1}=n^{-\frac{1}{2}}\)

  • Confidence bound: we set \(\frac{1}{2}-\epsilon _4-\epsilon _2 = \frac{3}{5}(1-\epsilon _3) -\epsilon _2\), so that we get \(\epsilon _4=\frac{1}{5}\). This gives us, \(\delta =4\exp (-n^{2\epsilon _4})=4\exp (-n^{\frac{2}{5}})\)

  • Generality: we need \(\frac{1}{2}-\epsilon _4-\epsilon _2 \ge 0 \implies \frac{3}{10}-\epsilon _2 \ge 0\). Set \(\epsilon _2=\frac{2}{10}\), so that \(s^s\le n^{\epsilon _2}=n^{\frac{1}{5}}\)

  • Bandwidth: using \(s \le s\ln s \le \epsilon _2 \ln (n)=\frac{\ln (n)}{5}\), we require \(n^\frac{1}{10} \ge 20.4 h^{\frac{\ln (n)}{2.5}}\exp \left( \frac{2}{h^2}\right)\), which is satisfied if:

    $$\begin{aligned} n^\frac{1}{20} \ge \exp \left( \frac{2}{h^2}\right) \quad&\text {and} \quad n^\frac{1}{20} \ge 20.4 h^\frac{\ln (n)}{2.5}\\ \sqrt{\frac{40}{\ln (n)}}&\le h \le \frac{1.13}{(20.4)^\frac{2.5}{\ln (n)}} \end{aligned}$$

    Note that the permissible range for h improves with n.

C.4 Case 2: \(d > 2\), \(s \ge e\)

For \(d > 2\) we have from Eq. (4),

$$\begin{aligned} \pi _0 < \frac{(2e)^{\frac{d}{2} -1}\exp \left( -\frac{2}{h^2}+\frac{1}{h^4}\right) \Gamma \left( \frac{d}{2}\right) }{\sqrt{\pi }(d-2)^{\frac{d-1}{2}}} \le 1.2\exp \left( \frac{1}{h^4}\right) , \end{aligned}$$

where we have used that for \(d > 2\), we always have \(\frac{(2e)^{\frac{d}{2} -1}\Gamma \left( \frac{d}{2}\right) }{\sqrt{\pi }(d-2)^{\frac{d-1}{2}}} \le 1.2\). A short proof of this is given below. From Footnote 1, we deduce that we always have \(\Gamma (x)\le \sqrt{2 \pi } x^{x-\frac{1}{2}}\exp \left( \frac{1}{12x}-x\right)\) so that,

$$\begin{aligned} \frac{(2e)^{\frac{d}{2} -1}\Gamma \left( \frac{d}{2}\right) }{\sqrt{\pi }(d-2)^{\frac{d-1}{2}}}&\le \frac{2^{\frac{d}{2} -1}\sqrt{2\pi }\exp \left( \frac{d}{2}-1 +\frac{1}{6d}-\frac{d}{2}\right) \left( \frac{d}{2(d-2)}\right) ^{\frac{d-1}{2}}}{\sqrt{\pi }}\\&= \exp \left( \frac{1}{6d}-1\right) \left( \frac{d}{d-2}\right) ^{\frac{d-1}{2}}\\&\le 3\exp \left( \frac{1}{18}-1\right) \text { since both factors are decreasing in }\,d\text { over integer }d \ge 3\\&\approx 1.17 < 1.2 \end{aligned}$$

Coming back to the original argument, assume \(\frac{\exp (s- 1)\left( \frac{d}{2}\right) ^{\frac{d+1}{2}}}{(s+\frac{d}{2}-1)^{s+\frac{d}{2}-1}} \ge \frac{1}{n^{\epsilon _2}}\) and \(s\ge e\), so that

$$\begin{aligned} \pi _s&> \left( \frac{2e}{h^2}\right) ^s \frac{(2e)^{\frac{d}{2} -1}\exp \left( -\frac{2}{h^2} - \frac{1}{12}\right) \Gamma \left( \frac{d}{2}\right) }{\sqrt{\pi }(2s+d-2)^{s+\frac{d-1}{2}}} \ge \left( \frac{e}{h^2}\right) ^s \frac{(e)^{\frac{d}{2} -1}\exp \left( -\frac{2}{h^2} - \frac{1}{12}\right) \sqrt{2\pi \frac{d}{2}}\left( \frac{d}{2e}\right) ^\frac{d}{2}}{\sqrt{2\pi }(s+\frac{d}{2}-1)^{s+\frac{d-1}{2}}}\\&=\frac{1}{h^{2s}}\frac{\exp (s-\frac{2}{h^2} - \frac{13}{12})\left( \frac{d}{2}\right) ^{\frac{d+1}{2}}}{(s+\frac{d}{2}-1)^{s+\frac{d-1}{2}}}\\&\ge \frac{\exp \left( -\frac{2}{h^2}\right) }{h^{2s}} \frac{\exp (s- 1)\left( \frac{d}{2}\right) ^{\frac{d+1}{2}}}{(s+\frac{d}{2}-1)^{s+\frac{d}{2}-1}} \quad \text {using, } \exp \left( -\frac{1}{12}\right) \sqrt{s+\frac{d}{2}-1} \ge \exp \left( -\frac{1}{12}\right) \sqrt{e} \ge 1\\&\ge \frac{\exp \left( -\frac{2}{h^2}\right) }{h^{2s}} \frac{1}{n^{\epsilon _2}} \end{aligned}$$

From Eq. (7), we require,

$$\begin{aligned} \frac{1}{n^{\epsilon _2}} \ge h^{2s}\exp \left( \frac{2}{h^2}\right) \left( \frac{3k}{n}\left( 1.2\exp (\frac{1}{h^4}) + 2\sqrt{2}\sqrt{\ln \frac{ne}{k}}\right) +2\sqrt{2} \sqrt{\frac{1}{n}\ln \frac{4}{\delta }}(3\left( \frac{k}{n}\right) ^{\frac{1}{2}}+1)\right) \\ \end{aligned}$$

Let \(k \le n^{\epsilon _3}\). In order to satisfy the above requirement we break it into the following three cases:

$$\begin{aligned} {\frac{1.4}{18.8}} {\frac{1}{n^{\epsilon _2}}}&\ge 3.6h^{2s}\exp \left({\frac{2}{h^2}}+{\frac{1}{h^4}}\right) \frac{k}{n}\nonumber \\ \Longleftarrow n^{1-\epsilon _2 - \epsilon _3}&\ge 18.8 h^{2s}\exp \left( \left( 1+{\frac{1}{h^2}}\right) ^2\right) \end{aligned}$$
(11)

Since \(\frac{k}{n}\sqrt{\ln \left( \frac{ne}{k}\right) } \le \left( \frac{k}{n}\right) ^\frac{3}{5} \le n^\frac{3(\epsilon _3-1)}{5}\), we have

$$\begin{aligned} {\frac{8.5}{18.8}}{\frac{1}{n^{\epsilon _2}}}&\ge 6{\sqrt{2}}{\frac{k}{n}}{\sqrt{\ln \left({\frac{ne}{k}}\right)}} h^{2s}\exp \left({\frac{2}{h^2}}\right) \nonumber \\ \Longleftarrow n^{{\frac{3}{5}}(1-\epsilon _3) -\epsilon _2}&\ge 18.8 h^{2s}\exp \left({\frac{2}{h^2}}\right) \end{aligned}$$
(12)

Assume \(\frac{1}{n^{\epsilon _4}}\sqrt{\ln \frac{4}{\delta }}= 1\) so that \(\delta = 4\exp (-n^{2\epsilon _4})\), with \(0< \epsilon _4 < \frac{1}{2}\).

$$\begin{aligned} {\frac{8.9}{18.8}} {\frac{1}{n^{\epsilon _2}}}&\ge 2{\sqrt{2}}h^{2s}\exp \left({\frac{2}{h^2}}\right) {\sqrt{{\frac{1}{n}}\ln {\frac{4}{\delta }}}}\left(3\left({\frac{k}{n}}\right)^{0.5}+1\right)\nonumber \\ \Longleftarrow {\frac{8.9}{18.8}}{\frac{1}{n^{\epsilon _2}}}&\ge h^{2s}\exp \left({\frac{2}{h^2}}\right) {\sqrt{{\frac{1}{n}}\ln {\frac{4}{\delta }}}}(6+2{\sqrt{2}})\quad {\text{since}},\, {\frac{1}{2}} \ge {\frac{k}{n}}\nonumber \\ \Longleftrightarrow n^{{\frac{1}{2}}-\epsilon _4-\epsilon _2}&\ge 18.8 \,h^{2s}\exp \left({\frac{2}{h^2}}\right) \,{\text{using}},\, {\frac{1}{n^{\epsilon _4}}}{\sqrt{\ln {\frac{4}{\delta}} }}= 1 \end{aligned}$$
(13)

Below we instantiate the variables \(\epsilon _2,\epsilon _3\), and \(\epsilon _4\), which satisfy the above three conditions simultaneously.

  • Breakdown point: set \(\epsilon _3=\frac{1}{2}\), \(\frac{k}{n}\le n^{\epsilon _3-1}=n^{-\frac{1}{2}}\)

  • Confidence bound: set \(\frac{1}{2}-\epsilon _4-\epsilon _2 = \frac{3}{5}(1-\epsilon _3) -\epsilon _2\), so that \(\epsilon _4=\frac{1}{5}\). This gives, \(\delta =4\exp (-n^{2\epsilon _4})=4\exp (-n^\frac{2}{5})\)

  • Universality: we need \(\frac{1}{2}-\epsilon _4-\epsilon _2 \ge 0 \implies \frac{3}{10}-\epsilon _2 \ge 0\). Set \(\epsilon _2=\frac{2}{10}\), so that \(\frac{(s+\frac{d}{2}-1)^{s+\frac{d}{2}-1}}{\exp (s- 1)\left( \frac{d}{2}\right) ^{\frac{d+1}{2}}} \le n^{\epsilon _2}=n^\frac{1}{5}\)

  • Bandwidth: instantiating Eq. (13) gives \(n^\frac{1}{10} \ge 18.8 h^{2s}\exp \left( \frac{2}{h^2}\right)\); we require:

    $$\begin{aligned} n^{\frac{1}{20}} \ge \exp \left({\frac{2}{h^2}}\right)&\qquad {\text{and,}}\quad n^{\frac{1}{20}} \ge 18.8 h^{2s}\\ \Longleftrightarrow h \ge {\sqrt{{\frac{40}{\ln (n)}}}}&\qquad {\text{and,}} \quad h \le \left({\frac{n^{\frac{1}{20}}}{18.8}}\right)^{\frac{1}{2s}}\\ \end{aligned}$$

    Assume Eq. (13) holds. In order to satisfy Eq. (11) we further require,

    $$\begin{aligned} n^{\frac{3}{10}}&\ge 18.8 h^{2s}\exp \left( \left( 1+{\frac{1}{h^2}}\right) ^2\right) \\ \Longleftarrow n^{\frac{3}{10}}&\ge 18.8 h^{2s}\exp \left({\frac{2}{h^2}}\right) \exp \left( 1+{\frac{1}{h^4}}\right) \\ \Longleftarrow n^{\frac{3}{10}}&\ge n^{\frac{1}{10}}\exp \left( 1+{\frac{1}{h^4}}\right) \\ \Longleftarrow n^{\frac{1}{5}}&\ge \exp \left({\frac{1}{h^4}}\right) \quad \Longleftarrow h \ge \left({\frac{5}{\ln (n)-5}}\right)^{\frac{1}{4}} \end{aligned}$$

    Hence to satisfy all conditions on the bandwidth we require,

    $$\begin{aligned} \max \left\{ \sqrt{\frac{40}{\ln (n)}},\left( \frac{5}{\ln (n)-5}\right) ^\frac{1}{4}\right\}&\le h \le \left( \frac{n^\frac{1}{20}}{18.8}\right) ^\frac{1}{2s} \end{aligned}$$

    Note that here as well, the acceptable range of bandwidth improves with s, which in turn improves with n.
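To make this final bandwidth condition concrete, here is a small numerical sketch (ours; the values of \(n\) and \(s\) are illustrative only) that evaluates the admissible window \(\max \left\{ \sqrt{\frac{40}{\ln (n)}},\left( \frac{5}{\ln (n)-5}\right) ^\frac{1}{4}\right\} \le h \le \left( \frac{n^\frac{1}{20}}{18.8}\right) ^\frac{1}{2s}\) and reports whether it is non-empty:

```python
import math

def bandwidth_window(n, s):
    """Evaluate the admissible Gaussian-kernel bandwidth window derived above:
        max{ sqrt(40/ln n), (5/(ln n - 5))^(1/4) }  <=  h  <=  (n^(1/20)/18.8)^(1/(2s)).
    Returns (lower, upper); the window is non-empty iff lower <= upper."""
    ln_n = math.log(n)
    if ln_n <= 5:
        raise ValueError("need ln(n) > 5 for the lower bound to be defined")
    lower = max(math.sqrt(40.0 / ln_n), (5.0 / (ln_n - 5.0)) ** 0.25)
    upper = (n ** (1.0 / 20.0) / 18.8) ** (1.0 / (2.0 * s))
    return lower, upper

for n in (1e6, 1e12, 1e24):
    for s in (1, 4, 16):
        lo, hi = bandwidth_window(n, s)
        print(f"n = {n:.0e}, s = {s:2d}: {lo:.3f} <= h <= {hi:.3f}"
              f" ({'non-empty' if lo <= hi else 'empty'})")
```

Consistent with the note above, the window widens as s (and hence n) grows; with the explicit constants used here it becomes non-empty only for very large n.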

D Robust signal transforms using APIS


Table 4 is a subset of Table 1; it presents only the rows concerning signals that have a sparse representation in a basis such as Fourier, wavelet, etc., with the corruption being either a sparse vector or having a sparse representation in the noiselet basis.

A proof of the breakdown points for examples in the first row i.e. when the signal has an s-sparse representation in the Fourier, Hadamard, or noiselet bases, is given below. The proof is quite generic and holds for any transformation whose \(n \times n\) design matrix has all its entries of magnitude \(\mathscr {O}\left( \frac{1}{\sqrt{n}}\right)\), which is true of the design matrices of the Fourier, Hadamard, and noiselet transforms.

Lemma 7

Consider the \(n \times n\) (orthonormal) design matrix U corresponding to a transformation such as Fourier etc. Let \(\mathscr {A}\) be the union of subspaces of all signals that have an s-sparse representation in this basis i.e. \(\mathscr {A}= \left\{ {\mathbf{a}}^*: {\mathbf{a}}^*= U{{\boldsymbol{\alpha }}}^*, \left\| {{\boldsymbol{\alpha }}}^* \right\| _0 \le s\right\}\). Also let \(\mathscr {B}\) be the union of subspaces corresponding to k-sparse vectors i.e. \(\mathscr {B}= \left\{ {\mathbf{b}}^*: \left\| {\mathbf{b}}^* \right\| _0 \le k\right\}\). Then the pair \((\mathscr {A}, \mathscr {B})\) is \(\mu\)-SU incoherent (see Sect. 6.1) for \(\mu \le \frac{sk}{n}\) if every entry of the design matrix U satisfies \(| {U_{ij}} | \le \frac{1}{\sqrt{n}}\).

Proof

Note that to bound the SU incoherence constant, we only need to bound

$$\begin{aligned} \max _{\begin{array}{c} {\boldsymbol{\alpha }}\in S^{n-1}, \left\| {\boldsymbol{\alpha }} \right\| _0 \le s\\ \mathbf{b}\in S^{n-1}, \left\| \mathbf{b} \right\| _0 \le k \end{array}} \left( (U{\boldsymbol{\alpha }})^\top \mathbf{b}\right) ^2 \end{aligned}$$

Since \({\boldsymbol{\alpha }},\mathbf{b}\) are sparse vectors, we have \({\boldsymbol{\alpha }}^\top U^\top \mathbf{b}\le \left\| U_S^K \right\| _2\) where \(S = {{\,\mathrm{supp}\,}}({\boldsymbol{\alpha }})\) and \(K = {{\,\mathrm{supp}\,}}(\mathbf{b})\) are the supports of \({\boldsymbol{\alpha }}, \mathbf{b}\). Now, for any matrix \(A \in {\mathbb{R}}^{s \times k}\), we have \(\left\| A \right\| _2 \le \sqrt{sk}\cdot \left\| A \right\| _\infty\). Since \(U_S^K\) is effectively an \(s \times k\) matrix (its other rows and columns are zeroed out), this gives us \(\left\| U_S^K \right\| _2 \le \sqrt{sk}\cdot \nu\) where \(\nu {:}{=} \left\| U_S^K \right\| _\infty\). However, by assumption, \(\left\| U \right\| _\infty \le \frac{1}{\sqrt{n}}\), which gives us \((U{\boldsymbol{\alpha }})^\top \mathbf{b}\le \sqrt{\frac{sk}{n}}\) and thus, \(\mu \le \max _{\begin{array}{c} {\boldsymbol{\alpha }}\in S^{n-1}, \left\| {\boldsymbol{\alpha }} \right\| _0 \le s\\ \mathbf{b}\in S^{n-1}, \left\| \mathbf{b} \right\| _0 \le k \end{array}} \left( (U{\boldsymbol{\alpha }})^\top \mathbf{b}\right) ^2 \le \frac{sk}{n}\), which finishes the proof.

An equivalent incoherence bound can also be derived from the results of Foucart and Rauhut (2013, Ch. 12) who effectively show that \(\nu \le \frac{1}{\sqrt{n}}\), but we presented the above proof in our notation for the sake of convenience.
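As a quick numerical sanity check of the argument above (ours, not part of the paper's experiments), the following sketch uses the orthonormal DFT matrix, whose entries all have magnitude exactly \(\frac{1}{\sqrt{n}}\), draws random supports S, K, and verifies that \(\left\| U_S^K \right\| _2\) never exceeds \(\sqrt{\frac{sk}{n}}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, k = 256, 8, 5

# Orthonormal DFT matrix: every entry has magnitude exactly 1/sqrt(n).
U = np.fft.fft(np.eye(n)) / np.sqrt(n)

worst_ratio = 0.0
for _ in range(200):
    S = rng.choice(n, size=s, replace=False)   # support of the signal coefficients
    K = rng.choice(n, size=k, replace=False)   # support of the sparse corruption
    sub = U[np.ix_(K, S)]                      # the effective k x s submatrix U_S^K
    spec = np.linalg.norm(sub, 2)              # largest singular value
    worst_ratio = max(worst_ratio, spec / np.sqrt(s * k / n))

# Lemma 7 guarantees that this ratio can never exceed 1.
print(f"worst observed ||U_S^K||_2 / sqrt(sk/n) over 200 draws: {worst_ratio:.3f}")
```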

Corollary 1

APIS offers a linear rate of recovery when the signal is s-sparse in either the Fourier, Hadamard, or noiselet bases and the corruption is a k-sparse vector, whenever \(sk < \frac{n}{9}\).

Proof

Lemma 7 shows that the SU-incoherence constant in these cases is bounded by \(\mu \le \frac{sk}{n}\). Theorem 1 shows that APIS has a linear rate of recovery when \(\mu < \frac{1}{9}\). Combining the two finishes the proof.
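As an illustrative instance (the numbers are ours, chosen purely for concreteness): with \(n = 1024\) and \(s = 16\), the corollary tolerates any corruption sparsity \(k \le 7\) since

$$\begin{aligned} sk = 16 \cdot 7 = 112 < \frac{1024}{9} \approx 113.8, \end{aligned}$$

whereas \(k = 8\) would already give \(sk = 128 > \frac{n}{9}\).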

A proof of the breakdown points in the second and the third rows of Table 4, i.e. when the signal has an s-sparse representation in the Fourier or wavelet (Haar, Daubechies D4/D8) bases and the corruption has a k-sparse representation in the noiselet basis, or vice versa, is given below. We note that corruptions having a sparse representation in the noiselet, wavelet, or Fourier bases can nevertheless be dense as vectors i.e. have \(\left\| {\mathbf{b}}^* \right\| _0 = n\).

Lemma 8

APIS offers a linear rate of recovery when the signal is s-sparse in Fourier or wavelet (Haar, Daubechies D4/D8) bases and the corruption has a k-sparse representation in the noiselet basis, or vice versa, whenever \(sk < \frac{n}{27}\).

Proof

The proof is a generalization of the one used for Lemma 7. Notice that here we have two bases involved, one for the signal (e.g., wavelet) and one for the corruption (e.g., noiselet). Let \(U, V\) denote the design matrices corresponding to these two bases. Then it is easy to see that calculating the SU-incoherence constant \(\mu\) requires us to bound

$$\begin{aligned} \max _{\begin{array}{c} \mathbf{u}\in S^{n-1}, \left\| \mathbf{u} \right\| _0 \le s\\ \mathbf{v}\in S^{n-1}, \left\| \mathbf{v} \right\| _0 \le k \end{array}} \left( \mathbf{u}^\top U^\top V\mathbf{v}\right) ^2 \end{aligned}$$

Going as before, since \(\mathbf{u},\mathbf{v}\) are sparse vectors, we have \(\mathbf{u}^\top U^\top V\mathbf{v}\le \left\| U_S^\top V_K \right\| _2\) where \(S = {{\,\mathrm{supp}\,}}(\mathbf{u})\) and \(K = {{\,\mathrm{supp}\,}}(\mathbf{v})\) are the supports of \(\mathbf{u}, \mathbf{v}\) respectively. Now, as \(U_S^\top V_K\) is effectively an \(s \times k\) matrix (all its other rows and columns are zeroed out), we have \(\left\| U_S^\top V_K \right\| _2 \le \sqrt{sk}\cdot \nu\) where \(\nu {:}{=} \left\| U_S^\top V_K \right\| _\infty\). Now, results from Candes and Wakin (2008), Foucart and Rauhut (2013) effectively show that \(\nu \le \frac{3}{\sqrt{n}}\) for the (wavelet-noiselet) and (Fourier-noiselet) systems, where the wavelet could be either the Haar or Daubechies D4/D8 variants. Proceeding similarly as in Lemma 7 and then Corollary 1 finishes the proof.

E Handling unmodelled errors with APIS

Recall that in this case, we modify Eq. (1) to include an unmodelled error term.

$$\begin{aligned} \mathbf{y}= {\tilde{\mathbf{a}}}+ {\mathbf{b}}^*+ {\mathbf{e}}^*, \end{aligned}$$

where \({\tilde{\mathbf{a}}}\in \mathscr {A}, {\mathbf{b}}^*\in \mathscr {B}\) and \({\mathbf{a}}^*= {\tilde{\mathbf{a}}}+ {\mathbf{e}}^*\). We make no assumptions on \({\mathbf{e}}^*\), such as requiring it to belong to any union of subspaces: \({\mathbf{e}}^*\) can be completely arbitrary; in particular, it can be dense i.e. \(\left\| {\mathbf{e}}^* \right\| _0 = n\) and need not have a sparse representation in any particular basis. A useful case is when \({\tilde{\mathbf{a}}}\) can be taken to be the best approximation of \({\mathbf{a}}^*\) in the union of subspaces \(\mathscr {A}\). Below, we offer a recovery guarantee for APIS in this case. As in Appendix A, we will first present the main proof ideas with the special case of \(P = 1\) (the so-called known signal support case (Chen and De 2020)) where the union \(\mathscr {A}\) consists of a single subspace A. We will then extend the proof to the general case where both \(P, Q \ge 1\). Recall that we denote using P (resp. Q), the number of subspaces in the union \(\mathscr {A}= \bigcup _{i=1}^P A_i\) in which the signal \({\tilde{\mathbf{a}}}\) resides (resp. the union \(\mathscr {B}= \bigcup _{j=1}^Q B_j\) in which the corruption \({\mathbf{b}}^*\) resides).

E.1 Convergence analysis for \(P = 1\) i.e. \(\mathscr {A}= A\) but still \(Q \ge 1\)

We now present the proof in the case of known signal support.

Lemma 9

Suppose we obtain data as described in Eq. (2) where the two unions \(\mathscr {A}, \mathscr {B}\) are \(\mu\)-SU incoherent with \(\mu < \frac{1}{3}\) and in addition, the union \(\mathscr {A}\) contains a single subspace (the known signal support model). Then, for any \(\epsilon > 0\) within \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right)\) iterations, APIS offers \(\left\| \mathbf{a}^T - {\tilde{\mathbf{a}}} \right\| _2 \le \epsilon + \frac{4\sqrt{\mu }}{1 - 3\mu }\cdot \max _{B \in \mathscr {B}}\left\| \varPi _B({\mathbf{e}}^*) \right\| _2 + 2\cdot \left\| \varPi _A({\mathbf{e}}^*) \right\| _2\).

Proof

As in the proof of Lemma 1, denote \(\mathbf{p}= \varPi _A({\mathbf{b}}^*- \mathbf{b})\) and \({\mathbf{p}^+}= \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+})\). Let \(\mathfrak {Q}{:}{=} {B^+}\cap {B}^*\) denote the meet of the two subspaces, as well as denote the symmetric difference subspaces \(\mathfrak {P}{:}{=} {B^+}\cap ({B}^*)^\perp\) and \(\mathfrak {R}= {B}^*\cap ({B^+})^\perp\) (recall that \(A \ni {\mathbf{a}}^*, {B}^*\ni {\mathbf{b}}^*\)). We also use the shorthand \(\mathbf{r}{:}{=} {\mathbf{e}}^*- \varPi _A({\mathbf{e}}^*)\). In this case, we have \({\mathbf{a}^+}= \varPi _A({\tilde{\mathbf{a}}}+ {\mathbf{b}}^*+ {\mathbf{e}}^*- {\mathbf{b}^+})\). Since \(\varPi _A({\tilde{\mathbf{a}}}) = {\tilde{\mathbf{a}}}\), we get \({\mathbf{a}^+}- {\tilde{\mathbf{a}}}= \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+}+ {\mathbf{e}}^*)\) . Applying the triangle inequality gives us \(\left\| {\mathbf{a}^+}- {\tilde{\mathbf{a}}} \right\| _2 \le \left\| \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+}) \right\| _2 + \left\| \varPi _A({\mathbf{e}}^*) \right\| _2 = \left\| {\mathbf{p}^+} \right\| _2 + \left\| \varPi _A({\mathbf{e}}^*) \right\| _2\). Now, we have

$$\begin{aligned} {\mathbf{b}^+}= \varPi _{{B^+}}({\tilde{\mathbf{a}}}+ {\mathbf{b}}^*+ {\mathbf{e}}^*- \mathbf{a}) = \varPi _{{B^+}}({\mathbf{b}}^*+ {\mathbf{e}}^*- \varPi _A({\mathbf{b}}^*+ {\mathbf{e}}^*- \mathbf{b})) = \varPi _{{B^+}}({\mathbf{b}}^*- \mathbf{p}+ \mathbf{r}), \end{aligned}$$

and thus \({\mathbf{b}}^*- {\mathbf{b}^+}= \varPi _\mathfrak {R}({\mathbf{b}}^*) + \varPi _{{B^+}}(\mathbf{p}- \mathbf{r})\). The triangle inequality then gives us

$$\begin{aligned} \left\| {\mathbf{p}^+} \right\| _2 = \left\| \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+}) \right\| _2 \le \left\| \varPi _A(\varPi _\mathfrak {R}({\mathbf{b}}^*)) \right\| _2 + \left\| \varPi _A(\varPi _{{B^+}}(\mathbf{p}- \mathbf{r})) \right\| _2 \end{aligned}$$

Now, if we denote \(\mathbf{z}= {\mathbf{b}}^*- \mathbf{p}+ \mathbf{r}\), the projection step assures us, as before, that,

$$\begin{aligned} \left\| \varPi _\mathfrak {R}(\mathbf{z}) \right\| _2^2 \le \left\| \varPi _\mathfrak {P}(\mathbf{z}) \right\| _2^2 = \left\| \varPi _\mathfrak {P}(\mathbf{p}- \mathbf{r}) \right\| _2^2 \end{aligned}$$

since \(\varPi _{B^*}^\perp ({\mathbf{b}}^*) = \mathbf{0}\). Going as before gives us

$$\begin{aligned} \left\| \varPi _\mathfrak {R}({\mathbf{b}}^*) \right\| _2 \le \left\| \varPi _{B^+}(\mathbf{p}- \mathbf{r}) \right\| _2 + \left\| \varPi _{{B}^*}(\mathbf{p}- \mathbf{r}) \right\| _2 \end{aligned}$$

Applying incoherence results now tells us that

$$\begin{aligned} \left\| \varPi _A(\varPi _\mathfrak {R}({\mathbf{b}}^*)) \right\| _2 \le \sqrt{\mu }\cdot \left\| \varPi _\mathfrak {R}({\mathbf{b}}^*) \right\| _2&\le \sqrt{\mu }(\left\| \varPi _{B^+}(\mathbf{p}- \mathbf{r}) \right\| _2 + \left\| \varPi _{{B}^*}(\mathbf{p}- \mathbf{r}) \right\| _2)\\&\le 2\mu \left\| \mathbf{p} \right\| _2 + 2\sqrt{\mu }\cdot \max _{B \in \mathscr {B}}\left\| \varPi _B(\mathbf{r}) \right\| _2 \end{aligned}$$

Putting things together gives us

$$\begin{aligned} \left\| {\mathbf{p}^+} \right\| _2&\le \left\| \varPi _A(\varPi _\mathfrak {R}({\mathbf{b}}^*)) \right\| _2 + \left\| \varPi _A(\varPi _{{B^+}}(\mathbf{p}- \mathbf{r})) \right\| _2\\&\le 3\mu \left\| \mathbf{p} \right\| _2 + 3\sqrt{\mu }\cdot \max _{B \in \mathscr {B}}\left\| \varPi _B(\mathbf{r}) \right\| _2\\&\le 3\mu \left\| \mathbf{p} \right\| _2 + 3\sqrt{\mu }\cdot \max _{B \in \mathscr {B}}\left\| \varPi _B({\mathbf{e}}^*) \right\| _2, \end{aligned}$$

where the last step follows since \(\mathbf{r}= \varPi _A^\perp ({\mathbf{e}}^*)\) and projections are always non-expansive. Now, APIS initializes \(\mathbf{a}^0 = \mathbf{0}\), which means that initially, we have (using \({\mathbf{a}}^*= {\tilde{\mathbf{a}}}+ {\mathbf{e}}^*\))

$$\begin{aligned} \mathbf{p}^1 = \varPi _A({\mathbf{b}}^*- \mathbf{b}) = \varPi _A({\mathbf{b}}^*- \varPi _{{B^+}}({\mathbf{a}}^*+ {\mathbf{b}}^*)) \end{aligned}$$

and thus, \(\left\| \mathbf{p}^1 \right\| _2 \le \left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2\) since projections are always non-expansive. Thus, if \(\mu < \frac{1}{3}\), then the linear rate of convergence implies that within \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right)\) iterations, we will have

$$\begin{aligned} \left\| \mathbf{p}^T \right\| _2 \le \epsilon + \frac{4\sqrt{\mu }}{1 - 3\mu }\cdot \max _{B \in \mathscr {B}}\left\| \varPi _B({\mathbf{e}}^*) \right\| _2 + \left\| \varPi _A({\mathbf{e}}^*) \right\| _2. \end{aligned}$$

Using our earlier observation \(\left\| \mathbf{a}^T - {\tilde{\mathbf{a}}} \right\| _2 \le \left\| \mathbf{p}^T \right\| _2 + \left\| \varPi _A({\mathbf{e}}^*) \right\| _2\) then finishes the proof.

E.2 Application to simultaneous sparse corruptions and dense Gaussian noise case

Consider the robust linear regression problem with the true linear model being \({\mathbf{w}}^*\in {\mathbb{R}}^d\) where, apart from k adversarially corrupted points, all n points receive Gaussian noise i.e. \(\mathbf{y}= X^\top {\mathbf{w}}^*+ {\mathbf{b}}^*+ {\mathbf{e}}^*\), where \({\mathbf{e}}^*\sim \mathscr {N}(\mathbf{0}, \sigma ^2\cdot I_n)\). It is easy to see that for any fixed r-dimensional subspace S, we have \(\left\| \varPi _S({\mathbf{e}}^*) \right\| _2 \le \mathscr {O}\left( \sigma \sqrt{r}\right)\) with high probability. Thus, \(\left\| \varPi _A({\mathbf{e}}^*) \right\| _2 \le \mathscr {O}\left( \sigma \sqrt{d}\right)\) and taking a union bound over all \(\left( {\begin{array}{c}n\\ k\end{array}}\right)\) subspaces of k-sparse vectors tells us that \(\max _{B \in \mathscr {B}}\left\| \varPi _B({\mathbf{e}}^*) \right\| _2 \le \mathscr {O}\left( \sigma \sqrt{k \log n}\right)\).
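The following is a minimal numerical sketch (ours) of the projection-norm fact used above; the subspace is a fixed, randomly drawn r-dimensional subspace and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, sigma, trials = 2000, 25, 0.5, 500

# A fixed r-dimensional subspace S: orthonormalize a random n x r matrix once.
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))

norms = []
for _ in range(trials):
    e = sigma * rng.standard_normal(n)     # dense Gaussian noise e* ~ N(0, sigma^2 I_n)
    norms.append(np.linalg.norm(Q.T @ e))  # ||Pi_S(e*)||_2 since Q has orthonormal columns

print(f"mean ||Pi_S(e*)||_2 over {trials} draws: {np.mean(norms):.3f}")
print(f"sigma * sqrt(r)                        : {sigma * np.sqrt(r):.3f}")
# For a *fixed* subspace the norm concentrates around sigma*sqrt(r); the extra log(n)
# factor in the text comes from the union bound over all C(n, k) sparse supports.
```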

Lemma 9 shows that within \(T = \mathscr {O}\left( \log n\right)\) iterations, APIS guarantees a model vector \(\mathbf{w}^T\) such that \(\left\| X\mathbf{w}^T - X{\mathbf{w}}^* \right\| _2 \le \mathscr {O}\left( \sigma \left( \sqrt{d} + \sqrt{k \log n}\right) \right)\). Using \(\mathbf{w}^T - {\mathbf{w}}^*= X^\dagger (X\mathbf{w}^T - X{\mathbf{w}}^*)\) and the lower bounds on the eigenvalues of \(XX^\top\) from the proof of Lemma 3 tells us that \(\left\| \mathbf{w}^T - {\mathbf{w}}^* \right\| _2 = \mathscr {O}\left( \sigma \cdot \frac{\sqrt{d} + \sqrt{k \log n}}{\sqrt{n}}\right)\). Squaring both sides tells us that \(\left\| \mathbf{w}^T - {\mathbf{w}}^* \right\| _2^2 \le \mathscr {O}\left( \sigma ^2\left( \frac{(d+k)\ln n}{n}\right) \right)\).

Note that as \(n \rightarrow \infty\), the above model recovery error behaves as \(\left\| \mathbf{w}^T - {\mathbf{w}}^* \right\| _2^2 \le \mathscr {O}\left( k\log n/n\right)\). This guarantees consistent recovery if \(k\log n/n \rightarrow 0\) as \(n \rightarrow \infty\). This is a sharper result than previous works (Bhatia et al. 2015; Mukhoty et al. 2019) that do not offer consistent estimation even if \(k\log n/n \rightarrow 0\).

E.3 Applicability to robust non-parametric kernel ridge regression

The above results are also useful when applying APIS to robust kernel ridge regression. In several cases, the function (signal) we are trying to approximate need not be exactly represented in terms of the top s eigenvectors of the Gram matrix (see Sect. 3). However, Lemma 9 shows that APIS still offers recovery of the s-sparse representation of the signal in terms of the top-s eigenvectors. As discussed in Sect. 6.5, this still constitutes a universal model in the limit, and experiments in Sect. 7 show that APIS offers excellent reconstruction even under adversarial corruptions for sinusoids, polynomials, and their combinations.

E.4 Convergence analysis for the general case i.e. both \(P, Q \ge 1\)

We now present the proof in the general case.

Lemma 10

Suppose we obtain data as described in Eq. (2) where the two unions \(\mathscr {A}, \mathscr {B}\) are \(\mu\)-SU incoherent with \(\mu < \frac{1}{9}\). Then, for any \(\epsilon > 0\) within \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right)\) iterations, APIS offers \(\left\| \mathbf{a}^T - {\tilde{\mathbf{a}}} \right\| _2 \le \epsilon + \mathscr {O}\left( \max _{B \in \mathscr {B}}\left\| \varPi _B({\mathbf{e}}^*) \right\| _2 + \left\| \varPi _A({\mathbf{e}}^*) \right\| _2\right)\).

Proof

The analysis in the general case proceeds by extending the proof of Lemma 9 in a manner similar to how Lemma 2 extended the proof of Lemma 1. We define the quantities \(p {:}{=} \max _{A \in \mathscr {A}}\ \left\| \varPi _A({\mathbf{b}}^*- \mathbf{b}) \right\| _2, p^+ {:}{=} \max _{A \in \mathscr {A}}\ \left\| \varPi _A({\mathbf{b}}^*- {\mathbf{b}^+}) \right\| _2\) and correspondingly \(q {:}{=} \max _{B \in \mathscr {B}}\ \left\| \varPi _B({\tilde{\mathbf{a}}}- \mathbf{a}) \right\| _2, q^+ {:}{=} \max _{B \in \mathscr {B}}\ \left\| \varPi _B({\tilde{\mathbf{a}}}- {\mathbf{a}^+}) \right\| _2\) as in the proof of Lemma 2 and introduce two new notations \(u = \max _{A \in \mathscr {A}}\left\| \varPi _A({\mathbf{e}}^*) \right\| _2\) and \(v = \max _{B \in \mathscr {B}}\left\| \varPi _B({\mathbf{e}}^*) \right\| _2\). We get the following results

$$\begin{aligned} \left\| {\mathbf{a}^+}-{\tilde{\mathbf{a}}} \right\| _2&\le 3p^+ + 3u\\ p&\le 3\sqrt{\mu }(q + u)\\ q^+&\le 3\sqrt{\mu }(p + v) \end{aligned}$$

The second result above can be rewritten as \(p^+ \le 3\sqrt{\mu }(q^+ + u)\) which gives us \(p^+ \le 9\mu \cdot p + (9\mu \cdot v + 3\sqrt{\mu }\cdot u)\). Thus, we continue to get a linear rate of convergence whenever \(\mu < \frac{1}{9}\) and, since \(p^1 \le \mathscr {O}\left( \left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2\right)\), after \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right)\) iterations get \(p^T \le \frac{\epsilon }{3}\). Since \(\left\| \mathbf{a}^T -{\tilde{\mathbf{a}}} \right\| _2 \le 3p^T + 3u\) from above, we get

$$\begin{aligned} \left\| \mathbf{a}^T -{\tilde{\mathbf{a}}} \right\| _2 \le \epsilon + 10\mu \cdot v + 4\sqrt{\mu }\cdot u + 3u \le \epsilon + 5\cdot \max _{A \in \mathscr {A}}\left\| \varPi _A({\mathbf{e}}^*) \right\| _2 + 2\cdot \max _{B \in \mathscr {B}}\left\| \varPi _B({\mathbf{e}}^*) \right\| _2, \end{aligned}$$

which finishes the proof.
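For completeness, here is the geometric-sum step implicit in the linear-rate claim above (our unrolling of the recursion \(p^+ \le 9\mu \cdot p + (9\mu \cdot v + 3\sqrt{\mu }\cdot u)\)):

$$\begin{aligned} p^{T} \le (9\mu )^{T-1}\, p^1 + \sum _{t=0}^{T-2}(9\mu )^t\left( 9\mu \cdot v + 3\sqrt{\mu }\cdot u\right) \le (9\mu )^{T-1}\, p^1 + \frac{9\mu \cdot v + 3\sqrt{\mu }\cdot u}{1-9\mu }, \end{aligned}$$

so that for \(\mu < \frac{1}{9}\), the first term drops below \(\frac{\epsilon }{3}\) once \(T = \mathscr {O}\left( \log \frac{p^1}{\epsilon }\right)\), while the second term is the additive error floor in u and v that appears in the final bound.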

E.5 Application to recovery of compressible signals

The above result has applications in, for example, the sparse signal transform example, where \({\mathbf{e}}^*\) may model components of the signal not captured in the low-rank model. For instance, the signal may not come entirely from any single rank-s subspace \(A \in \mathscr {A}\) but merely have most of its weight concentrated on a single rank-s subspace \({A}^*\in \mathscr {A}\). \({\mathbf{e}}^*\) would then model the component of the signal orthogonal to \({A}^*\).

Consider an image \({\mathbf{a}}^*\) that is not wavelet-sparse, but \((s,\epsilon )\)-approximately wavelet sparse meaning that there exists an image \({\tilde{\mathbf{a}}}\) that is s wavelet-sparse, and \(\left\| {\mathbf{a}}^*- {\tilde{\mathbf{a}}} \right\| _2 \le \epsilon \cdot \left\| {\mathbf{a}}^* \right\| _2\). In particular, \({\tilde{\mathbf{a}}}\) can be taken to be the best s wavelet-sparse approximation of \({\mathbf{a}}^*\). This means that \(\left\| {\mathbf{e}}^* \right\| _2 \le \epsilon \cdot \left\| {\mathbf{a}}^* \right\| _2\). Lemma 10 shows that APIS offers a recovery of \({\tilde{\mathbf{a}}}\) to within \(\mathscr {O}\left( \epsilon \cdot \left\| {\mathbf{a}}^* \right\| _2\right)\) error within \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon \cdot \left\| {\mathbf{a}}^* \right\| _2}\right) = \mathscr {O}\left( \log \frac{1}{\epsilon }+ \log \frac{\left\| {\mathbf{b}}^* \right\| _2}{\epsilon \left\| {\mathbf{a}}^* \right\| _2}\right)\) iterations.

F Handling lack of incoherence with APIS

The results of Theorem 1 and Lemmas 7 and 8 rely on the notion of incoherence described in Sect. 6.1. However, in certain situations, the bases in question are not incoherent: for example, the Haar wavelet basis and the Fourier basis. When the signal is s-sparse in the Haar wavelet basis and the corruption is k-sparse in the Fourier basis, this lack of incoherence precludes the recovery guarantee offered by APIS.

In the following, we sketch an argument, taking the (Haar-Fourier) case as an example, to show that when the signal offers more structure, local incoherence can still be guaranteed and APIS, with suitable modifications made to the signal projection step \(\varPi _\mathscr {A}(\cdot )\) to exploit this additional structure (discussed later), can offer exact recovery at a linear rate.

As before, let \(U, V\) denote the design matrices corresponding to the signal and corruption bases. Calculating the SU-incoherence constant \(\mu\) again requires us to bound

$$\begin{aligned} \max _{\begin{array}{c} \mathbf{u}\in S^{n-1}, \left\| \mathbf{u} \right\| _0 \le s\\ \mathbf{v}\in S^{n-1}, \left\| \mathbf{v} \right\| _0 \le k \end{array}} \left( \mathbf{u}^\top U^\top V\mathbf{v}\right) ^2 \end{aligned}$$

Since \(\mathbf{u},\mathbf{v}\) are sparse vectors, we have \(\mathbf{u}^\top U^\top V\mathbf{v}\le \left\| U_S^\top V_K \right\| _2\) where \(S = {{\,\mathrm{supp}\,}}(\mathbf{u})\) and \(K = {{\,\mathrm{supp}\,}}(\mathbf{v})\) are the supports of \(\mathbf{u}, \mathbf{v}\). Now, the proof strategy of Lemma 8 fails here since for the Haar-Fourier pair, we get \(\left\| U_S^\top V_K \right\| _\infty {=}{:} \nu = 1\) (see, for example, (Zhou et al. 2016) for this lack of incoherence result).

This happens because there are individual basis vectors in the Haar and Fourier bases, say \(\mathbf{m}, \mathbf{n}\) whose inner product is unity i.e. \(\left| {\left\langle \mathbf{m}, \mathbf{n}\right\rangle} \right| = 1\) i.e. \(\mathbf{m}= \pm \mathbf{n}\). This allows a situation where there is a signal that is just 1-sparse in the Haar basis, specifically \({\mathbf{a}}^*= c\cdot \mathbf{m}\) for some \(c\in {\mathbb{R}}\), and the signal then gets corrupted by a corruption vector that is again just 1-sparse in the Fourier basis, specifically \({\mathbf{b}}^*= d\cdot \mathbf{n}\) for some \(d \in {\mathbb{R}}\). Exact recovery is information theoretically impossible since the algorithm would essentially receive \(\mathbf{y}= {\mathbf{a}}^*+ {\mathbf{b}}^*= (c\pm d)\cdot \mathbf{m}= (c \pm d)\cdot \mathbf{n}\) with no way of separating c and d (we use ± since \(\mathbf{m}, \mathbf{n}\) could be parallel or anti-parallel depending on convention).
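To make this failure concrete, the following sketch (ours) builds the orthonormal Haar matrix via the standard recursion and checks that its cross-coherence with the orthonormal Fourier basis reaches 1, attained by the constant Haar scaling vector and the DC Fourier vector:

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar matrix of size n x n (n a power of two), built by the
    standard recursion H_{2m} = [H_m (x) (1,1); I_m (x) (1,-1)] / sqrt(2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        top = np.kron(H, [1.0, 1.0])
        bottom = np.kron(np.eye(H.shape[0]), [1.0, -1.0])
        H = np.vstack([top, bottom]) / np.sqrt(2.0)
    return H  # rows are the orthonormal Haar basis vectors

n = 64
U = haar_matrix(n).T                       # columns = Haar basis vectors
V = np.fft.fft(np.eye(n)) / np.sqrt(n)     # columns = orthonormal Fourier basis vectors

cross = np.abs(U.T @ V)                    # |<haar_i, fourier_j>| for all pairs
print(f"max cross-coherence = {cross.max():.6f}")   # equals 1: the two bases share a vector
print("attained at (Haar index, Fourier index) =",
      np.unravel_index(cross.argmax(), cross.shape))
```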

F.1 Structured anti-concentrated signals

It turns out that one way to avoid the above problem is to ensure that our signal does not concentrate its mass on just a few coordinates (this prevents the signal from being 1-sparse). Although several ways may exist to enforce this, in the following definitions we present the notions of anti-concentrated signals and stratified sparsity. Specifically, suppose the uncorrupted signal is \({\mathbf{a}}^*= U\mathbf{u}\) with U being the design matrix of the Haar wavelet transformation.

Definition 3

A signal \({\mathbf{a}}^*= U\mathbf{u}\in {\mathbb{R}}^n\) is said to be \((\gamma ,s)\) anti-concentrated for some \(\gamma > 0\) if it is s-sparse i.e. \(\left\| \mathbf{u} \right\| _0 \le s\), and satisfies \(\left\| \mathbf{u} \right\| _\infty \le \frac{\gamma }{\sqrt{s}}\cdot \left\| \mathbf{u} \right\| _2\).
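A tiny illustration of this definition (ours): the smallest admissible \(\gamma\) for a given s-sparse \(\mathbf{u}\) is simply \(\sqrt{s}\,\left\| \mathbf{u} \right\| _\infty / \left\| \mathbf{u} \right\| _2\), which the sketch below evaluates for the two extremes discussed next.

```python
import numpy as np

def smallest_gamma(u):
    """Smallest gamma for which the s-sparse vector u is (gamma, s) anti-concentrated."""
    s = np.count_nonzero(u)
    return np.sqrt(s) * np.max(np.abs(u)) / np.linalg.norm(u)

s = 16
flat = np.ones(s)                              # all s coordinates equal      -> gamma = 1
spiked = np.array([100.0] + [1e-3] * (s - 1))  # nearly all mass on one entry -> gamma ~ sqrt(s)
print(f"flat  : gamma = {smallest_gamma(flat):.3f}")
print(f"spiked: gamma = {smallest_gamma(spiked):.3f}   (sqrt(s) = {np.sqrt(s):.3f})")
```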

Note that, in general, all s-sparse vectors are at least \((\sqrt{s},s)\)-anti-concentrated. However, a \((\sqrt{s},s)\)-anti-concentrated signal is allowed to put almost all its weight on a single coordinate. In contrast, the most anti-concentrated s-sparse vector, for which all s coordinates have equal magnitude, would be (1, s)-anti-concentrated. Before presenting the notion of stratified sparsity, we need to introduce the notion of strata for the Haar basis. The Haar basis elements can be arranged into \(\log n\)-many strata with the \(i^\mathrm{th}\) stratum containing \(n_i = 2^i\) basis elements (see the proof of Lemma 11 below for details).

Definition 4

A signal \({\mathbf{a}}^*= U\mathbf{u}\in {\mathbb{R}}^n\) is said to be \(\alpha\)-stratified sparse if for some \(\alpha \in (0,1)\), the support of \(\mathbf{u}\) is such that the \(i^\mathrm{th}\) stratum of the Haar basis contains at most \((n_i)^\alpha\) support elements of \(\mathbf{u}\). Note that this implies that the vector \(\mathbf{u}\) is s-sparse with \(s \le n^\alpha\) as well (although it need not necessarily be anti-concentrated as required by Def. 3).

F.2 Local incoherence with structured anti-concentrated signals

Given signals with additional structure as described above in Defs 3 and 4, the following result shows how local incoherence still continues to hold. Note that the following result starts giving vacuous bounds (\(\mu \rightarrow 1\)) for \((\gamma ,s)\)-anti-concentrated vectors as \(\gamma \rightarrow \sqrt{s}\). This is as expected since \(\gamma \approx \sqrt{s}\) allows signals that are very concentrated, e.g. signals that are close to being 1-sparse.

Lemma 11

Suppose the set of signals \(\mathscr {A}\) is the set of s-sparse (w.r.t. the Haar basis), \(\alpha\)-stratified and \((\gamma ,s)\)-anti-concentrated signals with \(\alpha \in (0,1), s = n^\alpha , \gamma \in [1,\sqrt{s}]\). Then, with respect to \(\mathscr {B}\) being the set of k-sparse corruption vectors (no further assumptions being imposed on corruption vectors), for some small universal constant \(c > 0\), the following (local) incoherence bound continues to hold.

$$\begin{aligned} \mu \le c\cdot \gamma ^2\cdot {\left\{ \begin{array}{ll} \frac{k^{2 + 4\alpha }}{s} &{} \alpha < \frac{1}{2} \\ \frac{k^{2 + 4\alpha }}{s} + \frac{k^2}{s}\log ^2\frac{n}{k^2} &{} \alpha = \frac{1}{2} \\ \frac{k^{2 + 4\alpha }}{s} + \frac{sk^2}{n} &{} \alpha > \frac{1}{2} \end{array}\right. } \end{aligned}$$

Proof

Calculating the SU-incoherence constant \(\mu\) now requires us to bound

$$\begin{aligned} \max _{\begin{array}{c} \mathbf{u}\in S^{n-1}, \left\| \mathbf{u} \right\| _0 \le s\\ \mathbf{u}\text { is}\, \alpha -\text {strat.}, (\gamma ,s)-\text {anti-conc.}\\ \mathbf{v}\in S^{n-1}, \left\| \mathbf{v} \right\| _0 \le k \end{array}} \left( \mathbf{u}^\top U^\top V\mathbf{v}\right) ^2 \end{aligned}$$

Then, applying the \(L_1-L_\infty\) Hölder’s inequality gives us

$$\begin{aligned} \left| {\mathbf{u}^\top U^\top V\mathbf{v}} \right| = \left| {\sum _{i \in S}\sum _{j \in K} \mathbf{u}_i\mathbf{v}_j(U^\top V)_{ij}} \right| \le \max _{i \in S, j \in K}\left| {\mathbf{u}_i\mathbf{v}_j} \right| \cdot \sum _{i \in S}\sum _{j \in K} \left| {(U^\top V)_{ij}} \right| \le \gamma k{\sqrt{s}}\cdot \bar{\nu }_{s,k}, \end{aligned}$$

where \((U^\top V)_{ij} = \mathbf{u}_i^\top \mathbf{v}_j\) (with \(\mathbf{u}_i, \mathbf{v}_j\) here denoting the \(i^\mathrm{th}\) and \(j^\mathrm{th}\) columns of U and V respectively) and \(\bar{\nu }_{s,k}\) is the largest average value of entries in the matrix \(U_S^\top V_K\) for any choice of sets S, K of sizes s, k respectively i.e.

$$\begin{aligned} \bar{\nu }_{s,k} = \max _{\begin{array}{c} S, K \subset [n]\\ |S| = s, |K| = k \end{array}}\frac{1}{sk}\sum _{i \in S}\sum _{j \in K} \left| {(U^\top V)_{ij}} \right| \end{aligned}$$

The above step is perhaps the most crucial in the proof since it shows that the incoherence constant \(\mu\) depends on the average of the \(\left| {(U^\top V)_{ij}} \right|\) values rather than the largest values, which are always \(\varOmega \left( 1\right)\) for the Haar-Fourier system.

The result of Krahmer and Ward (2014, Lemma 6.1) shows that upon indexing the Haar basis elements by \(i \in [1, \log n - 1]\) into the \(\log n\) strata and further indexing the \(2^i\) basis elements within the \(i^\mathrm{th}\) stratum using \(l \in [0, 2^i - 1]\), as well as indexing the Fourier basis elements by \(j \in \left[ -\frac{n}{2}+1,\frac{n}{2}\right] \backslash \left\{ 0\right\}\), we get a local incoherence bound

$$\begin{aligned} \left| {\mathbf{u}_{i,l}^\top \mathbf{v}_j} \right| \le \min \left\{ \frac{6\cdot 2^{\frac{i}{2}}}{\left| {j} \right| }, 3\pi \cdot 2^{-\frac{i}{2}}\right\} \le \mathscr {O}\left( \min \left\{ \frac{2^{\frac{i}{2}}}{\left| {j} \right| }, 2^{-\frac{i}{2}}\right\} \right) \end{aligned}$$

Noting that \(2^{-\frac{i}{2}} \le \frac{2^{\frac{i}{2}}}{j}\) iff \(i \ge 2\log j\), elementary calculations show that

$$\begin{aligned} \bar{\nu }_{s,k} \le \mathscr {O}\left( \frac{1}{sk} \sum _{j=1}^k \left( \sum _{i = 1}^{2\log j} 2^{i\alpha } \cdot \frac{2^{\frac{i}{2}}}{j} + \sum _{i = 2\log j}^{\log n} 2^{i\alpha } \cdot 2^{-\frac{i}{2}} \right) \right) \end{aligned}$$

If \(\alpha < \frac{1}{2}\), the second summation is that of a decreasing series. Thus, the second summation can be upper bounded in this case as

$$\begin{aligned} \sum _{i = 2\log j}^{\log n} 2^{i\left( \alpha - \frac{1}{2}\right) } \le \mathscr {O}\left( 2^{2\log j\left( \alpha - \frac{1}{2}\right) }\right) \le \mathscr {O}\left( j^{2\alpha - 1}\right) \end{aligned}$$

If \(\alpha = \frac{1}{2}\) then we have a much simpler summation

$$\begin{aligned} \sum _{i = 2\log j}^{\log n} 2^{0} \le \mathscr {O}\left( \log \frac{n}{j^2}\right) \end{aligned}$$

If \(\alpha > \frac{1}{2}\), the second summation is that of an increasing series. Thus, the second summation can be upper bounded in this case as

$$\begin{aligned} \sum _{i = 2\log j}^{\log n} 2^{i\left( \alpha - \frac{1}{2}\right) } \le \mathscr {O}\left( 2^{\log n\left( \alpha - \frac{1}{2}\right) }\right) \le \mathscr {O}\left( n^{\alpha - \frac{1}{2}}\right) = \mathscr {O}\left( \frac{s}{\sqrt{n}}\right) \end{aligned}$$

Thus, ignoring constant factors, we get

$$\begin{aligned} \sum _{i = 2\log j}^{\log n} 2^{i\left( \alpha - \frac{1}{2}\right) } \le {\left\{ \begin{array}{ll} j^{2\alpha - 1} &{} \alpha < \frac{1}{2} \\ \log \frac{n}{j^2} &{} \alpha = \frac{1}{2} \\ \frac{s}{\sqrt{n}} &{} \alpha > \frac{1}{2} \end{array}\right. } \end{aligned}$$

This gives us

$$\begin{aligned} \sum _{j=1}^k\sum _{i = 2\log j}^{\log n} 2^{i\left( \alpha - \frac{1}{2}\right) } \le {\left\{ \begin{array}{ll} k^{2\alpha } &{} \alpha < \frac{1}{2} \\ k\log \frac{n}{k^2} &{} \alpha = \frac{1}{2} \\ \frac{sk}{\sqrt{n}} &{} \alpha > \frac{1}{2} \end{array}\right. } \end{aligned}$$

Similarly, the first summation can be bounded, ignoring constant factors, as

$$\begin{aligned} \sum _{j=1}^k\left( \frac{1}{j}\cdot \sum _{i = 1}^{2\log j} 2^{i\left( \alpha + \frac{1}{2}\right) }\right) \le \sum _{j=1}^k\left( \frac{1}{j} \left( 2^{2\log j\left( \alpha + \frac{1}{2}\right) }\right) \right) \le \sum _{j=1}^k\left( \frac{1}{j} \cdot j^{2\alpha + 1}\right) \le k^{2\alpha + 1} \end{aligned}$$

Absorbing all constant factors into a single constant \(c > 0\) gives us

$$\begin{aligned} \bar{\nu }_{s,k} \le \frac{c}{sk} \left( k^{2\alpha + 1} + {\left\{ \begin{array}{ll} k^{2\alpha } &{} \alpha< \frac{1}{2} \\ k\log \frac{n}{k^2} &{} \alpha = \frac{1}{2} \\ \frac{sk}{\sqrt{n}} &{} \alpha> \frac{1}{2} \end{array}\right. } \right) = \frac{1}{s}\cdot {\left\{ \begin{array}{ll} k^{2\alpha } &{} \alpha < \frac{1}{2} \\ k^{2\alpha } + \log \frac{n}{k^2} &{} \alpha = \frac{1}{2} \\ k^{2\alpha } + \frac{s}{\sqrt{n}} &{} \alpha > \frac{1}{2} \end{array}\right. } \end{aligned}$$

which in turn gives us (renaming \(c^2 {=}{:} c\)),

$$\begin{aligned} \mu \le \left( \gamma k\sqrt{s}\cdot \bar{\nu }_{s,k}\right) ^2 \le c\cdot \gamma ^2\cdot {\left\{ \begin{array}{ll} \frac{k^{2 + 4\alpha }}{s} &{} \alpha < \frac{1}{2} \\ \frac{k^{2 + 4\alpha }}{s} + \frac{k^2}{s}\log ^2\frac{n}{k^2} &{} \alpha = \frac{1}{2} \\ \frac{k^{2 + 4\alpha }}{s} + \frac{sk^2}{n} &{} \alpha > \frac{1}{2} \end{array}\right. }. \end{aligned}$$

Thus, we do have incoherence when \(k \ll s\) as well as \(sk \ll n\). We can get a stronger result if the corruption is also assured to be anti-concentrated, specifically \({\mathbf{b}}^*= V\mathbf{v}\) where \(\mathbf{v}\) is k-sparse as well as \((\delta , k)\)-anti-concentrated for some \(\delta \in [1,\sqrt{k}]\). We present this improved result below and note that it offers superior dependence on k due to the additional structure in the corruption vector.

$$\begin{aligned} \mu \le c\cdot \gamma ^2\delta ^2\cdot {\left\{ \begin{array}{ll} \frac{k^{1 + 4\alpha }}{s} &{} \alpha < \frac{1}{2} \\ \frac{k^{1 + 4\alpha }}{s} + \frac{k}{s}\log ^2\frac{n}{k^2} &{} \alpha = \frac{1}{2} \\ \frac{k^{1 + 4\alpha }}{s} + \frac{sk}{n} &{} \alpha > \frac{1}{2} \end{array}\right. } \end{aligned}$$

This finishes the proof.
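As an illustrative instance of the bound in the statement of Lemma 11 (the parameter choices are ours, purely for concreteness): take \(\alpha = \frac{1}{2}\), so that \(s = \sqrt{n}\), and \(\gamma = \mathscr {O}\left( 1\right)\). The bound then reads

$$\begin{aligned} \mu \le c\cdot \gamma ^2\left( \frac{k^4}{\sqrt{n}} + \frac{k^2}{\sqrt{n}}\log ^2\frac{n}{k^2}\right) , \end{aligned}$$

which stays below the \(\frac{1}{9}\) threshold required by Theorem 1 whenever \(k = \mathscr {O}\left( n^{\frac{1}{8}}\right)\), up to constants and logarithmic factors.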

F.3 Algorithmic modifications to APIS

Since signals now have additional structure, specifically, anti-concentration and stratified sparsity, we need to modify the projection step \(\varPi _\mathscr {A}(\cdot )\) appropriately to handle both properties. Fortunately, simple modifications to the hard-thresholding operator address both.

Algorithm 3 (pseudocode figure, not reproduced here): projection onto stratified, sup-norm bounded sparse vectors in the Haar basis
Algorithm 4 (pseudocode figure, not reproduced here): Bounded Hard Thresholding (BHT)

Algorithm 3 gives the recipe to perform projections onto vectors that are s-sparse in the Haar basis, as well as stratified and sup-norm bounded (to ensure anti-concentration). The sup-norm bound M is a new hyperparameter in the algorithm and can be tuned according to the hyperparameter tuning procedure outlined in Sect. 7. After performing an inverse Haar transform, Algorithm 3 breaks up the resulting vector into the \(\log n\) strata offered by the Haar basis and performs Bounded Hard Thresholding (BHT) on each stratum separately.

Algorithm 4 presents BHT, a modified hard thresholding operation that admits the sup-norm restriction in addition to the sparsity restriction. Unlike the traditional hard-thresholding operator HT (see Sect. 4), which simply selects the top t coordinates according to magnitude, BHT uses the discounted magnitude of each coordinate to do so. The discounted magnitude d of a value \(v \in {\mathbb{R}}\) given a sup-norm bound \(M > 0\) is defined as

$$\begin{aligned} d = \sqrt{v^2 - (\left| {v} \right| - \min \left\{ \left| {v} \right| , M\right\} )^2} \end{aligned}$$

Note that if there is no sup-norm bound (equivalently if \(M = \infty\)), then the discounted magnitude is simply the magnitude i.e. \(d = \left| {v} \right|\). Thus, in the absence of a sup-norm bound, BHT becomes simply HT.
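Since the pseudocode figures for Algorithms 3 and 4 are not reproduced here, the following is a minimal sketch of both procedures as described above (ours; the function names, the coefficient-domain interface, the handling of the single scaling coefficient, and the per-stratum budgets taken from Def. 4 are our assumptions rather than the authors' exact pseudocode). The sketch operates directly on a vector of Haar coefficients and leaves the forward/inverse Haar transforms to the caller.

```python
import numpy as np

def bht(r, t, M):
    """Bounded Hard Thresholding (Algorithm 4 as described in the text): the best t-sparse
    approximation of r subject to the sup-norm bound M. Coordinates are ranked by their
    discounted magnitude d_i = sqrt(r_i^2 - (|r_i| - min(|r_i|, M))^2) and the selected
    coordinates are clipped to [-M, M] while preserving sign."""
    r = np.asarray(r, dtype=float)
    clipped = np.minimum(np.abs(r), M)
    d = np.sqrt(r**2 - (np.abs(r) - clipped)**2)   # discounted magnitudes
    top = np.argsort(-d)[:t]                       # the t largest discounted magnitudes
    p = np.zeros_like(r)
    p[top] = np.sign(r[top]) * clipped[top]
    return p

def stratified_haar_projection(coeffs, alpha, M):
    """Sketch of Algorithm 3 in the coefficient domain: split the Haar coefficients of a
    length n = 2^L vector into the L detail strata (stratum i holds 2^i coefficients) and
    apply BHT to each stratum with a budget of floor((2^i)^alpha) coefficients (Def. 4).
    The scaling coefficient at index 0 is simply clipped to [-M, M] (our assumption)."""
    coeffs = np.asarray(coeffs, dtype=float)
    n = coeffs.size
    L = int(round(np.log2(n)))
    assert 2**L == n, "length must be a power of two"
    out = np.zeros_like(coeffs)
    out[0] = np.sign(coeffs[0]) * min(abs(coeffs[0]), M)
    for i in range(L):                             # stratum i occupies indices [2^i, 2^{i+1})
        lo, hi = 2**i, 2**(i + 1)
        budget = max(1, int(np.floor((hi - lo)**alpha)))
        out[lo:hi] = bht(coeffs[lo:hi], budget, M)
    return out

rng = np.random.default_rng(0)
r = 3.0 * rng.standard_normal(64)
p = bht(r, t=5, M=1.0)
print("BHT: support size =", np.count_nonzero(p), ", sup-norm =", np.max(np.abs(p)))
q = stratified_haar_projection(r, alpha=0.5, M=1.0)
print("stratified projection: nonzeros per stratum =",
      [int(np.count_nonzero(q[2**i:2**(i + 1)])) for i in range(6)])
```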

To prove the optimality of Algorithm 3 it is sufficient to prove the optimality of the BHT procedure since Algorithm 3 simply applies it in a stratum-wise manner. We prove the optimality of BHT below.

Theorem 2

For any vector \(\mathbf{r}\in {\mathbb{R}}^n, t \in [n], M > 0\), let \(\mathbf{p}= \text {BHT} (\mathbf{r}, t, M)\) (see Algorithm 4). Then \(\mathbf{p}\) is t-sparse and satisfies \(\left\| \mathbf{p} \right\| _\infty \le M\). Moreover, let \(\mathbf{q}\in {\mathbb{R}}^n\) be any vector that is also t-sparse and satisfies \(\left\| \mathbf{q} \right\| _\infty \le M\). Then we must have \(\left\| \mathbf{r}- \mathbf{p} \right\| _2^2 \le \left\| \mathbf{r}- \mathbf{q} \right\| _2^2\) i.e. BHT does provide the optimal projection onto sup-norm bounded sparse vectors.

Proof

That \(\mathbf{p}\) is t-sparse and satisfies \(\left\| \mathbf{p} \right\| _\infty \le M\) is immediate from the steps taken by Algorithm 4. To prove the second part, let \(S = {{\,\mathrm{supp}\,}}(\mathbf{p}), T = {{\,\mathrm{supp}\,}}(\mathbf{q})\) be the support of the two vectors. Assume w.l.o.g. that \(|S| = t = |T|\). Now, we create a third vector \(\mathbf{k}\) with the same support as \(\mathbf{q}\) but with possibly different values. Specifically, set \(\mathbf{k}_j = \min \left\{ \left| {\mathbf{r}_j} \right| , M\right\} \cdot {{\,\mathrm{sign}\,}}\left\{ \mathbf{r}_j\right\}\) for \(j \in T\) and \(\mathbf{k}_j = 0\) for \(j \notin T\). Notice that \(\mathbf{k}\) is also t-sparse, \({{\,\mathrm{supp}\,}}(\mathbf{k}) = T\), and it satisfies \(\left\| \mathbf{k} \right\| _\infty \le M\) as well.

It is easy to see that \(\left\| \mathbf{r}- \mathbf{k} \right\| _2^2 \le \left\| \mathbf{r}- \mathbf{q} \right\| _2^2\) which captures our intuition that once we have chosen a t-sized support for our vector, the ideal thing to do is to fill coordinates in the support with the value \(\min \left\{ \left| {\mathbf{r}_j} \right| , M\right\} \cdot {{\,\mathrm{sign}\,}}(\mathbf{r}_j)\) (with the absolute value and sign operations being applied component-wise) which maximally preserves the vector in that coordinate subject to the sup-norm bound.

Now we prove that the choice of support made by BHT is optimal by showing that \(\left\| \mathbf{r}- \mathbf{p} \right\| _2^2 \le \left\| \mathbf{r}- \mathbf{k} \right\| _2^2\). To see this, we consider the following sequence of inequalities. We will find the shorthand \(\mathbf{m}_i {:}{=} \left| {\mathbf{r}_i} \right| - \min \left\{ \left| {\mathbf{r}_i} \right| , M\right\}\) very useful in the following. This is because

$$\begin{aligned} {{\,\mathrm{sign}\,}}(\mathbf{r}_j)\cdot \mathbf{m}_j = \mathbf{r}_j - \min \left\{ \left| {\mathbf{r}_j} \right| , M\right\} \cdot {{\,\mathrm{sign}\,}}(\mathbf{r}_j) \end{aligned}$$

is simply the residual error at any coordinate that is in the support of either \(\mathbf{p}\) or \(\mathbf{k}\). Note that we have

$$\begin{aligned} \left\| \mathbf{r}- \mathbf{k} \right\| _2^2&= \sum _{i \in T}\mathbf{m}_i^2 + \sum _{j \in S \backslash T}\mathbf{r}_j^2 + \sum _{l \notin S \cup T}\mathbf{r}_l^2\\ \left\| \mathbf{r}- \mathbf{p} \right\| _2^2&= \sum _{i \in S}\mathbf{m}_i^2 + \sum _{j \in T \backslash S}\mathbf{r}_j^2 + \sum _{l \notin S \cup T}\mathbf{r}_l^2 \end{aligned}$$

This gives us

$$\begin{aligned} \left\| \mathbf{r}- \mathbf{k} \right\| _2^2 - \left\| \mathbf{r}- \mathbf{p} \right\| _2^2 = \sum _{i \in S\backslash T}(\mathbf{r}_i^2 - \mathbf{m}_i^2) - \sum _{j \in T\backslash S}(\mathbf{r}_j^2 - \mathbf{m}_j^2) = \sum _{i \in S\backslash T}\mathbf{d}_i^2 - \sum _{j \in T\backslash S}\mathbf{d}_j^2, \end{aligned}$$

where \(\mathbf{d}_i = \sqrt{\mathbf{r}_i^2 - \mathbf{m}_i^2}\) is the discounted magnitude of the \(i^\mathrm{th}\) coordinate as defined above. However, since BHT always chooses the t coordinates with highest discounted magnitude, we must have \(\sum _{i \in S\backslash T}\mathbf{d}_i^2 \ge \sum _{j \in T\backslash S}\mathbf{d}_j^2\) since \(|S| = t = |T|\). Thus, we get \(\left\| \mathbf{r}- \mathbf{k} \right\| _2^2 \ge \left\| \mathbf{r}- \mathbf{p} \right\| _2^2\) and since we have \(\left\| \mathbf{r}- \mathbf{k} \right\| _2^2 \le \left\| \mathbf{r}- \mathbf{q} \right\| _2^2\) from the construction of \(\mathbf{k}\) as we saw earlier, this finishes the proof.
