Abstract
In this paper, we study the problem of estimating latent variable models with arbitrarily corrupted samples in high dimensional space (i.e., \(d\gg n\)) where the underlying parameter is assumed to be sparse. Specifically, we propose a method called Trimmed (Gradient) Expectation Maximization, which adds a gradient trimming step and a hard thresholding step to the Expectation step (E-step) and the Maximization step (M-step), respectively. We show that under some mild assumptions and with an appropriate initialization, the algorithm is corruption-proof and converges geometrically to a (near) optimal statistical rate when the fraction of corrupted samples \(\epsilon\) is bounded by \({\tilde{O}}\big (\frac{1}{\sqrt{n}}\big )\). Moreover, we apply our general framework to three canonical models: mixture of Gaussians, mixture of regressions, and linear regression with missing covariates. Our theory is supported by thorough numerical results.
1 Introduction
As one of the most popular techniques for maximum likelihood estimation in mixture models or incomplete data problems, the Expectation Maximization (EM) algorithm has been widely applied to many areas such as genomics (Laird 2010), finance (Faria and Gonçalves 2013), and crowdsourcing (Dawid and Skene 1979). Although the EM algorithm is well known to converge to an empirically good local estimator (Wu et al. 1983), finite sample statistical guarantees for its performance were not established until recent studies (Balakrishnan et al. 2017b; Zhu et al. 2017; Wang et al. 2015; Yi and Caramanis 2015). Specifically, the first local convergence theory and finite sample statistical rates of convergence for the classical EM and its gradient ascent variant (gradient EM) were established in Balakrishnan et al. (2017b). Later, Wang et al. (2015) extended the classical EM and gradient EM algorithms to the high dimensional sparse setting; the key idea of their methods is an additional truncation step after the M-step, which exploits the intrinsic sparse structure of high dimensional latent variable models. Yi and Caramanis (2015) also studied the high dimensional sparse EM algorithm and proposed a method that uses a regularized M-estimator in the M-step. Recently, Zhu et al. (2017) considered the computational issues of these methods in the high dimensional sparse case. They proposed a method called VRSGEM (Variance Reduced Stochastic Gradient EM), which combines the idea of SVRG (Stochastic Variance Reduced Gradient) (Johnson and Zhang 2013) with the high dimensional gradient EM algorithm. Their method has lower gradient complexity while achieving almost the same statistical estimation errors as the previous ones.
Although the above methods can achieve the (near) optimal minimax rate for some statistical models such as the Gaussian mixture model, mixture of regressions, and linear regression with missing covariates (see Sect. 3 for details), all of these results assume that the data samples are free of corruptions and satisfy certain statistical assumptions, such as sub-Gaussianity. This means that arbitrary corruptions among the data samples may cause the dataset to violate the statistical assumptions required for the convergence of these methods, or may even cause them to incur unacceptably large statistical estimation errors (see Fig. 1 for experimental studies). Thus, the classical EM algorithm and its variants are sensitive to such corruptions. Although statistical estimation with arbitrary corruptions has long been a focus of robust statistics (Huber 2011), it is still unknown whether there exists a variant of the (gradient) EM algorithm that is robust to arbitrary corruptions while retaining finite sample statistical guarantees as in the non-corrupted case.
To address the aforementioned issue, in this paper, we study the problem of statistical estimation of latent variable models with arbitrarily corrupted samples in high dimensional spaceFootnote 1 (i.e., \(d\gg n\)) where the underlying parameter is assumed to be sparse. Specifically, we propose a new algorithm called Trimmed (Gradient) Expectation Maximization, which attaches a gradient trimming step and a hard thresholding step to the E-step and M-step in each iteration, respectively. We show that under certain conditions, our algorithm is robust against corruption and converges with a statistical estimation error that is (near) statistically optimal. Below is a summary of our main contributions.
1. We show that, given an appropriate initialization \(\beta ^{\text {init}}\), i.e., \(\Vert \beta ^{\text {init}}-\beta ^*\Vert _2\le \kappa \Vert \beta ^*\Vert _2\) for some constant \(\kappa \in (0,1)\), if the model satisfies some additional assumptions, the iterative solution sequence \(\beta ^{t}\) of our algorithm satisfies \(\Vert \beta ^t-\beta ^*\Vert _2\le {\tilde{O}}\left( c_1 \rho ^{t}+\sqrt{s^*}c_2(\epsilon \log (nd)+\sqrt{\frac{\log d}{n}})\right)\) with high probability, where \(\rho \in (0,1)\), \(c_1, c_2\) are constants dependent on the model, \(\epsilon\) is the fraction of the corrupted samples, and \(s^*\) is the sparsity of the underlying parameter \(\beta ^*\). In particular, when \(c_2\) is a constant and \(\epsilon \le O\left( \frac{1}{\sqrt{n}\log (nd)}\right)\), the above estimation error geometrically converges to \(O\left( \sqrt{\frac{s^*\log d}{n}}\right)\), which is statistically optimal. This means that our algorithm is corruption-proof for a certain level of corruption that is only dependent on the sample size, which is quite useful in the high dimensional setting.
2. We implement our algorithm on three canonical models: mixture of Gaussians, mixture of regressions and linear regression with missing covariates. Experimental results on these models support our theoretical analysis.
Some background, lemmas and all the proofs are included in the “Appendix”.
2 Related work
There are mainly two lines of work on the EM algorithm. The first focuses on its statistical guarantees (Balakrishnan et al. 2017b; Zhu et al. 2017; Wang et al. 2015; Yi and Caramanis 2015). However, there are several differences from our results. First, as mentioned above, although we study the same statistical setting as these previous works, our method is corruption-proof whereas the performance of their algorithms is heavily affected by outliers. Second, we use a robust version of the gradient instead of the original gradient, which makes the proofs of our theoretical results different from those of the papers above. The second line of work focuses on practical performance, and there are several robust variants of the EM algorithm such as Aitkin and Wilson (1980); Yang et al. (2012). However, these methods are incomparable with ours. First, we mainly focus on the statistical setting and statistical guarantees, while those methods come with no theoretical guarantees. Second, those methods can only be used in the low dimensional case, while we focus on the high dimensional sparse case. Thus, to the best of our knowledge, no previous variant of the EM algorithm is both robust to corruptions and equipped with statistical guarantees. In the following, we therefore only compare with other methods that are close to ours.
Diakonikolas et al. (2016, 2017, 2018), Chen et al. (2013) studied the problem of robustly estimating the mixture of distributions. However, some of them are not computationally practical as they rely on the rather time-consuming ellipsoid method. Moreover, these methods in general cannot be extended to the distributed or Byzantine setting (Chen et al. 2017), while ours can be easily extended to such scenarios.
Du et al. (2017), Balakrishnan et al. (2017a), Li (2017), Suggala et al. (2019), Dalalyan and Thompson (2019), Thompson and Dalalyan (2018) studied the robust high dimensional sparse estimation problem for some specified tasks, such as GLM, linear regression, mean and covariance matrix estimation. However, none of them considered estimating the latent variable models and thus is quite different from ours.
Recently, several robust methods have been proposed based on (stochastic) gradient descent, such as Alistarh et al. (2018), Chen et al. (2017), Yin et al. (2018), Prasad et al. (2018), Holland (2018). However, none of them studies the latent variable models and all of them consider only the low dimensional case.
The work most closely related to ours is Liu et al. (2019), which recently investigated the robust high dimensional sparse M-estimation problem (such as linear regression and logistic regression) by combining hard thresholding with trimming steps. However, their results are incomparable with ours. In particular, their method only applies to M-estimation and they only consider the case where the loss function is convex, while we focus on latent variable models and the EM algorithm, where the loss function (the Q-function) is non-convex. Thus, we cannot directly use their proofs to obtain our theoretical results.
3 Preliminaries
Let Y and Z be two random variables taking values in the sample spaces \({\mathcal {Y}}\) and \({\mathcal {Z}}\), respectively. Suppose that the pair (Y, Z) has a joint density function \(f_{\beta ^*}\) that belongs to some parameterized family \(\{f_{\beta ^*}|\beta ^* \in \varOmega \}\). Rather than observing the whole pair (Y, Z), we observe only the component Y; the component Z can thus be viewed as the missing or latent structure. We let \(h_{\beta }(y)\) denote the marginal density of Y obtained by integrating out the latent variable Z, i.e., \(h_\beta (y)=\int _{{\mathcal {Z}}} f_\beta (y, z) dz.\) Let \(k_{\beta }(z|y)\) be the density of Z conditional on the observed variable \(Y=y\), that is, \(k_\beta (z|y)=\frac{f_\beta (y, z)}{h_\beta (y)}.\)
Given n observations \(y_1, y_2, \ldots , y_n\) of Y, the EM algorithm aims to maximize the log-likelihood \(\max _{\beta \in \varOmega }\ell _n(\beta ) = \sum _{i=1}^n\log h_\beta (y_i).\) Due to the unobserved latent variable Z, it is often difficult to evaluate \(\ell _n(\beta )\) directly. Thus, we consider a lower bound of \(\ell _n(\beta )\). By Jensen’s inequality, we have

\(\frac{1}{n}\ell _n(\beta )\ge \frac{1}{n}\sum _{i=1}^n\int _{{\mathcal {Z}}} k_{\beta '}(z|y_i)\log f_\beta (y_i, z)\,dz-\frac{1}{n}\sum _{i=1}^n\int _{{\mathcal {Z}}} k_{\beta '}(z|y_i)\log k_{\beta '}(z|y_i)\,dz. \qquad (1)\)

Let \(Q_n(\beta ; \beta ')=\frac{1}{n}\sum _{i=1}^n q_i(\beta ;\beta ')\), where

\(q_i(\beta ;\beta ')=\int _{{\mathcal {Z}}} k_{\beta '}(z|y_i)\log f_\beta (y_i, z)\,dz. \qquad (2)\)

Also, it is convenient to let \(Q(\beta ; \beta ')\) denote the expectation of \(Q_n(\beta ; \beta ')\) w.r.t \(\{y_i\}_{i=1}^n\), that is,

\(Q(\beta ; \beta ')={\mathbb {E}}_{Y}\left[ \int _{{\mathcal {Z}}} k_{\beta '}(z|Y)\log f_\beta (Y, z)\,dz\right] . \qquad (3)\)
We can see that the second term on the right hand side of (1) is not dependent on \(\beta\). Thus, given some fixed \(\beta '\), we can maximize the lower bound function \(Q_n(\beta ; \beta ')\) over \(\beta\) to obtain sufficiently large \(\ell _n(\beta )-\ell _n (\beta ')\). Thus, in the t-th iteration of the standard EM algorithm, we can evaluate \(Q_n(\cdot ; \beta ^t)\) at the E-step and then perform the operation of \(\max _{\beta \in \varOmega }Q_n(\beta ; \beta ^t)\) at the M-step. See McLachlan and Krishnan (2007) for more details.
In addition to the exact maximization implementation of the M-step, we also consider a gradient ascent implementation of the M-step, which performs an approximate maximization via a single gradient ascent step.
Gradient EM Procedure (Balakrishnan et al. 2017b) When \(Q_n(\cdot ; \beta ^t)\) is differentiable, the update of \(\beta ^t\) to \(\beta ^{t+1}\) consists of the following two steps.
- E-step: Evaluate the functions in (2) to compute \(Q_n(\cdot ; \beta ^t)\).
- M-step: Update \(\beta ^{t+1}=\beta ^t+\eta \nabla Q_n(\beta ^t; \beta ^t)\), where \(\nabla\) denotes the derivative of \(Q_n\) w.r.t the first component and \(\eta\) is the step size.
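To make the procedure concrete, the following is a minimal Python sketch of the gradient EM update (an illustration only, not the authors' implementation); `grad_q` is a hypothetical model-specific callback that returns the \(n\times d\) matrix of per-sample gradients \(\nabla q_i(\beta ; \beta )\), so that averaging its rows gives \(\nabla Q_n(\beta ; \beta )\).

```python
import numpy as np

def gradient_em(grad_q, data, beta_init, eta=0.1, n_iters=100):
    """Vanilla gradient EM: beta^{t+1} = beta^t + eta * grad Q_n(beta^t; beta^t).

    grad_q(beta, data): hypothetical callback returning the n x d matrix of
    per-sample gradients [grad q_i(beta; beta)]_{i=1..n}.
    """
    beta = np.array(beta_init, dtype=float)
    for _ in range(n_iters):
        per_sample_grads = grad_q(beta, data)               # E-step: evaluate the gradients of Q_n
        beta = beta + eta * per_sample_grads.mean(axis=0)   # M-step: one gradient ascent step
    return beta
```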
Next, we give some examples that use the gradient EM algorithm. Note that they are the typical examples for studying the statistical property of EM algorithm (Wang et al. 2015; Balakrishnan et al. 2017b; Yi and Caramanis 2015; Zhu et al. 2017).
Gaussian Mixture Model Let \(y_1, \ldots , y_n\) be n i.i.d. samples of \(Y\in {\mathbb {R}}^d\) with

\(Y=Z\cdot \beta ^*+V, \qquad (4)\)

where Z is a Rademacher random variable (i.e., \({\mathbb {P}}(Z=+1)= {\mathbb {P}}(Z=-1)=\frac{1}{2}\)), and \(V\sim {\mathcal {N}}(0, \sigma ^2 I_d)\) is independent of Z for some known standard deviation \(\sigma\). In our high dimensional setting, we assume that \(\Vert \beta ^*\Vert _0=s^*\), i.e., \(\beta ^*\) is sparse.Footnote 2
For Gaussian Mixture Model, we have
where \(w_\beta (y)=\frac{1}{1+\exp (-\langle \beta , y\rangle /\sigma ^2)}\).
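As an illustration, below is a hedged sketch of the per-sample gradient for this model. It uses the weight \(w_\beta (y)\) defined above and assumes the standard form \(\nabla q_i(\beta ; \beta )=(2w_\beta (y_i)-1)y_i-\beta\) from the gradient EM literature (the displayed formula is not reproduced here, so this form should be read as an assumption of the sketch).

```python
import numpy as np

def grad_q_gmm(beta, Y, sigma):
    """Per-sample gradients for the Gaussian mixture model (sketch).

    Y: n x d matrix of observations y_i.  Assumed form:
    grad q_i(beta; beta) = (2 * w_beta(y_i) - 1) * y_i - beta,
    with w_beta(y) = 1 / (1 + exp(-<beta, y> / sigma^2)).
    """
    w = 1.0 / (1.0 + np.exp(-(Y @ beta) / sigma ** 2))   # E-step weights w_beta(y_i)
    return (2.0 * w - 1.0)[:, None] * Y - beta           # n x d matrix of per-sample gradients
```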
Mixture of (Linear) Regressions Model Let \((x_1, y_1)\), \((x_2, y_2)\), \(\ldots , (x_n, y_n)\) be n i.i.d. samples of \((X, Y)\in {\mathbb {R}}^d\times {\mathbb {R}}\) with

\(Y=Z\langle \beta ^*, X\rangle +V, \qquad (6)\)

where \(X\sim {\mathcal {N}}(0, I_d)\), \(V\sim {\mathcal {N}}(0, \sigma ^2)\),Footnote 3 \(Z\) is a Rademacher random variable, and X, V, Z are independent. In the high dimensional case, we assume that \(\Vert \beta ^*\Vert _0=s^*\), i.e., \(\beta ^*\) is sparse.
In this case, we have
where \(w_\beta (x_i, y_i)=\frac{1}{1+\exp (-y_i\langle \beta ,x_i \rangle /\sigma ^2)}\).
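Similarly, a hedged sketch of the per-sample gradient for the mixture of regressions model is given below; it assumes the standard form \(\nabla q_i(\beta ; \beta )=(2w_\beta (x_i, y_i)-1)y_i x_i-\langle x_i, \beta \rangle x_i\), which is not reproduced in the text and should be treated as an assumption of the sketch.

```python
import numpy as np

def grad_q_mrm(beta, X, y, sigma):
    """Per-sample gradients for the mixture of regressions model (sketch).

    X: n x d design matrix, y: length-n response vector.  Assumed form:
    grad q_i(beta; beta) = (2 * w_beta(x_i, y_i) - 1) * y_i * x_i - <x_i, beta> * x_i,
    with w_beta(x, y) = 1 / (1 + exp(-y * <beta, x> / sigma^2)).
    """
    w = 1.0 / (1.0 + np.exp(-(y * (X @ beta)) / sigma ** 2))   # E-step weights
    return ((2.0 * w - 1.0) * y - X @ beta)[:, None] * X       # n x d matrix of per-sample gradients
```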
Linear Regression with Missing Covariates We assume that \(Y\in {\mathbb {R}}\) and \(X\in {\mathbb {R}}^d\) satisfy

\(Y=\langle \beta ^*, X\rangle +V, \qquad (8)\)

where \(X\sim {\mathcal {N}}(0, I_d)\) and \(V\sim {\mathcal {N}}(0, \sigma ^2)\) are independent. In our high dimensional setting, we assume that \(\Vert \beta ^*\Vert _0=s^*\), i.e., \(\beta ^*\) is sparse. Let \(x_1, x_2, \ldots , x_n\) be n observations of X with each coordinate of \(x_i\) missing (unobserved) independently with probability \(p_m\in [0,1)\).
In this case, we have
where the functions \(m_\beta (x_i^{\text {obs}},y_i)\in {\mathbb {R}}^d\) and \(K_\beta (x_i^{\text {obs}}, y_i)\in {\mathbb {R}}^{d\times d}\) are defined as:
and
where the vector \(z_i \in {\mathbb {R}}^d\) is defined by \(z_{i,j}=1\) if \(x_{i,j}\) is observed and \(z_{i,j}=0\) if \(x_{i,j}\) is missing, and \(\odot\) denotes the Hadamard product of matrices.
Next, we provide several definitions on the required properties of functions \(Q_n(\cdot ; \cdot )\) and \(Q(\cdot ; \cdot )\). Note that some of them have been used in the previous studies on EM (Balakrishnan et al. 2017b; Wang et al. 2015; Zhu et al. 2017).
Definition 1
Function \(Q(\cdot ; \beta ^*)\) is self-consistent if \(\beta ^*=\arg \max _{\beta \in \varOmega }Q(\beta ; \beta ^*).\) That is, \(\beta ^*\) maximizes the lower bound of the log likelihood function.
Definition 2
\(Q(\cdot ; \cdot )\) is called Lipschitz–Gradient-2(\(\gamma , {\mathcal {B}}\)), if for the underlying parameter \(\beta ^*\) and any \(\beta \in {\mathcal {B}}\) for some set \({\mathcal {B}}\), the following holds
We note that there are some differences between the definition of Lipschitz–Gradient-2 and the Lipschitz continuity condition in the convex optimization literature (Nesterov 2013). Firstly, in (12), the gradient is w.r.t the second component, while the Lipschitz continuity is w.r.t the first component. Secondly, the property holds only for fixed \(\beta ^*\) and any \(\beta\), while the Lipschitz continuity is for all \(\beta , \beta '\in {\mathcal {B}}\).
Definition 3
(\(\mu\)-smooth) \(Q(\cdot ; \beta ^*)\) is \(\mu\)-smooth, that is if for any \(\beta , \beta '\in {\mathcal {B}}\), \(Q(\beta ;\beta ^*)\ge Q(\beta '; \beta ^*)+(\beta -\beta ')^T\nabla Q(\beta ';\beta ^*)-\frac{\mu }{2}\Vert \beta '-\beta \Vert _2^2.\)
Definition 4
(\(\upsilon\)-strongly concave) \(Q(\cdot ; \beta ^*)\) is \(\upsilon\)-strongly concave, that is if for any \(\beta , \beta '\in {\mathcal {B}}\), \(Q(\beta ;\beta ^*)\le Q(\beta '; \beta ^*)+(\beta -\beta ')^T\nabla Q(\beta ';\beta ^*)-\frac{\upsilon }{2}\Vert \beta '-\beta \Vert _2^2.\)
Next, we assume that each coordinate of \(\nabla q(\beta ; \beta )\) in (2) is sub-exponential for every \(\beta \in {\mathcal {B}}\), where \(\nabla\) is the derivative of q w.r.t the first component.
Definition 5
A random variable X with mean \({\mathbb {E}}(X)\) is \(\xi\)-sub-exponential for \(\xi >0\) if for all \(|t|<\frac{1}{\xi }\), \({\mathbb {E}}\{\exp \bigg (t[X-{\mathbb {E}}(X)] \bigg )\}\le \exp (\frac{\xi ^2t^2}{2}).\)
Assumption 1
We assume that \(Q(\cdot ; \cdot )\) in (3) is self-consistent, Lipschitz–Gradient-2(\(\gamma , {\mathcal {B}}\)), \(\mu\)-smooth and \(\upsilon\)-strongly concave for some \({\mathcal {B}}\). Moreover, we assume that for any fixed \(\beta \in {\mathcal {B}}\) with \(\Vert \beta \Vert _0\le s\) (where the value of s will be specified later) and \(\forall j\in [d]\), the j-th coordinate of \(\nabla q(\beta ; \beta )\) (i.e., \([\nabla q(\beta ; \beta )]_j\)) is \(\xi\)-sub-exponential, and for each \(i\in [n]\), \([\nabla q_i(\beta , \beta )]_j\) is independent of the others.
We note that the sub-exponential assumption on each coordinate is stronger than the assumption of Statistical-Error in Wang et al. (2015); Balakrishnan et al. (2017b). However, since the model considered in this paper could have arbitrarily corrupted samples, we will see later that this assumption is necessary.
Finally, we give the definition of the corruption model studied in the paper.
Definition 6
(\(\epsilon\)-corrupted samples) Let \(\{y_1, y_2, \ldots , y_n\}\) be n i.i.d. observations with distribution P. We say that a collection of samples \(\{z_1, z_2, \ldots , z_n\}\) is \(\epsilon\)-corrupted if an adversary chooses an arbitrary \(\epsilon\)-fraction of the samples in \(\{y_i\}_{i=1}^n\) and modifies them with arbitrary values.
We note that this is a quite common model in robust estimation and robust statistics. Equivalently, it means that an \(\epsilon\)-fraction of the samples in the dataset are outliers (i.e., they may be corrupted arbitrarily).
4 Trimmed expectation maximization algorithm
To obtain a robust estimator for the high dimensional model with \(\epsilon\)-corrupted samples, we propose a trimmed EM algorithm, which is based on the gradient EM algorithm. See Algorithm 1 for details.
Note that compared with the previous gradient EM algorithm, the Trimmed EM algorithm has two additional steps in each iteration, i.e., the gradient trimming step and the hard thresholding step. For the gradient trimming step (step 4 in Algorithm 1), we apply the dimensional \(\alpha\)-trimmed estimator (i.e., \(\text {D-Trim}_{\alpha }\)) to the gradients \(\{\nabla q_i(\beta ^t; \beta ^t)\}_{i=1}^n\). We note that while this operator has also been studied in Liu et al. (2019); Yin et al. (2018) for M-estimators, we use it for the EM algorithm. The function \(\text {D-Trim}_{\alpha }(\cdot )\) is defined as follows.
Definition 7
(Dimensional \(\alpha\)-trimmed estimator) Given a set of \(\epsilon\)-corrupted samples in the form of d-dimensional vectors \(\{z_i\}_{i=1}^n\), the D-Trim operator \(\text {D-Trim}_{\alpha }(\{z_i\}_{i=1}^n)\in {\mathbb {R}}^d\) performs as follows. For each dimension \(j\in [d]\), it first removes the largest and the smallest \(\alpha\) fraction of elements in the j-th coordinate of \(\{z_i\}_{i=1}^n\), i.e., \(\{z_{i,j}\}_{i=1}^n\), and then calculates the mean of the remaining terms, where \(\alpha =c_0\epsilon\) and \(\alpha \le \frac{1}{2}-c_1\) for some constant \(c_0\ge 1\) and a small constant \(c_1\).
The rationale behind the use of the dimensional trimmed estimator is that, due to the existence of an \(\epsilon\) fraction of corrupted samples, directly calculating the mean of the gradients could introduce a large error with respect to the population gradient \(\nabla Q(\beta ^t; \beta ^t)\) in (3). Moreover, it can be shown that if each coordinate of \(\nabla q_i(\beta ^t; \beta ^t)\) is sub-exponential, the trimmed estimator will be robust against the \(\epsilon\)-corruption for some small \(\epsilon\). This motivates us to use the dimensional trimming operation.
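For concreteness, a minimal numpy sketch of the \(\text {D-Trim}_{\alpha }\) operator of Definition 7 is given below (the paper does not prescribe an implementation, so this is only one possible realization).

```python
import numpy as np

def d_trim(Z, alpha):
    """Dimensional alpha-trimmed estimator (Definition 7), sketch.

    Z: n x d matrix whose rows are the (possibly corrupted) vectors z_i.
    For each coordinate j, drop the largest and smallest alpha-fraction of
    {z_{i,j}}_{i=1..n} and return the mean of the remaining entries.
    """
    n = Z.shape[0]
    k = int(np.floor(alpha * n))           # number of entries trimmed at each end
    Z_sorted = np.sort(Z, axis=0)          # sort every coordinate independently
    return Z_sorted[k:n - k].mean(axis=0)  # coordinate-wise trimmed mean
```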
To ensure the sparsity of our estimator, after obtaining \(\beta ^{t+0.5}\), we apply the hard thresholding operation (Blumensath and Davies 2009). More specifically, we first find the set \(\hat{{\mathcal {S}}}^{t+0.5}\subseteq [d]\) of indices j corresponding to the top s largest \(|\beta ^{t+0.5}_j|\) (we denote \(\hat{{\mathcal {S}}}^{t+0.5}=\text {supp}(\beta ^{t+0.5}, s)\)Footnote 4), and then set the remaining entries \(\beta ^{t+0.5}_j\) for \(j\in [d]\backslash \hat{{\mathcal {S}}}^{t+0.5}\) to 0 (we denote \(\beta ^{t+1}=\text {trunc}(\beta ^{t+0.5}, \hat{{\mathcal {S}}}^{t+0.5})\)Footnote 5). The sparsity level s controls the sparsity of the estimator and the estimation error.
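The hard thresholding step, i.e., \(\text {trunc}(\beta ^{t+0.5}, \text {supp}(\beta ^{t+0.5}, s))\), can be sketched in a few lines (again only an illustration):

```python
import numpy as np

def hard_threshold(beta, s):
    """Keep the s entries of beta with the largest magnitude and zero out the rest.

    Equivalent to trunc(beta, supp(beta, s)) in the notation of the paper.
    """
    support = np.argsort(np.abs(beta))[-s:]   # supp(beta, s): indices of the top-s |beta_j|
    out = np.zeros_like(beta)
    out[support] = beta[support]              # trunc(beta, support)
    return out
```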
The following main theorem shows that under Assumption 1 and with some proper initial vector \(\beta ^{\text {init}}\), the estimator \(\beta ^T\) converges to the underlying \(\beta ^*\) at a geometric rate with high probability.
Theorem 1
Let \({\mathcal {B}}=\{\beta : \Vert \beta -\beta ^*\Vert _2\le R\}\) be a set with \(R=k\Vert \beta ^*\Vert _2\) for some \(k\in (0,1)\). Assume that Assumption 1 holds for parameters \({\mathcal {B}}, \gamma , \mu , \upsilon , \xi\) satisfying the condition of \(1-2\frac{\upsilon -\gamma }{\upsilon +\mu }\in (0,1)\) and the sparsity parameter s is chosen to be
where C is some absolute constant. Also, assume that \(\Vert \beta ^{\text {init}}-\beta ^*\Vert _2\le \frac{R}{2}\) and there exist some absolute constants \(C_1\) and \(C_2\) satisfying the condition of
Then, if taking \(\eta =\frac{2}{\upsilon +\mu }\) in Algorithm 1, the following holds for \(t=1, \dots , T\) with probability at least \(1-Td^{-3}\)
In the above theorem, assumption (13) indicates that the sparsity level s in Algorithm 1 should be sufficiently large but still of the same order as the underlying sparsity \(s^*\). Although s looks quite complex, our experiments show that it is sufficient to set \(s=s^*\). Assumption (14) suggests that in order to ensure an upper bound in the hard thresholding step, we need \(\sqrt{s^*}\xi (\epsilon \log (nd)+\sqrt{\frac{\log d}{n}})\le O(\Vert \beta ^*\Vert _2)\), which means that n should be sufficiently large and the fraction of corruption \(\epsilon\) cannot be too large. The error bound in (15) consists of three types of errors. The first is caused by optimization, and it decreases to zero at a geometric rate of convergence. The second is the term related to \(\epsilon\) [i.e., \(O(\xi \sqrt{s^*}\epsilon \log (nd))\)], which is caused by estimating the population gradient via the trimming step due to the \(\epsilon\)-corrupted samples; in the special case of no corrupted samples (i.e., \(\epsilon =0\)), this term vanishes. The third is the term \(O\bigg (\xi \sqrt{\frac{s^*\log d}{n}}\bigg )\), which corresponds to the statistical error. It is independent of both \(\epsilon\) and t and depends only on the model itself. Even though Theorem 1 requires the initial estimator to be close enough to the optimal one, our experiments show that the algorithm actually performs quite well under random initialization.
From Theorem 1, we can also see that when the fraction of corruption \(\epsilon\) is sufficiently small such that \(\epsilon \le O\bigg (\frac{1}{\sqrt{n\log (nd)}}\bigg )\) and the iteration number is sufficiently large, the error bound in (15) becomes \(O\bigg (\xi \sqrt{\frac{s^*\log d }{n}}\bigg )\), which is the same as the optimal rate of estimating a high dimensional sparse vector when \(\xi\) is a constant. This means that our method achieves the same rate as the non-corrupted methods in Wang et al. (2015). A similar corruption rate has also appeared in corrupted sparse linear regression (Dalalyan and Thompson 2019; Liu et al. 2019). Also, we can see that when \(\alpha =0\), our algorithm reduces to the high dimensional gradient EM algorithm in Wang et al. (2015).
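Putting the pieces together, one run of Algorithm 1 as described above trims the per-sample gradients, takes a gradient step, and then hard-thresholds. The sketch below is an illustration only; it assumes the hypothetical `d_trim` and `hard_threshold` helpers sketched earlier and a model-specific per-sample gradient callback `grad_q`.

```python
def trimmed_gradient_em(grad_q, data, beta_init, eta, alpha, s, n_iters):
    """Sketch of Trimmed Gradient EM (Algorithm 1), assuming the helpers above.

    grad_q(beta, data): returns the n x d matrix of per-sample gradients
    grad q_i(beta; beta); d_trim and hard_threshold are the earlier sketches.
    """
    beta = beta_init.copy()
    for _ in range(n_iters):
        G = grad_q(beta, data)                 # E-step: per-sample gradients
        robust_grad = d_trim(G, alpha)         # trimming step: robust estimate of grad Q_n
        beta_half = beta + eta * robust_grad   # gradient M-step
        beta = hard_threshold(beta_half, s)    # hard thresholding to sparsity level s
    return beta
```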
5 Implications for some specific models
In this section, we apply our framework (i.e., Algorithm 1) to the models mentioned in Sect. 3. To obtain results for these models, we only need to find the corresponding \({\mathcal {B}}, \gamma , k, R, \upsilon , \mu , \xi\) to ensure that Assumption 1 and assumptions in Theorem 1 hold.
5.1 Corrupted Gaussian mixture model
The following lemma, which was given in Balakrishnan et al. (2017b), ensures the properties of Lipschitz–Gradient-2(\(\gamma , {\mathcal {B}}\)), smoothness and strong concavity for model (4). It is easy to show that the model is self-consistent (Yi and Caramanis 2015).
Lemma 1
(Balakrishnan et al. 2017b; Yi and Caramanis 2015) If \(\frac{\Vert \beta ^*\Vert _2}{\sigma }\ge r\), where r is a sufficiently large constant denoting the minimum signal-to-noise ratio (SNR), then there exists an absolute constant \(C>0\) such that the properties of self-consistent, Lipschitz–Gradient-2(\(\gamma , {\mathcal {B}})\), \(\mu\)-smoothness and \(\upsilon\)-strongly concave hold for function \(Q(\cdot ; \cdot )\) with \(\gamma =\exp (-Cr^2), \mu =\upsilon =1, R=k\Vert \beta ^*\Vert _2, k=\frac{1}{4}, \text { and } {\mathcal {B}}=\{\beta :\Vert \beta -\beta ^*\Vert _2\le R\}.\)
Lemma 2
With the same notations as in Lemma 1, for each \(\beta \in {\mathcal {B}}\) with \(\Vert \beta \Vert _0\le s\), the j-th coordinate of \(\nabla q_i(\beta ; \beta )\) is \(\xi\)-sub-exponential with
where \(C_1\) is some absolute constant. Also, each \([\nabla q_i(\beta ;\beta )]_j\), where \(i\in [n]\), is independent of others for any fixed \(j\in [d]\).
Theorem 2
In an \(\epsilon\)-corrupted high dimensional Gaussian Mixture Model with \(\epsilon\) satisfying the condition of
if \(\frac{\Vert \beta ^*\Vert _2}{\sigma }\ge r\) for some sufficiently large constant r denoting the minimum SNR and the initial estimator \(\beta ^{\text {init}}\) satisfies the inequality of \(\Vert \beta ^{\text {init}}-\beta ^*\Vert _2\le \frac{1}{8}\Vert \beta ^*\Vert _2,\) then the output \(\beta ^T\) of Algorithm 1 after choosing \(s=O(s^*)\) and \(\eta =O(1)\) satisfies the following with probability at least \(1-Td^{-3}\)
where C is some absolute constant.
From Theorem 2, we can see that when \(\epsilon \le {\tilde{O}}\left( \frac{1}{\sqrt{n}}\right)\) and \(T=O\left( \log \frac{n}{s^*\log d}\right)\), the output achieves an estimation error of \(O\left( \sqrt{\frac{s^*\log d}{n}}\right)\), which matches the best-known error bound of the no-outlier case (Yi and Caramanis 2015; Wang et al. 2015). Also, we assume that the SNR is large, which is reasonable since it has been shown that for Gaussian Mixture Model with low SNR, the variance of noise makes it harder for the algorithm to converge (Ma et al. 2000).
5.2 Corrupted mixture of regressions model
The following lemma, which was given in Balakrishnan et al. (2017b); Yi and Caramanis (2015), shows the properties of Lipschitz–Gradient-2(\(\gamma , {\mathcal {B}}\)), smoothness and strong concavity for model (6).
Lemma 3
(Balakrishnan et al. 2017b; Yi and Caramanis 2015) If \(\frac{\Vert \beta ^*\Vert _2}{\sigma }\ge r\), where r is a sufficiently large constant denoting the required minimal signal-to-noise ratio (SNR), then function \(Q(\cdot ; \cdot )\) of the Mixture of Regressions Model has the properties of self-consistency, Lipschitz–Gradient-2(\(\gamma , {\mathcal {B}})\), \(\mu\)-smoothness, and \(\upsilon\)-strong concavity with \(\gamma \in (0,\frac{1}{4}), \mu =\upsilon =1, {\mathcal {B}}=\{\beta : \Vert \beta -\beta ^*\Vert _2\le R\}, R=k\Vert \beta ^*\Vert _2\), and \(k=\frac{1}{32}.\)
Lemma 4
With the same notations as in Lemma 3, for each \(\beta \in {\mathcal {B}}\) and \(\Vert \beta \Vert _0=s\), the j-th coordinate of \(\nabla q_i(\beta ; \beta )\) is \(\xi\)-sub-exponential with
where \(C>0\) is some absolute constant. Also, each \([\nabla q_i(\beta ;\beta )]_j\), where \(i\in [n]\), is independent of others for any fixed \(j\in [d]\).
Theorem 3
In an \(\epsilon\)-corrupted high dimensional Mixture of Regressions Model with \(\epsilon\) satisfying the condition of
if \(\frac{\Vert \beta ^*\Vert _2}{\sigma }\ge r\) for some sufficiently large constant r denoting the minimum SNR and the initial estimator \(\beta ^{\text {init}}\) satisfies the inequality of \(\Vert \beta ^{\text {init}}-\beta ^*\Vert _2\le \frac{1}{64}\Vert \beta ^*\Vert _2,\) then the output \(\beta ^T\) of Algorithm 1 after choosing \(s=O(s^*)\) and \(\eta =O(1)\) satisfies the following with probability at least \(1-Td^{-3}\)
where \(\gamma \in (0, \frac{1}{4})\) is a constant.
Note that in the above theorem, when \(\epsilon \le {\tilde{O}}\left( \frac{1}{\sqrt{n}}\right)\) and \(T=O\left( \log \frac{\sqrt{n}}{\sqrt{\log d} s^*}\right)\), the estimation error becomes \(O\left( s^*\sqrt{\frac{\log d}{n}}\right)\), which differs from the \(O\left( \sqrt{\frac{s^*\log d}{n}}\right)\) minimax lower bound by only a factor of \(\sqrt{s^*}\). We leave closing this gap as an open problem. Recently, Chen et al. (2018) showed that in the no-outlier and low dimensional setting, an assumption of \(SNR\ge \rho\) for some constant \(\rho\) is necessary for achieving the optimal rate \(\varTheta \left( \sqrt{\frac{d}{n}}\right)\).
5.3 Corrupted linear regression with missing covariates
Lemma 5
(Balakrishnan et al. 2017b; Yi and Caramanis 2015) If \(\frac{\Vert \beta ^*\Vert _2}{\sigma }\le r\) and \(p_m<\frac{1}{1+2b+2b^2}\), where r is a constant denoting the required maximum signal-to-noise ratio (SNR) and \(b=r^2(1+k)^2\) for some constant \(k\in (0,1)\), then function \(Q(\cdot ; \cdot )\) of the linear regression with missing covariates has the properties of self-consistency, Lipschitz–Gradient-2(\(\gamma , {\mathcal {B}})\), \(\mu\)-smoothness and \(\upsilon\)-strong concavity with
Lemma 6
With the same assumptions as in Lemma 5, for each \(\beta \in {\mathcal {B}}\) with \(\Vert \beta \Vert _0=s\), \([\nabla q_i(\beta ; \beta )]_j\) is \(\xi\)-sub-exponential with
for some constant \(C>0\). Also, each \([\nabla q_i(\beta ;\beta )]_j\), where \(i\in [n]\), is independent of others for any fixed \(j\in [d]\).
Theorem 4
In an \(\epsilon\)-corrupted high dimensional linear regression with missing covariates model with \(\epsilon\) satisfying the condition of
for some \(k\in (0,1)\), if \(\Vert \beta ^{\text {init}}-\beta ^*\Vert _2\le \frac{k\Vert \beta ^*\Vert _2^2}{2}\) and the assumptions in Lemma 5hold, then, the output \(\beta ^T\) of Algorithm 1 after taking \(s=O(s^*)\) and \(\eta =O(1)\) satisfies the following with probability at least \(1-Td^{-3}\)
where the Big-O term hides the terms of k and r.
Note that similar to the mixture of regressions model, when \(\epsilon \le {\tilde{O}}\left( \frac{1}{\sqrt{n}}\right)\), the estimation error is \(O\left( s^*\sqrt{\frac{\log d}{n}}\right)\), which is only a factor of \(\sqrt{s^*}\) away from the optimal. However, unlike the previous two models, we assume here that SNR is upper bounded by some constant which is unavoidable as pointed out in Loh and Wainwright (2011).
6 Experiments
In this section, we empirically study the performance of Algorithm 1 on the three models mentioned in the previous section. Since this paper mainly focuses on the statistical setting and its theoretical behavior, we only run our algorithm on synthetic data. Note that previous papers on the statistical guarantees of the EM algorithm, such as Balakrishnan et al. (2017b), Wang et al. (2015), and Yi and Caramanis (2015), also report experiments on synthetic data only. Thus, experiments on synthetic data suffice for our purposes.
For each of these models, we generate synthetic datasets according to the underlying distribution. We use \(\Vert \beta -\beta ^*\Vert _2\) to measure the estimation error, and test how it is affected by different parameter settings from two aspects. Firstly, we examine how the underlying sparsity parameter \(s^*\) of the model affects the estimation error and whether the behavior is consistent with our theoretical results. Secondly, we test how the corruption fraction \(\epsilon\) of the data and the dimensionality d affect the convergence rate, as well as the estimation error. For each experiment, the data is corrupted as follows: we first randomly choose an \(\epsilon\) fraction of the input data and then add Gaussian noise to each of these data samples. The noise is sampled from a multivariate Gaussian distribution \({\mathcal {N}}(0, 50 \Vert X\Vert _\infty I_d)\). All experiments are repeated for 20 runs and the average results are reported.
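For concreteness, the corruption procedure just described can be sketched as follows (a hypothetical helper, not the authors' code).

```python
import numpy as np

def corrupt(Y, eps, scale=50.0, rng=None):
    """Corrupt an eps-fraction of the rows of Y with additive Gaussian noise.

    The noise of each chosen sample is drawn from N(0, scale * ||Y||_inf * I_d),
    matching the corruption used in the experiments.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = Y.shape
    idx = rng.choice(n, size=int(eps * n), replace=False)  # randomly chosen eps-fraction
    std = np.sqrt(scale * np.max(np.abs(Y)))               # covariance 50 * ||Y||_inf * I_d
    Y_corrupted = Y.copy()
    Y_corrupted[idx] += rng.normal(0.0, std, size=(len(idx), d))
    return Y_corrupted
```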
Parameter setting Throughout the experiments we will follow the setting of the previous related works on high dimensional EM algorithms which have statistical guarantees but are not corruption-proofing (Zhu et al. 2017; Wang et al. 2015; Yi and Caramanis 2015). We fix the dataset size n to be 2000, because using a larger n does not exhibit significant difference. For each model, the experiment is divided into three parts as mentioned previously: The first one (Fig. 2) measures \(\Vert \beta -\beta ^*\Vert _2\) versus \(\sqrt{n/(s^*\log d)}\) by varying \(s^*\) from 3 to 15, with d fixed to be 100, which follows the previous works (Wang et al. 2015; Zhu et al. 2017); The second one (Fig. 3) examines the convergence behavior under different corruption rate \(\epsilon\) which varies from 0 to 0.2; The last one (Fig. 4) shows the convergence behavior under different data dimensionality d which ranges from 80 to 240, with fixed \(\epsilon =0.2\).
For each experiment, instead of choosing initial vectors that are close to the optimal ones, we use random initialization. We set \(s=s^*\) in our algorithm, which is also done in the previous methods. Besides the parameter s, two other parameters of the algorithm need to be specified: the D-Trim parameter \(\alpha\) and the step size \(\eta\). We also need to set the “noise level” for each of the three models, which is quantified by \(\sigma\) in their definitions. The choices of these parameters are quite flexible; a hypothetical driver combining these settings is sketched after the list below.
- GMM: Corrupted Gaussian Mixture Model (4). We fix \(\sigma\) to 0.5, \(\alpha\) to 0.2 and \(\eta\) to 0.1.
- MRM: Corrupted Mixture of Regressions Model (6). We fix \(\sigma\) to 0.2, \(\alpha\) to 0.2 and \(\eta\) to 0.1.
- RMC: Corrupted Linear Regression with Missing Covariates Model (8). We set \(\sigma =0.1\), \(\alpha =0.3\), and the missing probability \(p_m=0.1\), but use three different step sizes \(\eta =0.05, 0.1, 0.08\) for the three parts of the experiment, respectively.
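As a usage illustration (hypothetical values and helpers; not the authors' script), the GMM part of the experiment with the settings above could be driven roughly as follows, combining the `corrupt`, `grad_q_gmm`, and `trimmed_gradient_em` sketches from earlier sections.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s_star = 2000, 100, 5                    # s_star = 5 is one illustrative value in [3, 15]
sigma, alpha, eta, eps = 0.5, 0.2, 0.1, 0.05   # GMM settings; eps is one illustrative corruption level

beta_star = np.zeros(d)
beta_star[:s_star] = 1.0                       # a hypothetical s_star-sparse ground truth
z = rng.choice([-1.0, 1.0], size=n)            # Rademacher latent labels
Y = z[:, None] * beta_star + rng.normal(0.0, sigma, size=(n, d))   # model (4)
Y = corrupt(Y, eps, rng=rng)                   # eps-corrupted samples

beta_init = rng.normal(0.0, 1.0, size=d)       # random initialization, as in the experiments
beta_hat = trimmed_gradient_em(lambda b, data: grad_q_gmm(b, data, sigma),
                               Y, beta_init, eta, alpha, s_star, n_iters=100)
print(np.linalg.norm(beta_hat - beta_star))    # estimation error ||beta_hat - beta*||_2
```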
Results We first show that the classical high dimensional gradient EM algorithm in Wang et al. (2015) is not robust to corruptions. We run that algorithm on the three models, tuning the parameters to be optimal as shown in Wang et al. (2015), and test it w.r.t \(\sqrt{n/(s^*\log d)}\), the number of iterations, and different dimensions d.
As we can see from Fig. 1, in all three models the algorithm performs quite well when there is no corruption (\(\epsilon =0\)), as has also been shown in previous papers (Wang et al. 2015; Zhu et al. 2017). However, when an \(\epsilon =0.05\) fraction of the samples is corrupted, the classical high dimensional EM algorithm incurs a large estimation error. These results motivate us to design robust high dimensional EM algorithms that also have provable statistical guarantees.
Next, we show the performance of our Algorithm 1. For the first part (Fig. 2), we can see that when \(\epsilon\) is small, the final estimation error in each of the three models decreases when the term \(\sqrt{n/(s^*\log d)}\) increases, as predicted by Theorem 2. But when \(\epsilon\) is relatively large, the trend becomes less obvious for the Gaussian Mixture Model and the Mixture of Regressions model, because now the factor \(\epsilon \log (nd)\) comes into play.
Figure 3 shows that our algorithm achieves linear convergence on all three models and for all values of \(\epsilon\), but the final converged error is heavily affected by \(\epsilon\), especially for the Gaussian Mixture and Linear Regression with Missing Covariates models. Moreover, when \(\epsilon\) is small, the estimation errors are comparable to or even the same as the non-corrupted ones; this is reasonable since the algorithm is theoretically robust when \(\epsilon\) is small. In the third part of the experiments (Fig. 4), varying d does not seem to affect the convergence behavior much, which is reasonable as the error bound depends on d only logarithmically and thus changes fairly slowly. These results support Theorem 1.
All the results show that our algorithm is robust to some level of corruption while achieving an estimation error comparable to the non-corrupted case.
7 Conclusion
In this paper we study the problem of estimating latent variable models with arbitrarily corrupted samples in the high dimensional sparse case and propose a method called Trimmed Gradient Expectation Maximization. Specifically, we show that our algorithm is corruption-proof and achieves the (near) optimal statistical rate for several statistical models under some level of corruption. Experimental results support our theoretical analysis and show that our algorithm is indeed robust to corrupted samples.
There are still many open problems. Firstly, all of our theoretical guarantees require the initial parameter to be close enough to the underlying parameter, which is quite a strong assumption; how can it be relaxed? Secondly, the three specific models considered in this paper are quite simple; can we generalize to richer models such as the multi-component Gaussian Mixture Model or Mixture of Linear Regressions Model? Thirdly, we assume that the sparsity of the underlying parameter is known; how should the case where it is unknown be handled?
Notes
Since the high dimensional sparse case is much harder than the low dimensional case, we focus on the former; our algorithm can be easily extended to the low dimensional case by using the results in Balakrishnan et al. (2017b). Due to the space limit, we omit this in the paper.
For a vector \(v\in {\mathbb {R}}^d\), \(\Vert v\Vert _0\) represents the number of entries in v that are non-zero.
\(\langle \cdot , \cdot \rangle\) represents the inner product of two vectors.
In general, given a vector \(v\in {\mathbb {R}}^d\) and an integer s, the function \(\text {supp}(v,s)\) returns the set of s indices corresponding to the top s largest values among \(\{|v_j|, j\in [d]\}\).
In general, given a vector \(v\in {\mathbb {R}}^d\) and a set of indices \({\mathcal {S}}\subseteq [d]\), the function \(\text {trunc}(v, {\mathcal {S}})\in {\mathbb {R}}^d\) is defined by \([\text {trunc}(v, {\mathcal {S}})]_j=v_j\) if \(j\in {\mathcal {S}}\) and \([\text {trunc}(v, {\mathcal {S}})]_j=0\) otherwise.
References
Aitkin, M., & Wilson, G. T. (1980). Mixture models, outliers, and the EM algorithm. Technometrics, 22(3), 325–331.
Alistarh, D., Allen-Zhu, Z., & Li, J. (2018). Byzantine stochastic gradient descent. In Advances in neural information processing systems, pp 4613–4623.
Balakrishnan, S., Du, S.S., Li, J., & Singh, A. (2017a). Computationally efficient robust sparse estimation in high dimensions. In Conference on learning theory, pp 169–212.
Balakrishnan, S., Wainwright, M. J., Yu, B., et al. (2017b). Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1), 77–120.
Blumensath, T., & Davies, M. E. (2009). Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3), 265–274.
Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford: Oxford University Press.
Chen, Y., Caramanis, C., & Mannor, S. (2013). Robust sparse regression under adversarial corruption. In International conference on machine learning, pp 774–782.
Chen, Y., Su, L., & Xu, J. (2017). Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on measurement and analysis of computing systems, 1(2), 44.
Chen, Y., Yi, X., & Caramanis, C. (2018). Convex and nonconvex formulations for mixed regression with two components: Minimax optimal rates. IEEE Transactions on Information Theory, 64(3), 1738–1766.
Dalalyan, A.S., & Thompson, P. (2019). Outlier-robust estimation of a sparse linear model using \(\ell _1\)-penalized huber’s \(m\)-estimator. arXiv preprint arXiv:1904.06288.
Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1), 20–28.
Diakonikolas, I., Kamath, G., Kane, D.M., Li, J., Moitra, A., & Stewart, A. (2016). Robust estimators in high dimensions without the computational intractability. In 2016 IEEE 57th annual symposium on foundations of computer science (FOCS), IEEE, pp 655–664.
Diakonikolas, I., Kane, D.M., & Stewart, A. (2017). Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In 2017 IEEE 58th annual symposium on foundations of computer science (FOCS), IEEE, pp 73–84.
Diakonikolas, I., Kane, D.M., & Stewart, A. (2018). List-decodable robust mean estimation and learning mixtures of spherical gaussians. In Proceedings of the 50th annual ACM SIGACT symposium on theory of computing, ACM, pp 1047–1060.
Du, S.S., Balakrishnan, S., & Singh, A. (2017). Computationally efficient robust estimation of sparse functionals. arXiv preprint arXiv:1702.07709.
Faria, S., & Gonçalves, F. (2013). Financial data modeling by Poisson mixture regression. Journal of Applied Statistics, 40(10), 2150–2162.
Holland, M.J. (2018). Robust descent using smoothed multiplicative noise. arXiv preprint arXiv:1810.06207.
Huber, P. J. (2011). Robust statistics. Berlin: Springer.
Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pp 315–323.
Laird, N.M. (2010). The em algorithm in genetics, genomics and public health. Statistical Science, 25(4), 450–457.
Li, J. (2017). Robust sparse estimation tasks in high dimensions. arXiv preprint arXiv:1702.05860.
Liu, L., Li, T., & Caramanis, C. (2019). High dimensional robust estimation of sparse models via trimmed hard thresholding. arXiv preprint arXiv:1901.08237.
Loh, P.L., & Wainwright, M.J. (2011). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in neural information processing systems, pp 2726–2734.
Ma, J., Xu, L., & Jordan, M. I. (2000). Asymptotic convergence rate of the EM algorithm for Gaussian mixtures. Neural Computation, 12(12), 2881–2907.
McLachlan, G., & Krishnan, T. (2007). The EM algorithm and extensions (Vol. 382). Hoboken: Wiley.
Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course (Vol. 87). Berlin: Springer Science & Business Media.
Prasad, A., Suggala, A.S., Balakrishnan, S., & Ravikumar, P. (2018). Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485.
Suggala, A.S., Bhatia, K., Ravikumar, P., & Jain, P. (2019). Adaptive hard thresholding for near-optimal consistent robust regression. arXiv preprint arXiv:1903.08192.
Thompson, P., & Dalalyan, A.S. (2018). Restricted eigenvalue property for corrupted gaussian designs. arXiv preprint arXiv:1805.08020.
Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
Wang, Z., Gu, Q., Ning, Y., Liu, H. (2015). High dimensional EM algorithm: Statistical optimization and asymptotic normality. In Advances in neural information processing systems, pp 2521–2529.
Wu, C. J., et al. (1983). On the convergence properties of the EM algorithm. The Annals of statistics, 11(1), 95–103.
Yang, M. S., Lai, C. Y., & Lin, C. Y. (2012). A robust EM clustering algorithm for Gaussian mixture models. Pattern Recognition, 45(11), 3950–3961.
Yi, X., & Caramanis, C. (2015). Regularized em algorithms: A unified framework and statistical guarantees. In Advances in neural information processing systems, pp 1567–1575.
Yin, D., Chen, Y., Ramchandran, K., & Bartlett, P. (2018). Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498.
Zhu, R., Wang, L., Zhai, C., & Gu, Q. (2017). High-dimensional variance-reduced stochastic gradient expectation-maximization algorithm. In Proceedings of the 34th international conference on machine learning Volume 70, JMLR. org, pp 4180–4188.
Funding
The funding was provided by National Science Foundation (Grant Nos. IIS-1910492, CCF-1716400).
Author notes
Editors: Kee-Eung Kim, Vineeth N. Balasubramanian.
Appendices
Auxiliary lemmas
In this section, we introduce prerequisite knowledge and technical lemmas in order to prove the main results.
In order to analyze the dimensional \(\alpha\)-trimmed estimator, we first give some results for the one-dimensional trimmed mean estimator, denoted \(\text {trmean}_\alpha (\cdot )\).
Definition 8
Given a set of \(\epsilon\)-corrupted samples \(\{z_i\}_{i=1}^n\subseteq {\mathbb {R}}\), the trimmed mean estimator \(\text {trmean}_\alpha (\{z_i\}_{i=1}^n)\in {\mathbb {R}}\) removes the largest and smallest \(\alpha\) fraction of elements in \(\{z_i\}_{i=1}^n\) and calculates the mean of the remaining terms. We choose \(\alpha =c_0\epsilon\) for some constant \(c_0\ge 1\). We also require that \(\alpha \le \frac{1}{2}-c_1\) for some small constant \(c_1>0\).
For the 1-dimensional trimmed mean estimator, we have the following upper bound on the error w.r.t the population mean.
Lemma 7
[Lemma A.2 in Liu et al. (2019)] Let \(\{z_i\}_{i=1}^n\subset {\mathbb {R}}^d\) be \(n=\varOmega (\log d)\) \(\epsilon\)-corrupted samples. If, for each \(j\in [d]\), the j-th coordinates of the samples \(\{z_{i, j}\}_{i=1}^n\) are i.i.d. \(\xi\)-sub-exponential with mean \(\mu ^j\), then after using the dimensional \(\alpha\)-trimmed mean estimator, the following upper bound on the error holds with probability at least \(1-d^{-3}\), for every \(j\in [d]\)
where \(C_2\) is some constant dependent on \(c_1\).
Next, we provide some symmetrization results of random variables, which will be used in our proofs. See Boucheron et al. (2013) for details.
Lemma 8
Let \(y_1, y_2, \ldots , y_n\) be the n independent realizations of the random vector \(Y\in {\mathcal {Y}}\), and \({\mathcal {F}}\) be a function class defined on \({\mathcal {Y}}\). For any increasing convex function \(\phi (\cdot )\), the following holds
where \(\epsilon _1, \ldots , \epsilon _n\) are i.i.d. Rademacher random variables that are independent of \(y_1, \ldots , y_n\).
Lemma 9
Let \(y_1, \ldots , y_n\) be n independent realization of the random vector \(Z\in {\mathcal {Z}}\) and \({\mathcal {F}}\) be a function class defined on \({\mathcal {Z}}\). If Lipschitz functions \(\{\phi _i(\cdot )\}_{i=1}^n\) satisfy the following for all \(v, v'\in {\mathbb {R}}\)
and \(\phi _i(0)=0\), then for any increasing convex function \(\phi (\cdot )\), the following holds
where \(\epsilon _1, \ldots , \epsilon _n\) are i.i.d. Rademacher random variables that are independent of \(y_1, \ldots , y_n\).
Finally we recall some definitions and lemmas on the sub-exponential and sub-Gaussian random variables. See Vershynin (2010) for details.
Definition 9
For a sub-exponential random vector X, its sub-exponential norm \(\Vert X\Vert _{\psi _1}\) is defined as
Lemma 10
Let X be a zero-mean sub-exponential random variable, then there are absolute constants \(C, c>0\), such that when \(|t|\le \frac{c}{\Vert X\Vert _{\psi _1}}\) ,
Lemma 11
(Bernstein’s inequality) Let \(X_1,\ldots , X_n\) be n i.i.d. realizations of \(\upsilon\)-sub-exponential random variable X with mean \(\mu\). Then,
Definition 10
A random variable X is sub-Gaussian with variance \(\sigma ^2\) if for all \(t>0\), the following holds
Definition 11
For a sub-Gaussian random variable X, its sub-Gaussian norm \(\Vert X\Vert _{\psi _2}\) is defined as
Lemma 12
If X is sub-Gaussian or sub-exponential, then \(\Vert X-{\mathbb {E}} X\Vert _{\psi _2}\le 2\Vert X\Vert _{\psi _2}\) or \(\Vert X-{\mathbb {E}} X\Vert _{\psi _1}\le 2\Vert X\Vert _{\psi _1}\) holds, respectively.
Lemma 13
For two sub-Gaussian random variables \(X_1, X_2\), \(X_1\cdot X_2\) is a sub-exponential random variable with
Lemma 14
Let \(X_1, X_2, \ldots , X_k\) be k independent zero-mean sub-Gaussian random variables, and \(X= \sum _{j=1}^kX_j\). Then, X is sub-Gaussian with \(\Vert X\Vert _{\psi _2}^2\le C\sum _{j=1}^k\Vert X_j\Vert _{\psi _2}^2\) for some absolute constant \(C>0\).
Omitted proofs
1.1 Proof of Theorem 1
By Lemma 7 and our assumption on the \(\xi\)-sub-exponential property of each coordinate, we have the following in the t-th iteration with probability at least \(1-d^{-3}\) for some constant \(C_2>0\)
For convenience, we let \(\alpha =C_2 \xi (\epsilon \log (nd)+\sqrt{\frac{\log d}{n}})\), and assume that for all iterations \(t\in [T-1]\), event (26) holds (then all events hold simultaneously with probability at least \(1-Td^{-3}\)).
In the t-th iteration, we define
and
That is, \({\bar{\beta }}^{t+0.5}\) is the gradient update of \(\beta ^t\) w.r.t the non-corrupted population gradient of \(Q_n(\beta ^t; \beta ^t)\), and \({\bar{\beta }}^{t+1}\) is the estimation after truncating \({\bar{\beta }}^{t+0.5}\) w.r.t set \(\hat{{\mathcal {S}}}^{t+0.5}\), which is the set of the s-largest coordinates of \(\beta ^{t+0.5}\).
By the definition, we have the following inequalities
For the term A, we have
Thus, if \(\beta ^t\in {\mathcal {B}}\), i.e., \(\Vert \beta ^t-\beta ^*\Vert _2\le k\Vert \beta ^*\Vert _2\), and \(\Vert \beta ^t\Vert _0=s\), then by the assumption and (26), we have
Next, we will bound the term B. To do this, we need the following lemma, which follows (Wang et al. 2015).
Lemma 15
If
for some \(k\in (0,1)\) and
then, the following holds
Proof of Lemma 15
By assumption (32), we have
We then denote
and the sets \({\mathcal {I}}_1, {\mathcal {I}}_2\) and \({\mathcal {I}}_3\) as the follows
where \(S^*=\text {supp}(\beta ^*)\). Let \(s_i=|{\mathcal {I}}_i|\) for \(i=1, 2, 3\), respectively. Also, we define \(\varDelta =\langle {\bar{\theta }}, \theta ^*\rangle\). Note that
By Cauchy–Schwartz inequality, we have
Since \({\mathcal {I}}_3\subseteq \hat{{\mathcal {S}}}^{t+0.5}\) and \({\mathcal {I}}_1\bigcap \hat{{\mathcal {S}}}^{t+0.5}=\emptyset\), we have
We let \({\tilde{\epsilon }}=2\Vert {\bar{\theta }}-\theta \Vert _\infty =2\frac{\Vert {\bar{\beta }}^{t+0.5}-\beta ^{t+0.5}\Vert _\infty }{\Vert {\bar{\beta }}^{t+0.5}\Vert _2}\). Note that we have
which implies that
where inequality (a) is due to (40). Plugging (42) into (39), we have
Solving \(\Vert {\bar{\theta }}_{{\mathcal {I}}_1}\Vert _2\) in (43), we get
The final inequality is due to the inequality \(\frac{s_1}{s_3}\le \frac{s_1+s_2}{s_3+s_2}=\frac{s^*}{s}\), which follows from \(\frac{s^*}{s}\le \frac{(1-k)^2}{4(1+k)^2}\le 1\) and \(s_3\ge s-s^*\ge s^*\ge s_1\).
In the following, we will prove that the right hand side of (44) is upper bounded by \(\varDelta\). To achieve this, it is sufficient to show that
To prove (45), we first note that \(\sqrt{s^*}{\tilde{\epsilon }}\le \varDelta\), which is due to
where the second inequality is due to assumption (33) and the final inequality is due to
where inequality (a) is due to Assumption (32).
Now, we show that (45) holds. By (46), we have
which implies that \({\tilde{\epsilon }}\le \frac{\sqrt{s^*+s}}{s}\).
For the right hand side of (45), we have
Thus, in total, by (44) we can get
From (39), we can see that
that is,
Solving the above inequality, we get
where the final inequality is due to (44). Combining this with (44) and (52), we have
Now, by the definition of \({\bar{\theta }}\), we have
Therefore, we get
Let \(\chi =\Vert {\bar{\beta }}^{t+0.5}\Vert _2\Vert \beta ^*\Vert _2\). Then, by (55) and (53) we have
For the term \(\sqrt{\chi (1-\varDelta ^2)}\), we have
For the term \(\sqrt{\chi }{\tilde{\epsilon }}\), we have
Plugging (57) and (58) into (56), we get
Also, since \(\Vert {\bar{\beta }}^{t+1}\Vert _2^2+\Vert \beta ^*\Vert _2^2\le \Vert {\bar{\beta }}^{t+0.5}\Vert _2^2+\Vert \beta ^*\Vert _2^2\), subtracting (59), we obtain
Thus, we have
This completes the proof of Lemma 15. \(\square\)
Next, we bound the term \(\Vert {\bar{\beta }}^{t+0.5}-\beta ^*\Vert _2\) in (34).
Lemma 16
Under the assumptions in Theorem 1, the following inequality holds
Proof of Lemma 16
We first note that the self-consistent property in McLachlan and Krishnan (2007) implies that
which means that \(\beta ^*\) is a maximizer of \(Q(\beta ; \beta ^*)\). Thus, the proof follows from the convergence rate of gradient methods for strongly concave and smooth functions applied to \(Q(\beta ; \beta ^*)\) in Nesterov (2013). For the step size \(\eta =\frac{2}{\mu +\upsilon }\), we have
Thus, we get
Taking \(\eta =\frac{2}{\mu +\upsilon }\), we complete the proof. \(\square\)
Combining Lemmas 15, 16, and Eq. (31), we have the following lemma.
Lemma 17
If
for some \(k\in (0,1)\) and further assuming that
then it holds with probability at least \(1-d^{-3}\) that
where \(\alpha =C_2\xi (\epsilon \log (nd)+\sqrt{\frac{\log d}{n}})\).
We now prove Theorem 1.
Proof of Theorem 1
By Lemma 17, we know that it is sufficient to prove (68), which can be shown by mathematical induction.
We first prove \(\beta ^0\in {\mathcal {B}}\). By assumption, we have \(\Vert \beta ^{\text {init}}-\beta ^*\Vert _2\le \frac{R}{2}\). By the same proof of Lemma 15, we can get \(\Vert \beta ^0-\beta ^*\Vert _2\le (1+4\sqrt{\frac{s^*}{s}})^\frac{1}{2}\Vert \beta ^{\text {init}}-\beta ^*\Vert _2\le (1+4\sqrt{\frac{1}{4}})^\frac{1}{2}\frac{R}{2}\le R=k\Vert \beta ^*\Vert _2\). Thus, by Lemma 16, we can see that (68) holds for \(t=0\).
Now suppose that (68) holds for all \(t\le k\). Then, we have
by assumption we can see that \(\left(1+4\sqrt{\frac{s^*}{s}}\right)^{\frac{1}{2}}\left(1-2\frac{\upsilon -\gamma }{\upsilon +\mu }\right)\le \sqrt{1-2\frac{\upsilon -\gamma }{\upsilon +\mu }}\). Thus, we have
By the assumption of \(\frac{1}{\upsilon +\mu }\frac{(2\sqrt{s}+4\sqrt{2}\sqrt{s^*}/\sqrt{1-k})\alpha }{1-\sqrt{1-2\frac{\upsilon -\gamma }{\upsilon +\mu }}}\le \left({1-\sqrt{1-2\frac{\upsilon -\gamma }{\upsilon +\mu }}}\right)R\), we have
Hence, by Lemma 16, we obtain (68) for the case of \(t=k+1\). This completes the proof. \(\square\)
1.2 Proof of Lemma 2
From (5) it is obvious that \([\nabla q_i(\beta ;\beta )]_j\) is independent across \(i\in [n]\) for fixed \(j\in [d]\). Next, we prove the sub-exponential property of each coordinate.
Note that
and
For convenience, we let \(\nabla q_{i,j}\) denote \([\nabla q_i(\beta ,\beta ))]_j\) and \(\nabla q_{j}\) denote \({\mathbb {E}}[\nabla q_i(\beta ,\beta ))]_j\).
By the symmetrization lemma in Lemma 8, we have the following for any \(t>0\)
where \(\epsilon\) is a Rademacher random variable.
Next, we use Lemma 9 with \(f(y_{i,j})=y_{i,j}\), \({\mathcal {F}}=\{f\}\), \(\phi _i(v)=[2w_\beta (y_i)-1]v\) and \(\phi (v)=\exp (u\cdot v)\). It is easy to see that \(\phi _i\) is 1-Lipschitz. Thus, by Lemma 9 we have
By the formulation of the model, we have \(y_{i, j}=z_i \beta ^*_{j}+v_{i,j}\), where \(z_i\) is a Rademacher random variable and \(v_{i,j}\sim {\mathcal {N}}(0, \sigma ^2)\). It is easy to see that \(y_{i,j}\) is sub-Gaussian and
for some absolute constants \(C, C'\), where the last inequality is due to the facts that \(\Vert z_j\beta _j^*\Vert _{\psi _2}\le |\beta _j^*|\) and \(\Vert v_{i,j}\Vert _{\psi _2}\le C''\sigma ^2\) for some \(C''>0\).
Since \(|\epsilon y_{i,j}|=|y_{i,j}|\), \(\Vert \epsilon y_{i,j}\Vert _{\psi _2}=\Vert y_{i,j}\Vert _{\psi _2}\) and \({\mathbb {E}}(\epsilon y_{i,j})=0\), by Lemma 5.5 in Vershynin (2010) we have that for any \(u'\) there exists a constant \(C^{(4)}>0\) such that
Thus, for any \(t>0\) we get
for some constant \(C^{(5)}\). Therefore, in total we have the following for some constant \(C^{(6)}>0\)
Combining this with Lemma 10 and the definition, we know that \(\nabla q_{i,j}\) is \(O(\sqrt{\Vert \beta ^*\Vert _\infty ^2+\sigma ^2})\)-sub-exponential.
1.3 Proof of Lemma 4
From (7) it is obvious that \([\nabla q_i(\beta ;\beta )]_j\) is independent across \(i\in [n]\) for any fixed \(j\in [d]\). Next, we prove the sub-exponential property.
Note that \({\mathbb {E}}\nabla q_{i,j} = {\mathbb {E}} 2w_\beta (x, y)y\cdot x_j-\beta _j\). Thus, we have
For term A and any \(t>0\), we have
Using Lemma 9 on \(f(y_ix_{i,j})=y_ix_{i,j}\), \({\mathcal {F}}=f\), \(\phi _i(v)= 2w_\beta (x,y)v\) and \(\phi (v)=\exp (uv)\), we have
Note that since \(y_i=z_i\langle \beta ^*,x_i\rangle +v_i\) and \(\Vert z_i\langle \beta ^*,x_i\rangle \Vert _{\psi _2}=\Vert \langle \beta ^*,x_i\rangle \Vert _{\psi _2}\le C\Vert \beta ^*\Vert _2\) and \(\Vert v_i\Vert _{\psi _2}\le C'\sigma\) for some constants \(C, C'>0\), by Lemma 14 we know that there exists a constant \(C''>0\) such that
Thus, by Lemma 13 we have
For term B, we have
where \(x_j, x_k\sim {\mathcal {N}}(0, 1)\). Now, by Lemma 13 we have \(\Vert x_jx_k\beta _k\Vert _{\psi _1}\le |\beta _k|C^{(5)}\) for some constant \(C^{(5)}>0\). Thus, we get \(\Vert \sum _{k=1}^d x_jx_k\beta _k\Vert _{\psi _1}\le C^{(5)}\Vert \beta \Vert _1\).
Also, we know that \(\Vert \beta \Vert _1\le \sqrt{s}\Vert \beta \Vert _2\), since by assumption \(\Vert \beta \Vert _0=s\). Furthermore, we have \(\Vert \beta \Vert _2\le \Vert \beta ^*\Vert _2+\Vert \beta ^*-\beta \Vert _2\le (1+\frac{1}{32})\Vert \beta ^*\Vert _2\), since \(\beta \in {\mathcal {B}}\) (by assumption). From Lemma 12, we get \(\Vert B\Vert _{\psi _1}\le C^{(6)}\sqrt{s}\Vert \beta ^*\Vert _2\) with some constant \(C^{(6)}>0\).
Thus, we know that there exist some constants \(C^{(7)}>0\) and \(C^{(8)}>0\) such that
This means that \(\nabla q_{i,j}\) is \(O(\max \{\Vert \beta ^*\Vert _2^2+\sigma ^2, 1, \sqrt{s}\Vert \beta ^*\Vert _2\})\) sub-exponential.
1.4 Proof of Lemma 6
For simplicity, we use the notations \({\bar{m}}^i=m_\beta (x_i^{\text {obs}}, y_i)\), \({\bar{m}}=m_\beta (x^{\text {obs}}, y)\), \({\bar{K}}^i=K_\beta (x_i^{\text {obs}}, y_i)\), and \({\bar{K}}=K_\beta (x^{\text {obs}}, y)\). Then, we have
For the j-th coordinate of A, we have
We note that \({\bar{m}}_j\) is a zero-mean sub-Gaussian random variable with \(\Vert {\bar{m}}_j\Vert _{\psi _2}\le C(1+kr)\) (see Lemma B.3 in Wang et al. (2015))
Lemma 18
Under the assumption of Lemma 6, for each \(j\in [d]\), \({\bar{m}}_j\) is sub-Gaussian with mean zero and \(\Vert {\bar{m}}_j\Vert _{\psi _2}\le C(1+kr)\).
Thus, by Lemma 13 we have
where the last inequality is due to the fact that \(y=\langle \beta ^*, x\rangle +v\). Thus, \(\Vert y\Vert _{\psi _2}^2\le C_3(\Vert \langle \beta ^*, x\rangle \Vert _{\psi _2}^2+\Vert v\Vert _{\psi _2}^2)\) for some \(C_3\).
For term B, we have
For term C, we have the following [by Example 5.8 in Vershynin (2010)]
For term D, by Lemma 18 and 13 we have
Since \(\beta \in {\mathcal {B}}\), we get \(\Vert \beta \Vert _1\le \sqrt{s}\Vert \beta \Vert _2\le (1+k)\sqrt{s}\Vert \beta ^*\Vert _2\). Thus, we have
For term E, since \(1-z_i\in [0,1]\), we have \(\Vert (1-z_{i,j}){\bar{m}}^i_j\Vert _{\psi _2}\le \Vert {\bar{m}}^i_j\Vert _{\psi _2}\le C(1+kr)\). Hence, by Lemma 13 we get
This gives us
By Lemma 12, we get