# Robust and sparse regression in generalized linear model by stochastic optimization

## Abstract

The generalized linear model (GLM) plays a key role in regression analyses. In high-dimensional data, the sparse GLM has been used, but it is not robust against outliers. Recently, robust methods have been proposed for specific examples of the sparse GLM. Among them, we focus on the robust and sparse linear regression based on the \(\gamma\)-divergence. The estimator based on the \(\gamma\)-divergence has strong robustness under heavy contamination. In this paper, we extend the robust and sparse linear regression based on the \(\gamma\)-divergence to the robust and sparse GLM based on the \(\gamma\)-divergence, with a stochastic optimization approach to obtain the estimate. We adopt the randomized stochastic projected gradient descent as the stochastic optimization approach and extend the established convergence property to the classical first-order necessary condition. By virtue of the stochastic optimization approach, we can efficiently estimate parameters for very large problems. In particular, we show linear regression, logistic regression and Poisson regression with \(L_1\) regularization in detail as specific examples of the robust and sparse GLM. In numerical experiments and real data analysis, the proposed method outperformed comparative methods.

## Keywords

Sparse, Robust, Divergence, Stochastic gradient descent, Generalized linear model

## 1 Introduction

Regression analysis is a fundamental tool in data analysis. The generalized linear model (GLM) (Nelder and Wedderburn 1972; McCullagh and Nelder 1989) is often used and includes many important regression models such as linear regression, logistic regression and Poisson regression. Recently, sparse modeling has become popular in GLM to treat high-dimensional data and, for some specific examples of GLM, robust methods have also been incorporated [linear regression: Khan et al. (2007), Alfons et al. (2013); logistic regression: Bootkrajang and Kabán (2013), Chi and Scott (2014)].

Kawashima and Fujisawa (2017) proposed a robust and sparse regression based on the \(\gamma\)-divergence (Fujisawa and Eguchi 2008), which has the strong robustness property that the latent bias can be sufficiently small even under heavy contamination. The proposed method showed better performance than past methods by virtue of this strong robustness. A coordinate descent algorithm combined with the majorization–minimization algorithm (MM algorithm) (Hunter and Lange 2004) was constructed as an efficient estimation procedure for linear regression, but it is not always useful for GLM. In particular, when we consider the Poisson regression with \(L_1\) regularization based on the \(\gamma\)-divergence, the objective function includes a hypergeometric series and demands high computational cost. To overcome this problem, we propose a new estimation procedure with a stochastic optimization approach, which largely reduces the computational cost and is easily applicable to any example of GLM. Among many stochastic optimization approaches, we adopt the randomized stochastic projected gradient descent (RSPG) proposed by Ghadimi et al. (2016).

In Sect. 2, we review the robust and sparse regression based on the \(\gamma\)-divergence. In Sect. 3, the RSPG is explained with regularized expected risk minimization. In Sect. 4, an online algorithm is proposed for GLM and the robustness of the online algorithm is described with some typical examples of GLM. In Sect. 5, the convergence property of the RSPG is extended to the classical first-order necessary condition. In Sects. 6 and 7, numerical experiments and real data analysis illustrate that the proposed method performs better than comparative methods. Concluding remarks are given in Sect. 8.

## 2 Regression via \(\gamma\)-divergence

### 2.1 Regularized empirical risk minimization

*g* is the underlying probability density function and *f* is a parametric probability density function. Let us define the \(\gamma\)-cross entropy for regression given by

*g*(*x*, *y*). The \(\gamma\)-cross entropy can be empirically estimated by

### 2.2 MM algorithm for \(\gamma\)-regression

Kawashima and Fujisawa (2017) proposed an iterative estimation algorithm for (2) based on the MM algorithm (Hunter and Lange 2004). It has a monotone decreasing property, i.e., the objective function monotonically decreases at each iterative step, which leads to numerical stability and efficiency. In particular, the linear regression with \(L_1\) penalty was studied in detail.

*m*-th iterative step for \(m=0,1,2,\ldots\). MM algorithm optimizes the majorization function instead of the objective function as follows:

### 2.3 Sparse \(\gamma\)-Poisson regression case

*n* approximate calculations of the hypergeometric series at each iterative step in the sub-problem \(\mathop {\mathrm{argmin}}\limits _{\theta } h_{MM} (\theta |\theta ^{(m)})\). Therefore, it requires a high computational cost, especially for very large problems. We need another optimization approach to overcome such problems. In this paper, we consider minimizing the regularized expected risk (1) directly by a stochastic optimization approach. In what follows, we refer to the sparse \(\gamma\)-regression in GLM as the sparse \(\gamma\)-GLM.

## 3 Stochastic optimization approach for regularized expected risk minimization

*l* is a loss function with a parameter \(\theta\) and \(\varPsi (\theta )\) is bounded below over \(\varTheta\) by \(\varPsi ^* > - \infty\). The stochastic optimization approach solves (5) sequentially. More specifically, we draw a sequence of i.i.d. paired samples \((x_1,y_1),(x_2,y_2),\ldots ,(x_t,y_t),\ldots\) and, at the *t*-th time, update the parameter \(\theta ^{(t)}\) based on the latest paired sample \((x_t,y_t)\) and the previously updated parameter \(\theta ^{(t-1)}\). Therefore, it requires low computational complexity per iteration, and stochastic optimization can scale well for very large problems.

### 3.1 Stochastic gradient descent

*l* is convex (possibly non-differentiable) and \(\eta _t\) is set to be appropriate, e.g., \(\eta _t= \mathcal{O} \left( \frac{1}{ \sqrt{t}} \right)\), under some mild conditions, the convergence property was established for the average of the iterates, i.e., \(\bar{\theta }_{T}=\frac{1}{T} \sum _{t=1}^T \theta ^{(t)}\) as follows [see, e.g., Bubeck (2015)]:

These methods assume that the loss function is convex to establish the convergence property, but the loss function is non-convex in our problem (1). Hence, we cannot adopt these methods directly. Recently, for a non-convex loss function with a convex regularization term, the randomized stochastic projected gradient (RSPG) was proposed by Ghadimi et al. (2016). Under some mild conditions, its convergence property was established. Therefore, we consider applying the RSPG to our problem (1).
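For contrast with the RSPG, the convex-case scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the update (6) itself: it assumes a squared loss, handles the \(L_1\) penalty by soft-thresholding, uses the step size \(\eta_t = \eta_0/\sqrt{t}\), and returns the averaged iterate \(\bar{\theta}_T\). The function names and the constant `eta0` are hypothetical.

```python
import numpy as np

def soft_threshold(v, thr):
    """Elementwise soft-thresholding, the proximal map of thr * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def averaged_prox_sgd(X, y, lam=0.01, eta0=0.1):
    """One pass of proximal SGD on a squared loss; returns the iterate average."""
    n, p = X.shape
    theta = np.zeros(p)
    theta_sum = np.zeros(p)
    for t in range(1, n + 1):
        x_t, y_t = X[t - 1], y[t - 1]            # latest paired sample
        eta_t = eta0 / np.sqrt(t)                # eta_t = O(1 / sqrt(t))
        grad = (x_t @ theta - y_t) * x_t         # gradient of 0.5 * (x' theta - y)^2
        theta = soft_threshold(theta - eta_t * grad, eta_t * lam)
        theta_sum += theta
    return theta_sum / n                          # averaged iterate
```

Averaging is justified here only because the assumed loss is convex; for the non-convex \(\gamma\)-loss this guarantee is not available.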

### 3.2 Randomized stochastic projected gradient

*t*-th time, \((x_{t,i},y_{t,i})\) is the *i*-th mini-batch sample at the *t*-th time and *w* is a continuously differentiable and \(\alpha\)-strongly convex function satisfying \(\langle a - b , \nabla w(a) - \nabla w(b) \rangle \ge \alpha \Vert a - b \Vert ^2\) for \(a , b \in \varTheta\). When \(w(\theta ) = \frac{1}{2} || \theta ||_2^2\), i.e., \(V(\theta , \theta ^{(t)}) = \frac{1}{2} || \theta - \theta ^{(t)} ||_2^2\), (7) is almost equal to (6).
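To make the Euclidean case concrete, assume that the update (7) takes the composite proximal form of Ghadimi et al. (2016), with \(\varTheta = \mathbb{R}^p\), penalty \(\lambda \Vert \theta \Vert _1\), and \(\bar{g}_t\) denoting the mini-batch average of the stochastic gradients (the symbols \(\bar{g}_t\) and \(S\) are introduced here for illustration only). Completing the square then gives a closed-form soft-thresholding step:

```latex
\theta^{(t+1)}
  = \operatorname*{argmin}_{\theta \in \mathbb{R}^p}
    \left\{ \langle \bar{g}_t , \theta \rangle
      + \frac{1}{2\eta_t} \Vert \theta - \theta^{(t)} \Vert_2^2
      + \lambda \Vert \theta \Vert_1 \right\}
  = S\!\left( \theta^{(t)} - \eta_t \bar{g}_t ,\; \eta_t \lambda \right),
\qquad
S(z, c)_j = \operatorname{sign}(z_j) \max\left( |z_j| - c ,\, 0 \right).
```

Under these assumptions, each iteration costs one mini-batch gradient evaluation plus an elementwise thresholding.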

Here, we make two remarks on how the RSPG differs from the SGD. One is that the RSPG uses the mini-batch strategy, i.e., taking multiple samples at the *t*-th time. The other is that the RSPG randomly selects the output \(\hat{\theta }\) from \(\left\{ \theta ^{(1)}, \ldots , \theta ^{(T)} \right\}\) according to a certain probability distribution instead of taking the average of the iterates. This is because, in non-convex stochastic optimization, later iterates do not always gather around a local minimum, so the average of the iterates does not work as it does in the convex case.

Next, we show the implementation of the RSPG, given by Algorithm 1. However, the output of Algorithm 1 has a large deviation because only one final output is selected via some probability mass function \(P_R\). Therefore, Ghadimi et al. (2016) also proposed the two-phase RSPG (2-RSPG), which has a post-optimization phase. In the post-optimization phase, multiple outputs are selected and validated to determine the final output, as shown in Algorithm 2. This can be expected to achieve a better complexity result for finding an \((\epsilon ,\varLambda )\)-solution, i.e., Prob\(\left\{ C(\theta ^{(R)}) \le \epsilon \right\} \ge 1- \varLambda\), where *C* is some convergence criterion, for some \(\epsilon > 0\) and \(\varLambda \in (0,1)\). For more detailed descriptions and proofs, we refer to Sect. 4 of Ghadimi et al. (2016).
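The two procedures described above can be sketched roughly as follows. This is not a transcription of Algorithms 1 and 2: it assumes the Euclidean distance-generating function, an \(L_1\) penalty, a constant step size `eta`, a uniform choice of the output index (the actual \(P_R\) is the one specified in Theorem 1), and a post-optimization rule that keeps the candidate with the smallest estimated gradient-mapping norm. All function and argument names are hypothetical.

```python
import numpy as np

def soft_threshold(v, thr):
    """Proximal map of thr * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def rspg(grad_fn, sample_fn, theta0, n_iter, batch_size, eta, lam, rng):
    """Single RSPG run: stop at a random iteration R and return that iterate."""
    theta = np.asarray(theta0, dtype=float).copy()
    r = int(rng.integers(1, n_iter + 1))      # R uniform on {1, ..., n_iter} (assumption)
    for _ in range(r):
        X, y = sample_fn(batch_size, rng)     # fresh mini-batch
        g = grad_fn(theta, X, y)              # mini-batch averaged stochastic gradient
        theta = soft_threshold(theta - eta * g, eta * lam)
    return theta

def gradient_mapping_norm(grad_fn, theta, X, y, eta, lam):
    """Norm of (theta - prox_step(theta)) / eta, estimated on the given samples."""
    step = soft_threshold(theta - eta * grad_fn(theta, X, y), eta * lam)
    return float(np.linalg.norm((theta - step) / eta))

def two_phase_rspg(grad_fn, sample_fn, theta0, n_iter, batch_size, eta, lam,
                   n_cand, n_post, rng):
    """Optimization phase: n_cand independent RSPG runs.
    Post-optimization phase: keep the candidate with the smallest
    gradient-mapping norm estimated on n_post fresh samples."""
    cands = [rspg(grad_fn, sample_fn, theta0, n_iter, batch_size, eta, lam, rng)
             for _ in range(n_cand)]
    X_post, y_post = sample_fn(n_post, rng)
    scores = [gradient_mapping_norm(grad_fn, c, X_post, y_post, eta, lam)
              for c in cands]
    return cands[int(np.argmin(scores))]
```

Here `grad_fn(theta, X, y)` is expected to return the averaged gradient of the loss over the mini-batch and `sample_fn(m, rng)` to return `m` fresh paired samples. The validation step is what reduces the chance of returning an early, poorly converged iterate.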

## 4 Online robust and sparse GLM

In this section, we show the sparse \(\gamma\)-GLM with the stochastic optimization approach on three specific examples: linear regression, logistic regression, and Poisson regression with \(L_1\) regularization. In what follows, we refer to the sparse \(\gamma\)-GLM with the stochastic optimization approach as the online sparse \(\gamma\)-GLM.

To implement our methods, we need to determine some tuning parameters, e.g., the step size \(\eta _t\), mini-batch size \(m_t\). In Sect. 5, we discuss how to determine some tuning parameters in detail.

### 4.1 Online sparse \(\gamma\)-linear regression

*t*-th time. The conditional probability density \(f(y_{t,k}|x_{t,k};\theta ^{(t)})\) can be expected to be sufficiently small. We see from \(f(y_{t,k}|x_{t,k};\theta ^{(t)}) \approx 0\) and (10) that

### 4.2 Online sparse \(\gamma\)-logistic regression

### 4.3 Online sparse \(\gamma\)-Poisson regression

### Lemma 1

If the term \(\mu _{x_{t,i}}(\beta _0^{(t)}, \beta ^{(t)})\) is bounded, \(\sum _{y=0}^\infty f(y|x_{t,i};\theta ^{(t)})^{1+\gamma }\) and \(\sum _{y=0}^\infty (y- y_{t,i} ) f(y|x_{t,i};\theta ^{(t)})^{1+\gamma }\) converge.

The proof is in Appendix 1.

Consequently, we obtain the update algorithm shown in Algorithm 5. In a similar way to online sparse \(\gamma\)-linear regression, we can also see the robustness for the parameters \(\beta _0\) and \(\beta\) in online sparse \(\gamma\)-Poisson regression (12). Moreover, this update algorithm requires at most \(2n=2 \times \sum _{t=1}^T m_t\) approximate calculations of the hypergeometric series in Algorithm 5, i.e., at most twice the sample size. Therefore, we can achieve a significant reduction in computational complexity.
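As a rough illustration of the two ingredients discussed above, the sketch below approximates the series \(\sum _{y=0}^\infty f(y|x;\theta )^{1+\gamma }\) for a Poisson probability function with mean `mu` by plain truncation, and evaluates the factor \(f(y|x;\theta )^{\gamma }\), which becomes negligible for an outlying response. The truncation rule, the tolerance, and the function names are assumptions made for this example and are not the approximation used in Algorithm 5.

```python
import math

def poisson_log_pmf(y, mu):
    """Log of the Poisson probability function with mean mu > 0."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def poisson_power_sum(mu, gamma, tol=1e-12, max_terms=100000):
    """Approximate sum_{y >= 0} f(y; mu)^(1 + gamma) by truncation.

    Terms decrease once y exceeds mu, so the loop stops when a term past
    that point is negligible relative to the running total (a heuristic rule).
    """
    total = 0.0
    for y in range(max_terms):
        term = math.exp((1.0 + gamma) * poisson_log_pmf(y, mu))
        total += term
        if y > mu and term < tol * total:
            break
    return total

def gamma_weight(y, mu, gamma):
    """The factor f(y; mu)^gamma; close to zero when y is far from mu."""
    return math.exp(gamma * poisson_log_pmf(y, mu))
```

With `gamma = 0` the sum reduces to the total probability, which gives a simple sanity check. A response far from the fitted mean receives a weight close to zero, which is the mechanism behind the robustness noted above.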

## 5 Convergence property of online sparse \(\gamma\)-GLM

In this section, we show the global convergence property of the RSPG established by Ghadimi et al. (2016). Moreover, we extend it to the classical first-order necessary condition, i.e., at a local minimum, the directional derivative, if it exists, is non-negative for any direction [see, e.g., Borwein and Lewis (2010)].

*w* is a continuously differentiable and \(\alpha\)-strongly convex function satisfying \(\langle a - b , \nabla w(a) - \nabla w(b) \rangle \ge \alpha \Vert a - b \Vert ^2\) for \(a , b \in \varTheta\). We make the following assumptions.

### Assumption 1

*L*-Lipschitz continuous for some \(L>0\), i.e.,

### Assumption 2

### Theorem 1

[Global Convergence Property in Ghadimi et al. (2016)]

*t*, and the probability mass function \(P_{R}\) is chosen such that for any \(t=1,\ldots ,T\),

*R* and past samples \((x_{t,i}, y_{t,i}) \ (t=1,\ldots ,T; \ i=1,\ldots , m_{t} )\) and \(D_{ \varPsi }= \left[ \frac{ \varPsi ( \theta ^{(1)} ) - \varPsi ^* }{L} \right] ^{\frac{1}{2}}\).

### Proof

See Ghadimi et al. (2016), Theorem 2. \(\square\)

In particular, Ghadimi et al. (2016) investigated the constant step size and mini-batch size policy as follows.

### Corollary 1

[Global Convergence Property with constant step size and mini-batch size in Ghadimi et al. (2016)]

*m* is given by

*N* is relatively large, the optimal choice of \(\tilde{D}\) would be \(D_{\varPsi }\) and (18) reduces to

### Proof

See Ghadimi et al. (2016), Corollary 4. \(\square\)

Finally, we extend (18) to the classical first-order necessary condition as follows.

### Theorem 2

The Modified Global Convergence Property

*N*, we can expect \(P_{X,R} \approx 0\) with probability of (19) and (20) in RSPG and 2-RSPG, respectively. Then, for any direction \(\delta\) and \(\theta ^{(R)} \in \mathrm{relint} \left( \varTheta \right)\), we have

The proof is in Appendix 2.

Here we discuss Assumptions 1 and 2 in the case of online sparse \(\gamma\)-GLM. For all examples in Sect. 4, \(l((x,y);\theta )\) is twice continuously differentiable, so \(\nabla l((x,y);\theta )\) is locally Lipschitz continuous over a compact domain. Therefore, Assumption 1 holds locally. In particular, Assumption 1 holds globally, i.e., the gradient is (globally) Lipschitz continuous, in online sparse \(\gamma\)-logistic regression. By the expressions of the stochastic gradients (10), (11) and (12), it is easy to verify that (14) in Assumption 2 holds. On the other hand, (15) in Assumption 2 is generally hard to verify precisely. As an alternative, using finite samples, we can check in advance that (15) holds in practice.
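A finite-sample check of this kind might look like the sketch below. It assumes that (15) is a bounded-variance condition on the stochastic gradient, so that \(\tau ^2\) can be approximated at a given \(\theta\) by the average squared deviation of per-sample gradients from their mean, and it takes the largest value over a few trial points. The function names and the callable `per_sample_grad` are hypothetical.

```python
import numpy as np

def estimate_tau2(per_sample_grad, theta, X, y):
    """Empirical variance of the stochastic gradient at theta.

    per_sample_grad(theta, x_i, y_i) must return the gradient of the loss
    for a single paired sample as a 1-D array.
    """
    G = np.stack([per_sample_grad(theta, x_i, y_i) for x_i, y_i in zip(X, y)])
    centered = G - G.mean(axis=0)                 # deviation from the mean gradient
    return float(np.mean(np.sum(centered ** 2, axis=1)))

def max_tau2(per_sample_grad, thetas, X, y):
    """Largest empirical variance over a collection of trial parameter values."""
    return max(estimate_tau2(per_sample_grad, th, X, y) for th in thetas)
```

A value that stays moderate across the trial points is only practical evidence for the condition, not a proof of it.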

## 6 Numerical experiments

In this section, we present the numerical results of online sparse \(\gamma\)-linear regression. We compared online sparse \(\gamma\)-linear regression based on the RSPG with online sparse \(\gamma\)-linear regression based on the SGD, which does not guarantee convergence in the non-convex case. The RSPG has two variants, which are shown in Algorithms 1 and 2. In this experiment, we adopted the 2-RSPG for numerical stability. In what follows, we refer to the 2-RSPG as the RSPG. As a comparative method, we implemented the SGD with the same parameter setting described in Sect. 3.1. All results were obtained in R version 3.3.0 on an Intel Core i7-4790K machine.

### 6.1 Linear regression models for simulation

Outliers were incorporated into the simulations. We set the outlier ratio to \(\epsilon =0.2\) and used an outlier pattern in which the outliers were generated around the middle part of the explanatory variable: for the outliers, the explanatory variables were generated from \(N(0,0.5^2)\) and the error terms were generated from \(N(20,0.5^2)\).

### 6.2 Performance measure

### 6.3 Initial point and tuning parameter

In our method, we need an initial point and some tuning parameters to obtain the estimate. Therefore, we used \(N_{\text {init}}=200\) samples to estimate, in advance, an initial point and the other parameters *L* in (13) and \(\tau ^2\) in (15). We suggest the following ways to prepare an initial point. The estimate of another conventional robust and sparse regression method would give a good initial point. As another choice, the estimate of the RANSAC (random sample consensus) algorithm would also give a good initial point. In this experiment, we added noise to the estimate of the RANSAC and used it as an initial point.
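For readers unfamiliar with RANSAC, a generic version for linear regression is sketched below. It is not the implementation used in this experiment: the number of trials, the inlier threshold, and the minimal subset size are illustrative assumptions, and the function name is hypothetical.

```python
import numpy as np

def ransac_linear(X, y, n_trials=200, threshold=1.0, rng=None):
    """Generic RANSAC for y = b0 + x' b; returns (b0, b) stacked in one array."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    best_inliers = None
    for _ in range(n_trials):
        idx = rng.choice(n, size=p + 1, replace=False)   # minimal random subset
        coef, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        inliers = np.abs(y - A @ coef) < threshold       # consensus set for this fit
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit by least squares on the largest consensus set.
    coef, *_ = np.linalg.lstsq(A[best_inliers], y[best_inliers], rcond=None)
    return coef
```

The returned coefficients could then be perturbed with noise, as described above, before being used as \(\theta ^{(1)}\).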

For estimating *L* and \(\tau ^2\), we followed the procedure in Sect. 6 of Ghadimi et al. (2016). Moreover, we used the following values of the tuning parameters in this experiment. The parameter \(\gamma\) in the \(\gamma\)-divergence was set to 0.1. The parameter \(\lambda\) of \(L_1\) regularization was set to \(10^{-1}, 10^{-2}, 10^{-3}\).

The RSPG needed the number of candidates \(N_{\text {cand}}\) and post-samples \(N_{\text {post}}\) for post-optimization as described in Algorithm 2. Then, we used \(N_{\text {cand}}=5\) and \(N_{\text {post}}= \left\lceil N/10 \right\rceil\).

### 6.4 Result

EmpRisk, ExpRisk, and computation time for \(\lambda =10^{-3}\)

| Methods | \(N=10,000\), \(p=1000\) | | | \(N=30,000\), \(p=1000\) | | |
|---|---|---|---|---|---|---|
| | EmpRisk | ExpRisk | Time | EmpRisk | ExpRisk | Time |
| RSPG | \(-0.629\) | \(-0.628\) | 75.2 | \(-0.692\) | \(-0.691\) | 78.3 |
| SGD with 1 mini-batch | \(-0.162\) | \(-0.155\) | 95.9 | \(-0.365\) | \(-0.362\) | 148 |
| SGD with 10 mini-batch | \(1.1\times 10^{-2}\) | \(1.45\times 10^{-2}\) | 73.2 | \(5.71\times 10^{-2}\) | \(5.6\times 10^{-2}\) | 73.7 |
| SGD with 30 mini-batch | \(4.79\times 10^{-2}\) | \(5.02\times 10^{-2}\) | 71.4 | \(5.71\times 10^{-2}\) | \(-5.6\times 10^{-2}\) | 73.7 |
| SGD with 50 mini-batch | \(6.03\times 10^{-2}\) | \(6.21\times 10^{-2}\) | 71.1 | \(-3.98\times 10^{-2}\) | \(-3.88\times 10^{-2}\) | 238 |

| Methods | \(N=10,000\), \(p=2000\) | | | \(N=30,000\), \(p=2000\) | | |
|---|---|---|---|---|---|---|
| | EmpRisk | ExpRisk | Time | EmpRisk | ExpRisk | Time |
| RSPG | \(-0.646\) | \(-0.646\) | 117 | \(-0.696\) | \(-0.696\) | 125 |
| SGD with 1 mini-batch | 0.187 | 0.194 | 145 | \(-3.89\times 10^{-2}\) | \(-3.56\times 10^{-2}\) | 251 |
| SGD with 10 mini-batch | 0.428 | 0.431 | 99.2 | 0.357 | 0.359 | 112 |
| SGD with 30 mini-batch | 0.479 | 0.481 | 95.7 | 0.442 | 0.443 | 101 |
| SGD with 50 mini-batch | 0.496 | 0.499 | 166 | 0.469 | 0.47 | 337 |

EmpRisk, ExpRisk, and computation time for \(\lambda =10^{-2}\)

| Methods | \(N=10,000\), \(p=1000\) | | | \(N=30,000\), \(p=1000\) | | |
|---|---|---|---|---|---|---|
| | EmpRisk | ExpRisk | Time | EmpRisk | ExpRisk | Time |
| RSPG | \(-0.633\) | \(-0.632\) | 75.1 | \(-0.65\) | \(-0.649\) | 78.4 |
| SGD with 1 mini-batch | \(-0.322\) | \(-0.322\) | 96.1 | \(-0.488\) | \(-0.487\) | 148 |
| SGD with 10 mini-batch | 1.36 | 1.37 | 73.4 | 0.164 | 0.165 | 79.7 |
| SGD with 30 mini-batch | 2.61 | 2.61 | 71.6 | 1.34 | 1.34 | 73.9 |
| SGD with 50 mini-batch | 3.08 | 3.08 | 409 | 1.95 | 1.95 | 576 |

| Methods | \(N=10,000\), \(p=2000\) | | | \(N=30,000\), \(p=2000\) | | |
|---|---|---|---|---|---|---|
| | EmpRisk | ExpRisk | Time | EmpRisk | ExpRisk | Time |
| RSPG | \(-0.647\) | \(-0.646\) | 117 | \(-0.66\) | \(-0.66\) | 125 |
| SGD with 1 mini-batch | \(-0.131\) | \(-0.13\) | 144 | \(-0.436\) | \(-0.435\) | 250 |
| SGD with 10 mini-batch | 3.23 | 3.23 | 99.1 | 0.875 | 0.875 | 112 |
| SGD with 30 mini-batch | 5.63 | 5.63 | 95.6 | 3.19 | 3.19 | 100 |
| SGD with 50 mini-batch | 6.52 | 6.53 | 503 | 4.38 | 4.38 | 675 |

EmpRisk, ExpRisk, and computation time for \(\lambda =10^{-1}\)

| Methods | \(N=10,000\), \(p=1000\) | | | \(N=30,000\), \(p=1000\) | | |
|---|---|---|---|---|---|---|
| | EmpRisk | ExpRisk | Time | EmpRisk | ExpRisk | Time |
| RSPG | \(-0.633\) | \(-0.632\) | 74.6 | \(-0.64\) | \(-0.639\) | 78.1 |
| SGD with 1 mini-batch | \(-0.411\) | \(-0.411\) | 95.6 | \(-0.483\) | \(-0.482\) | 148 |
| SGD with 10 mini-batch | 0.483 | 0.483 | 72.9 | \(-4.56\times 10^{-2}\) | \(-4.5\times 10^{-2}\) | 79.6 |
| SGD with 30 mini-batch | 1.53 | 1.53 | 71.1 | 0.563 | 0.563 | 73.7 |
| SGD with 50 mini-batch | 2.39 | 2.39 | 70.8 | 0.963 | 0.963 | 238 |

| Methods | \(N=10,000\), \(p=2000\) | | | \(N=30,000\), \(p=2000\) | | |
|---|---|---|---|---|---|---|
| | EmpRisk | ExpRisk | Time | EmpRisk | ExpRisk | Time |
| RSPG | \(-0.654\) | \(-0.653\) | 116 | \(-0.66\) | \(-0.66\) | 130 |
| SGD with 1 mini-batch | \(-0.462\) | \(-0.461\) | 144 | \(-0.559\) | \(-0.558\) | 262 |
| SGD with 10 mini-batch | 0.671 | 0.672 | 98.9 | \(-9.71\times 10^{-2}\) | \(-9.62\times 10^{-2}\) | 116 |
| SGD with 30 mini-batch | 2.43 | 2.44 | 95.4 | 0.697 | 0.697 | 104 |
| SGD with 50 mini-batch | 4.02 | 4.02 | 165 | 1.32 | 1.32 | 340 |

## 7 Application to real data

We applied our method ‘online sparse \(\gamma\)-Poisson’ to the real data set ‘Online News Popularity’ [Fernandes et al. (2015)], which is available at https://archive.ics.uci.edu/ml/datasets/online+news+popularity. We compared our method with sparse Poisson regression, which was implemented with the R package ‘glmnet’ with the default parameter setting.

*i* is the index of the randomly selected sample, \(y_{i}\) is the response variable of the *i*-th randomly selected sample, and \(t_{i}\) is the offset term of the *i*-th randomly selected sample.

*L* in (13) and \(\tau ^2\) in (15) to calculate in advance. In this experiment, we used the estimate of the RANSAC. For estimating *L*, we followed the way in Ghadimi et al. (2016), pp. 298–299. Moreover, we used the following values of tuning parameters in this experiment. The parameter \(\gamma\) in the \(\gamma\)-divergence was set to 0.1, 0.5, and 1.0. The parameter \(\lambda\) of \(L_1\) regularization was selected by the robust cross-validation proposed by Kawashima and Fujisawa (2017). The robust cross-validation was given by:

*i*-th observation and \(\gamma _0\) is an appropriate tuning parameter. In this experiment, \(\gamma _0\) was set to 1.0. The mini-batch size was set to 100, 200, 500. The RSPG needed the number of candidates \(N_{\text {cand}}\) and the number of post-samples \(N_{\text {post}}\) for post-optimization as described in Algorithm 2. We used \(N_{\text {cand}}=5\) and \(N_{\text {post}}= \left\lceil N/10 \right\rceil\). We show the best result of our method and of the comparative method in Table 4. All results were obtained in R version 3.3.0 on an Intel Core i7-4790K machine. Table 4 shows that our method performed better than sparse Poisson regression.

Root trimmed mean squared prediction error in test samples

| Methods | Trimming fraction 100\(\alpha \%\) | | | | | |
|---|---|---|---|---|---|---|
| | \(5\%\) | \(10\%\) | \(15\%\) | \(20\%\) | \(25\%\) | \(30\%\) |
| Our method | 2419.3 | 1760.2 | 1423.7 | 1215.7 | 1064 | 948.9 |
| Sparse Poisson regression | 2457.2 | 2118.1 | 1902.5 | 1722.9 | 1562.5 | 1414.1 |
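A trimmed prediction error of this kind can be computed along the following lines. The sketch assumes the common definition in which the squared prediction errors are sorted and only the smallest \(h = \lfloor (1-\alpha ) n \rfloor\) of them are averaged; the exact convention for \(h\) used for the table may differ, and the function name is hypothetical.

```python
import numpy as np

def root_trimmed_mspe(y_true, y_pred, alpha):
    """Square root of the mean of the h smallest squared prediction errors."""
    sq_err = np.sort((np.asarray(y_true, dtype=float)
                      - np.asarray(y_pred, dtype=float)) ** 2)
    h = int(np.floor((1.0 - alpha) * sq_err.size))   # number of errors kept
    if h < 1:
        raise ValueError("alpha trims away every observation")
    return float(np.sqrt(np.mean(sq_err[:h])))
```

Trimming discards the largest errors, so a single badly predicted outlier in the test samples does not dominate the comparison between methods.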

## 8 Conclusions

We proposed the online robust and sparse GLM based on the \(\gamma\)-divergence. We applied a stochastic optimization approach to reduce the computational complexity and overcome the computational problem of the hypergeometric series in Poisson regression. We adopted the RSPG, which guarantees the global convergence property for non-convex stochastic optimization problems, as the stochastic optimization approach. We proved that the global convergence property can be extended to the classical first-order necessary condition. In this paper, linear/logistic/Poisson regression problems with \(L_1\) regularization were illustrated in detail. As a result, not only the Poisson case but also the linear/logistic cases can scale well for very large problems by virtue of the stochastic optimization approach. To the best of our knowledge, there has been no efficient method for the robust and sparse Poisson regression, but we have succeeded in proposing an efficient estimation procedure with an online strategy. The numerical experiments and real data analysis suggested that our methods performed well in terms of both accuracy and computational cost. However, there are still some problems in the Poisson regression problem, e.g., overdispersion (Dean and Lawless 1989) and zero-inflated Poisson (Lambert 1992). Therefore, it can be useful to extend the Poisson regression to the negative binomial regression and the zero-inflated Poisson regression in future work. Moreover, the accelerated RSPG was proposed by Ghadimi and Lan (2016), and we can adopt it as a stochastic optimization approach to achieve faster convergence than the RSPG.

## Notes

### Acknowledgements

This work was partially supported by JSPS KAKENHI Grant Number 17K00065.

## References

- Alfons, A., Croux, C., & Gelper, S. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. *The Annals of Applied Statistics*, *7*(1), 226–248.
- Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. *SIAM Journal on Imaging Sciences*, *2*(1), 183–202. https://doi.org/10.1137/080716542.
- Bootkrajang, J., & Kabán, A. (2013). Classification of mislabelled microarrays using robust sparse logistic regression. *Bioinformatics*, *29*(7), 870–877. https://doi.org/10.1093/bioinformatics/btt078.
- Borwein, J., & Lewis, A. S. (2010). *Convex analysis and nonlinear optimization: Theory and examples*. Berlin: Springer Science & Business Media.
- Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In *Proceedings of COMPSTAT’2010*, Springer, pp. 177–186.
- Bubeck, S. (2015). Convex optimization: Algorithms and complexity. *Found Trends in Machine Learning*, *8*(3–4), 231–357. https://doi.org/10.1561/2200000050.
- Chi, E. C., & Scott, D. W. (2014). Robust parametric classification and variable selection by a minimum distance criterion. *Journal of Computational and Graphical Statistics*, *23*(1), 111–128. https://doi.org/10.1080/10618600.2012.737296.
- Dean, C., & Lawless, J. F. (1989). Tests for detecting overdispersion in poisson regression models. *Journal of the American Statistical Association*, *84*(406), 467–472. https://doi.org/10.1080/01621459.1989.10478792. http://www.tandfonline.com/doi/abs/10.1080/01621459.1989.10478792.
- Duchi, J., & Singer, Y. (2009). Efficient online and batch learning using forward backward splitting. *Journal of Machine Learning Research*, *10*, 2899–2934. http://dl.acm.org/citation.cfm?id=1577069.1755882.
- Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the l1-ball for learning in high dimensions. In *Proceedings of the 25th International Conference on Machine Learning, ICML ’08*, ACM, New York, NY, USA, pp. 272–279. https://doi.org/10.1145/1390156.1390191.
- Duchi, J. C., Shalev-Shwartz, S., Singer, Y., & Tewari, A. (2010). Composite objective mirror descent. In *COLT 2010 - The 23rd Conference on Learning Theory*, pp. 14–26. http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf#page=22.
- Duchi, J. C., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. *Journal of Machine Learning Research*, *12*, 2121–2159. http://dblp.uni-trier.de/db/journals/jmlr/jmlr12.html#DuchiHS11.
- Fernandes, K., Vinagre, P., & Cortez, P. (2015). A proactive intelligent decision support system for predicting the popularity of online news. In F. Pereira, P. Machado, E. Costa, & A. Cardoso (Eds.), *Progress in artificial intelligence* (pp. 535–546). Cham: Springer International Publishing.
- Fujisawa, H., & Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. *Journal of Multivariate Analysis*, *99*(9), 2053–2081.
- Ghadimi, S., & Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. *Mathematical Programming*, *156*(1), 59–99. https://doi.org/10.1007/s10107-015-0871-8.
- Ghadimi, S., Lan, G., & Zhang, H. (2016). Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. *Mathematical Programming*, *155*(1–2), 267–305. https://doi.org/10.1007/s10107-014-0846-1.
- Hunter, D. R., & Lange, K. (2004). A tutorial on mm algorithms. *The American Statistician*, *58*(1), 30–37.
- Kanamori, T., & Fujisawa, H. (2015). Robust estimation under heavy contamination using unnormalized models. *Biometrika*, *102*(3), 559–572.
- Kawashima, T., & Fujisawa, H. (2017). Robust and sparse regression via \(\gamma\)-divergence. *Entropy*, *19*, 608. https://doi.org/10.3390/e19110608.
- Khan, J. A., Van Aelst, S., & Zamar, R. H. (2007). Robust linear model selection based on least angle regression. *Journal of the American Statistical Association*, *102*(480), 1289–1299.
- Kivinen, J., & Warmuth, M. K. (1995). Exponentiated gradient versus gradient descent for linear predictors. *Information and Computation*, *132*, 1–63.
- Lambert, D. (1992). Zero-inflated poisson regression, with an application to defects in manufacturing. *Technometrics*, *34*(1), 1–14. http://www.jstor.org/stable/1269547.
- McCullagh, P., & Nelder, J. (1989). *Generalized linear models, Second Edition. Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series*, Chapman & Hall. http://books.google.com/books?id=h9kFH2_FfBkC.
- Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. *Journal of the Royal Statistical Society Series A (General)*, *135*(3), 370–384. http://www.jstor.org/stable/2344614.
- Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. CORE Discussion Papers 2007076, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE). https://EconPapers.repec.org/RePEc:cor:louvco:2007076.
- Rockafellar, R. T. (1970). *Convex analysis. Princeton Mathematical Series*. Princeton: Princeton University Press.
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. *Journal of the Royal Statistical Society Series B*, pp. 267–288.
- Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimization. *Journal of Machine Learning Research*, *11*, 2543–2596.
- Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. *Journal of the Royal Statistical Society: Series B*, *67*(2), 301–320.