1 Introduction

Motivated by applications in areas as diverse as finance, image reconstruction, and curve estimation, a growing body of literature has focused on the constrained lasso (hereinafter referred to as classo); see, for example, He (2011), James et al. (2013), Zhou and Lange (2013), Hu et al. (2015b), Gaines et al. (2018), and James et al. (2020). The classo is defined as:

$$\begin{aligned} \arg \mathop {\min }\limits _\beta \sum \limits _{i = 1}^n {({y_i} - x_i^{\prime }\beta )^2} + n\sum \limits _{j = 1}^p {{\lambda _j}\left| {{\beta _j}} \right| } \ \text {subject to }{C_1}\beta = {b_1} \ \text {and} \ {C_2}\beta \le {b_2}, \end{aligned}$$
(1)

where \({y_i}\) is the ith element of \(y = {({y_1},{y_2}, \ldots ,{y_n})^\prime }\) and \({x_i}\) is the ith row of the design matrix \(X = {(x_{1}^{\prime },x_{2}^{\prime }, \ldots ,x_{n}^{\prime })^{\prime }}\). We assume that every column of X has been standardized and that the constraint matrices \(C_{1}\) and \(C_{2}\) have full row rank. \(\lambda _{j}\) is the penalty level (tuning parameter), which is always nonnegative.

Classo is a very flexible framework for imposing additional knowledge and structure onto the lasso coefficient estimates, which gives it a very wide range of applications. For instance, in economics, when predicting car sales, one important predictor is personal income: as income increases, car sales also increase, so personal income cannot have a negative impact on car sales. Therefore, non-negativity constraints need to be imposed on the corresponding regression coefficients. Nonnegativity constraints are also used in stock index tracking, because the impact of each component stock on the stock index cannot be negative. Another well-known example in which linear constraints are needed is isotonic regression, which has the defining property that if \({x_i} \le {x_j}\), then \({x_i}\beta \le {x_j}\beta \). In many fields of genomic data analysis, much biological knowledge or pathway information is available. This kind of information has been accumulated over years of biological and medical research and is a precious resource supplementary to statistical gene data analysis. More applications of the classo can be found in Gaines et al. (2018) and James et al. (2020). However, from James et al. (2013) we know that the near-oracle performance of classo relies heavily on Gaussian assumptions and a known variance \(\sigma ^2\). In practice, the Gaussian assumption may not hold, and estimating the standard deviation \(\sigma \) is not a trivial problem. Moreover, when heavy-tailed errors or outliers appear in the response, the variance of the errors may be unbounded. In this case, the classo method is no longer applicable.

To deal with these problems, we propose the following \(L_1\) penalized constrained least absolute deviation estimator (hereinafter referred to as pcLAD),

$$\begin{aligned} \arg \mathop {\min }\limits _\beta \sum \limits _{i = 1}^n {\left| {{y_i} - x_i^{\prime }\beta } \right| } + n\sum \limits _{j = 1}^p {{\lambda _j}\left| {{\beta _j}} \right| } \ \text {subject to }{C_1}\beta = {b_1} \ \text {and} \ {C_2}\beta \le {b_2}. \end{aligned}$$
(2)

Least absolute deviation (LAD) methods are an effective alternative to least squares methods since they do not require distributional assumptions on the errors. When heavy-tailed errors or outliers are present, these methods have desirable robustness properties in linear regression models; see, for example, Bassett and Koenker (1978), Huber (1981), and Portnoy and Koenker (1997).

Recently, penalized versions of the LAD method have been studied in several papers, where their variable selection and estimation properties were discussed. When the dimension p of the coefficients is assumed to be fixed, the consistency of the penalized LAD estimator has been proven; see Wang et al. (2007), Lambert-Lacroix and Zwald (2011), and Wu and Liu (2009). When p is high dimensional, Gao and Huang (2010), Belloni and Chernozhukov (2011), and Wang (2013) have established properties of the penalized LAD method under different assumptions. Remarkably, Wang (2013) proposed a clear and practical rule for setting the penalty parameter and a sharp bound on the estimation error. That is,

$$\begin{aligned}&{\lambda _j}= c\sqrt{{{2A(\alpha )\log (p)}}/{n}}, \end{aligned}$$
(3)
$$\begin{aligned}&{\left\| {\hat{\beta }_{pLAD} - \beta } \right\| _2}= O(\sqrt{{{k\log (p)}}/{n}}), \end{aligned}$$
(4)

where \(c>1\) is a constant, \(\alpha \) is a chosen small probability, \(A(\alpha )\) is a constant such that \(2{p^{ - (A(\alpha ) - 1)}} \le \alpha \), and k is the number of nonzero (significant) true coefficients.
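To make the rule (3) concrete, the short R sketch below computes a common penalty level by taking the smallest \(A(\alpha )\) satisfying \(2p^{-(A(\alpha )-1)} \le \alpha \); the function name and the default values of \(\alpha \) and c are illustrative choices, not part of the original proposal.

```r
## Penalty rule (3): the smallest A(alpha) with 2 * p^(-(A - 1)) <= alpha is
## A = 1 + log(2 / alpha) / log(p); plug it into lambda_j = c * sqrt(2 A log(p) / n).
lambda_rule <- function(n, p, alpha = 0.05, c = 1.1) {
  A <- 1 + log(2 / alpha) / log(p)
  c * sqrt(2 * A * log(p) / n)
}
lambda_rule(n = 200, p = 1000)  # a single penalty level shared by all j
```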

In this paper, when p is fixed, we use the \(\lambda _j\) suggested by Wang et al. (2007) and establish an oracle property of the equality constrained pcLAD similar to that of LADlasso. Because of the constraints, the asymptotic variance of pcLAD has an adjustment term compared with that of LADlasso. This adjustment makes the pcLAD estimator more efficient than LADlasso. When p is larger than n, we adopt the form of \(\lambda _j\) in (3) and obtain an \(L_2\) norm error bound for the equality constrained pcLAD estimator,

$$\begin{aligned} {\left\| {\hat{\beta } - \beta } \right\| _2} = O(\sqrt{{{\max (m,k - m)\log (p)}}/{n}} ), \end{aligned}$$
(5)

where m is the number of equality constraints and should be less than k. The pcLAD estimation bound has a form similar to (4), but our bound clearly demonstrates the potential improvement in accuracy that can be derived from adding constraints. We also show that pcLAD selects the significant coefficients with probability close to 1. For the inequality constrained pcLAD, we obtain the same result when some constraints are active at the boundary. It is worth noting that none of the above theoretical results assume a specific error distribution, which gives pcLAD a good fit in the presence of heavy-tailed errors and outliers.

Compared with the least squares method, the LAD method is free of distributional assumptions on the errors. However, it is more difficult to solve due to its nonsmooth loss function. In the computation of the fixed dimensional pcLAD model, a typical approach is to modify the computing method of Wang et al. (2007), which is also used in Gao and Huang (2010) and Wang (2013) when p is larger than n. That is, augment the data with p pseudo-observations \({y_{n + j}} = 0\) and \({x_{n + j,i}} = n\lambda _{j} \times I(i = j)\) for \(i,j = 1,2, \ldots ,p\), where \(I(i = j)\) is the indicator function equal to 1 if \(i=j\) and 0 otherwise. Then our pcLAD estimator can be regarded as an ordinary LAD estimator satisfying some linear constraints, with p unknown coefficients and \(p+n\) observations. Hence it can be solved efficiently by the R package quantreg. More details about this linear programming formulation can be found in Sect. 4.1. However, as Yang et al. (2013), Gu et al. (2017) and Yu and Lin (2017) point out, although LP scales well to data of moderate size, it falls short when dealing with high dimensions. This observation motivates us to consider an efficient method for fitting high dimensional constrained LAD regression. Fortunately, the alternating direction method of multipliers (ADMM) has been shown to handle high dimensional constrained optimization; see, for example, Gaines et al. (2018), Stellato et al. (2018) and so on. Inspired by these works, we propose a nested ADMM to solve pcLAD. In the nested ADMM algorithm, the first update is an unconstrained LAD regression with a combined penalty term, which is solved effectively by ADMM; the second update is a projection onto the affine constraint space; and the third update is the renewal of the dual variables. Since the first update is itself a complete ADMM iteration, we call the algorithm nested ADMM. Although nested ADMM contains an inner iteration and an outer iteration, every step has an explicit solution. Thus, pcLAD can be computed quickly by nested ADMM. The numerical experiments in Sect. 5 confirm this.

Importantly, pcLAD can solve almost all problems to which classo can be applied, such as monotone curve estimation, monotonic order regression estimation, sum-to-zero or sum-to-one estimation, and all problems that can be transformed into the generalized lasso, fused lasso, nonnegative lasso, etc. Furthermore, when the noise in the above problems does not follow a Gaussian distribution, pcLAD is more robust and reliable than classo.

This paper is organized as follows. In Sect. 2, we provide a number of motivating examples which illustrate the wide range of situations where pcLAD is applicable. Section 3 discusses the theoretical properties of pcLAD when p is fixed and when p is high dimensional. The LP and ADMM algorithms are described in detail in Sect. 4. Section 5 compares the performance of these two algorithms in different dimensions and presents three simulation studies showing that pcLAD performs well when classo is unreliable. In Sect. 6, some real data examples show that the pcLAD method performs better than classo in applications. We conclude with a discussion of future extensions of this work in Sect. 7. Technical lemmas and proofs of theorems are given in the appendix.

2 Motivating examples

In this section, we briefly present some applied statistical problems, previously handled by classo, that can also be solved by pcLAD.

2.1 Monotone curve fitting

Consider the problem of fitting a smooth function l(x) to a set of observations \(\{(x_1,y_1),\ldots ,(x_n,y_n)\}\), subject to the constraint that l must be monotone. James et al. (2020) have shown that classo can be applied to monotone curve fitting. Replacing \(g(\beta ) = \sum \nolimits _{i = 1}^n {{{({y_i} - B{{({x_i})}^\prime }\beta )}^2}} \) with \(g(\beta ) = \sum \nolimits _{i = 1}^n {\left| {{y_i} - B{{({x_i})}^\prime }\beta } \right| } \), we need to minimize \(g(\beta ) = \sum \nolimits _{i = 1}^n {\left| {{y_i} - B{{({x_i})}^\prime }\beta } \right| }\) subject to \(C\beta \le {\mathrm{0}}\), where the tth row of C is the derivative \(B^{\prime }(v_{t})\) of the basis functions evaluated at \(v_{t}\), for a fine grid of points \(v_{1},\ldots ,v_{m}\) over the range of x. Enforcing this constraint ensures that the derivative of l is non-positive, so l will be monotone decreasing. Obviously, this model can be addressed using the pcLAD methodology.
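As a rough illustration (not code from the paper), the rows of C can be built by evaluating the basis derivatives on a fine grid. The sketch below uses splines::bs and central finite differences as a stand-in for the analytic derivatives \(B^{\prime }(v_{t})\); make_monotone_constraints is a hypothetical helper.

```r
library(splines)

## Build the design B(x_i)' and an approximate derivative matrix C with rows B'(v_t):
## C %*% beta <= 0 then enforces a monotone decreasing fit, as described above.
make_monotone_constraints <- function(x, df = 10, n_grid = 100, h = 1e-4) {
  basis <- bs(x, df = df, intercept = TRUE)
  v <- seq(min(x) + h, max(x) - h, length.out = n_grid)            # fine grid v_1, ..., v_m
  C <- (predict(basis, v + h) - predict(basis, v - h)) / (2 * h)   # central differences
  list(B = basis, C = C)
}
```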

2.2 Monotonic order estimation

Isotonic regression is a monotonic order estimate studied in many papers, such as Wu et al. (2001), Tibshirani et al. (2011), Gaines et al. (2018), etc. The lasso with a monotonic ordering of the coefficients was referred to by Tibshirani and Suo (2016) as the ordered lasso. Gaines et al. (2018) have shown that both of the above estimates can be solved by classo. Next, we show that monotonic order estimation can also be solved by pcLAD. Consider pcLAD without equality constraints as follows:

$$\begin{aligned} \arg \mathop {{\mathrm{min}}}\limits _\beta \sum \limits _{i = 1}^n {\left| {{y_i} - x_i^{\prime } \beta } \right| } + n\sum \limits _{j = 1}^p {{\lambda _j}\left| {\beta {}_j} \right| } \ \text {subject to } C_{m}\beta \le 0. \end{aligned}$$
(6)

When \(x_{i}^{\prime }\) is the ith row of the identity matrix, \(\lambda _{j}=0\) for every j, and the constraint matrix is

$$\begin{aligned} C_m=\left( {\begin{array}{ccccc} 1&{}{ - 1}&{}{}&{}{}&{}{}\\ {}&{}1&{}{ - 1}&{}{}&{}{}\\ {}&{}{}&{} \ddots &{} \ddots &{}{}\\ {}&{}{}&{}{}&{}1&{}{ - 1} \end{array}} \right) . \end{aligned}$$
(7)

formula (6) becomes the LAD isotonic regression. When \(\lambda _{j}>0\) for every j and the constraint matrix is the same as before, it becomes a monotonic order LADlasso. Indeed, there are many other choices of constraints for monotonic order estimation. For example, \(C_m\) can also be defined as follows.

$$\begin{aligned} C_m=\left( {\begin{array}{lllll} -1&{}{ 1}&{}{}&{}{}&{}{}\\ {}&{}-1&{}{ 1}&{}{}&{}{}\\ {}&{}{}&{} \ddots &{} \ddots &{}{}\\ {}&{}{}&{}{}&{}-1&{}{ 1} \end{array}} \right) . \end{aligned}$$

This \(C_m\) makes the estimated coefficients monotone decreasing. Other options can also be restricted to certain coefficients, such as

$$\begin{aligned} {\beta _1} \le {\beta _3},{\beta _2} \le {\beta _4},{\beta _3} \le {\beta _6}. \end{aligned}$$

Furthermore, we can also obtain the LAD version of the ordered lasso (Tibshirani and Suo 2016) by rewriting \(\beta \) as \(\beta ^{+}-\beta ^{-}\) and imposing a monotonic order on \(\beta ^{+}\) and \(\beta ^{-}\).
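The two difference matrices above are simple to construct; a minimal R sketch (with illustrative function names) is:

```r
## C_m of (7): row j encodes beta_j - beta_{j+1} <= 0, i.e. a monotone increasing order.
diff_mat_inc <- function(p) {
  C <- matrix(0, p - 1, p)
  for (j in seq_len(p - 1)) { C[j, j] <- 1; C[j, j + 1] <- -1 }
  C
}
## The sign-flipped version enforces a monotone decreasing order instead.
diff_mat_dec <- function(p) -diff_mat_inc(p)
```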

2.3 Generalized LADlasso and LAD fused lasso

Gaines et al. (2018) and James et al. (2020) have proved that the generalized lasso (Tibshirani and Taylor 2011) can be transformed into classo. We consider the following generalized LADlasso problem:

$$\begin{aligned} \arg \mathop {\min }\limits _\beta \left\| {y - X\beta } \right\| _1 + n\lambda \left\| {D\beta } \right\| _1, \end{aligned}$$
(8)

where \(D \in {R^{r \times p}}\), \(\text {rank}(D) = r\).

The following lemma shows that the generalized LADlasso can also be transformed into pcLAD.

Lemma 1

If \(r \le p\), (8) can be converted to the classical LADlasso problem. If \(r>p\) and \(\text {rank}(D)=p\), then there exist matrices C, F, and \(\tilde{X}\) such that, for all values of \(\lambda \), the solution to (8) is equal to \(\beta = F\theta \), where \(\theta \) is given by:

$$\begin{aligned} \arg \mathop {\min }\limits _\theta \left\| {y - \tilde{X}\theta } \right\| _1 + n\lambda \left\| \theta \right\| _1 \ \text {subject to } C\theta = 0. \end{aligned}$$
(9)

The proof of Lemma 1 is provided in Appendix A. Hence, any problem that falls into the generalized LADlasso framework can be solved by pcLAD.

The LAD fused lasso is the LAD version of the fused lasso (Tibshirani et al. 2005); it is defined as the solution to

$$\begin{aligned} \arg \mathop {\min }\limits _\beta \left\| {y - X\beta } \right\| _1 + n\sum \limits _{j = 1}^p {{\lambda _j}\left| {{\beta _j}} \right| } + n\sum \limits _{j = 2}^p {{\gamma _j}} \left| {{\beta _j} - {\beta _{j - 1}}} \right| . \end{aligned}$$
(10)

It is easy to see that the LAD fused lasso is a special case of the generalized LADlasso (8) with penalty matrix D given by

$$\begin{aligned}\left( \begin{array}{l} -{C_m}\\ {I_p} \end{array} \right) \in {R^{(2p - 1) \times p}}, \end{aligned}$$

where \(I_p\) is the \(p \times p\) identity matrix.
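For reference, this penalty matrix can be assembled from the difference matrix of Sect. 2.2; the two-line sketch below reuses the illustrative helper diff_mat_inc defined there.

```r
## D of (10) viewed as a generalized LADlasso (8): a (2p - 1) x p matrix.
fused_D <- function(p) rbind(-diff_mat_inc(p), diag(p))
```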

The fused LADlasso encourages blocks of adjacent estimated coefficients to all have the same value. This type of structure often makes sense in situations where there is a natural ordering of the coefficients. Similar to James et al. (2013), if the data have a two-dimensional ordering, as in image reconstruction, this idea can be extended to the 2d fused LADlasso

$$\begin{aligned}&\arg \mathop {\min }\limits _\beta \left\| {y - X\beta } \right\| _1 + n\left( \sum \limits _{j,j'} {{\lambda _{j,j'}}\left| {{\beta _{j,j'}}} \right| } + \sum \limits _{j \ne j'} {{\gamma _{j,j'}}\left| {{\beta _{j,j'}} - {\beta _{j,j' - 1}}} \right| } \right. \\&\qquad \quad +\left. \sum \limits _{j \ne j'} {{\eta _{j,j'}}\left| {{\beta _{j,j'}} - {\beta _{j - 1,j'}}} \right| } \right) \end{aligned}$$

2.4 Nonnegative sparse estimation

The most common nonnegative sparse estimator is the nonnegative lasso, which has appeared in many papers. First mentioned in the seminal work of Efron et al. (2004), the positive lasso requires the lasso coefficients to be nonnegative. This variant of the lasso has seen applications in areas such as vaccine design (Hu et al. 2015a), nuclear material detection (Kump et al. 2012), document classification (El-Arini et al. 2013), and portfolio management (Wu et al. 2014). Many other nonnegative sparse estimators have been proposed, such as Yang and Wu (2016), Wu and Yang (2014), Mandal and Ma (2016), Li et al. (2019), Xie and Yang (2019), Li and Yang (2019), etc. However, the LAD version of the nonnegative lasso has not appeared in the literature. From the discussion in Wang (2013), we know that the LAD method can handle a wide range of non-Gaussian observations, so it is natural to propose some nonnegative sparse LAD methods.

The first nonnegative sparse LAD method proposed in this section is the nonnegative LADlasso:

$$\begin{aligned} \arg \mathop {\min }\limits _{\beta \ge 0} \left\| {y - X\beta } \right\| _1 + n\sum \limits _{j = 1}^p {{\lambda _j}\left| {{\beta _j}} \right| }. \end{aligned}$$
(11)

Obviously, the nonnegative LADlasso is pcLAD (2) with no equality constraints, \(C_{2}=-I_{p}\), and \(b_{2}=0_{p}\).

The other is the nonnegative fused LADlasso:

$$\begin{aligned} \arg \mathop {\min }\limits _{\beta \ge 0} \left\| {y - X\beta } \right\| _1 + n\sum \limits _{j = 1}^p {{\lambda _j}\left| {{\beta _j}} \right| } + n\sum \limits _{j = 2}^p {{\gamma _j}} \left| {{\beta _j} - {\beta _{j - 1}}} \right| . \end{aligned}$$
(12)

As discussed in Sect. 2.3, the nonnegative fused LADlasso can be transformed into a nonnegative generalized LADlasso, which is a special case of pcLAD.

Although only four examples are listed, there are still many statistical problems that can be solved by pcLAD, such as sum-to-zero regression, sum-to-one regression, the relaxed lasso, sign-constrained least squares regression, etc. More details about these methods can be found in Shi et al. (2016), Meinshausen (2007), and Meinshausen (2013).

3 Statistical properties

In this section, we first discuss the statistical properties of pcLAD with equality constraints when the dimension p is fixed, and then when it is much larger than the sample size n. As discussed in Sect. 1, we use the adaptive \(L_1\) penalty (Wang et al. 2007; Zou 2006), which is more general than the \(L_1\) penalty (Tibshirani 1996). In the high dimensional setting, the calculation of the adaptive \(L_1\) penalty parameters is complicated and time-consuming, so we adopt the ordinary \(L_1\) penalty parameter suggested by Wang (2013). We assume that \(y \in {R^n}\) is generated from:

$$\begin{aligned} {y_i} = x_i^{\prime }\beta + {\varepsilon _i}, i = 1,2, \ldots ,n. \end{aligned}$$
(13)

where \({x_i} = {({x_{i1}},{x_{i2}}, \ldots ,{x_{ip}})^\prime }\), \(\beta \in {R^p}\), and the errors in \(\varepsilon = {({\varepsilon _1},{\varepsilon _2}, \ldots ,{\varepsilon _n})^\prime }\) are i.i.d. median-zero random variables. Let \(\hat{\beta } \) denote a solution of pcLAD defined by:

$$\begin{aligned} \hat{\beta } = \arg \mathop {\min }\limits _\beta Q(\beta ) \ \text {subject to } C\beta = b, \end{aligned}$$
(14)

where \(Q(\beta ) = \sum \limits _{i = 1}^n {\left| {{y_i} - x_i^{\prime }\beta } \right| } + \sum \limits _{j = 1}^p {{\lambda _j}\left| {{\beta _j}} \right| }\). If the solution of (14) is not unique, we take \(\hat{\beta }\) to be any optimal solution; our statistical properties hold for all such solutions.

3.1 p is fixed

For convenience, we decompose the true regression coefficient as \(\beta _{0} = {(\beta _{0A}^{\prime },\beta _{0B}^{\prime })^\prime }\), where \({\beta _{0A}} = ({\beta _{01}},{\beta _{02}}, \ldots ,{\beta _{0k}})\) are the k true significant coefficients and \({\beta _{0B}} = ({\beta _{0(k + 1)}},{\beta _{0(k + 2)}}, \ldots ,{\beta _{0p}})\) are the \(p-k\) true insignificant coefficients. Moreover, assume that \({\beta _{0j}} \ne 0\) for \(1 \le j \le k\) and \({\beta _{0j}} = 0\) for \(k < j \le p\). The corresponding pcLAD estimator is denoted \(\hat{\beta } = {(\hat{\beta } _A^{\prime },\hat{\beta } _B^{\prime })^{\prime }}\). We also decompose the covariate \({x_i} = {(x_{iA}^{\prime },x_{iB}^{\prime })^{\prime }}\) with \({x_{iA}} = {({x_{i1}},{x_{i2}}, \ldots ,{x_{ik}})^\prime }\) and \({x_{iB}} = {({x_{i(k + 1)}},{x_{i(k + 2)}}, \ldots ,{x_{ip}})^\prime }\). In addition, the constraint matrix C can be rewritten as \(C=(C_A,C_B)\). To study the theoretical properties of pcLAD in fixed dimension, the following technical assumptions are needed:

Assumption 1

The error \(\varepsilon _i\) has a continuous density that is positive at the origin, that is, \(f(0)>0\).

Assumption 2

The design vectors \(x_{i}\), \(i=1,2,\ldots ,n\), satisfy \({\sum \limits _{i = 1}^n {{x_i}x_i^\prime } }/{n} \rightarrow \varSigma \) as \(n \rightarrow \infty \). Denote the top-left \(k \times k\) submatrix of \(\varSigma \) by \(\varSigma _{11}\) and the bottom-right \((p-k) \times (p-k)\) submatrix by \(\varSigma _{22}\).

Assumption 3

There is a nonsingular \(m \times m\) submatrix of \({C_A}\), denoted \({C_{A_1}}\), whose index set \(A_1\) has size equal to the row rank of C, that is, m.

Note that Assumptions 1 and 2 are typical technical assumptions used extensively in sparse estimation in fixed dimension, for example in Fan and Li (2001), Wang et al. (2007), and Wu and Liu (2009). Assumption 3 is required in classo (James et al. 2013).

Furthermore, define \({a_n} = \max \{{\lambda _j},1 \le j \le k\}\) and \({b_n} = \min \{{\lambda _j},k < j \le p\}\), where \(\lambda _j\) is a function of n. Based on this notation, the consistency of the pcLAD estimator can first be established.

Lemma 2

(Consistency) Consider a sample \(\{({{x}_{i}},{{y}_{i}}),i=1,2,\ldots ,n\}\) from model (13) satisfying Assumptions 1 and 2 with i.i.d. \(\varepsilon _{i}\)'s. If \(\sqrt{n}{{a}_{n}}\rightarrow 0\) as \(n\rightarrow \infty \), then there exists a pcLAD estimator \(\hat{\beta }\) such that \({{\left\| \hat{\beta }-{{\beta }_{0}} \right\| }_{2}}={{O}_{P}}({{n}^{-\frac{1}{2}}})\).

\(\sqrt{n}\)-consistency is a common property of constrained LAD estimators; see Wang (1995), Geyer (1994), Silvapulle and Sen (2005), and Parker (2019). Lemma 2 shows that the linearly constrained LADlasso also enjoys this nice property. Under some further conditions, the sparsity property of the pcLAD estimator can be obtained, as in Lemma 3.

Lemma 3

(Sparsity) Consider a sample \(\{({{x}_{i}},{{y}_{i}}),i=1,2,\ldots ,n\}\) from model (13) satisfying Assumptions 1 and 2 with i.i.d. \(\varepsilon _{i}\)'s, and suppose \(\sqrt{n}{{b}_{n}}\rightarrow \infty \) as \(n\rightarrow \infty \). Then, for any given \({{\beta }}\) satisfying \({{\left\| {{\beta }_{A}}-{{\beta }_{A0}} \right\| }_{2}}={{O}_{P}}({{n}^{-\frac{1}{2}}})\) and \(C({{\beta }}-{{\beta }_{0}})=0\), with probability tending to 1, for any constant \(R>0\), \(Q({{(\beta _{A}^{\prime },{{0}^{\prime }})}^{\prime }})=\underset{{{\left\| {{\beta }_{B}} \right\| }_{2}}\le R{{n}^{-1/2}}}{\mathop {\min }}\,Q({{(\beta _{A}^{\prime },\beta _{B}^{\prime })}^{\prime }})\).

Lemmas 2 and 3 are common results for many sparse estimators in fixed dimension. They are certainly nice properties, but they do not reflect the influence of the constraints. Our next theorem shows the influence of the constraints and illustrates that the pcLAD estimator also enjoys the popular asymptotic oracle property.

Theorem 1

(Oracle) Consider a sample \(\{({{x}_{i}},{{y}_{i}}),i=1,2,\ldots ,n\}\) from model (13) satisfying Assumptions 1, 2 and 3 with i.i.d. \(\varepsilon _{i}\)'s. If \(\sqrt{n}{{a}_{n}}\rightarrow 0\) and \(\sqrt{n}{{b}_{n}}\rightarrow \infty \) as \(n\rightarrow \infty \), then, with probability tending to one, the consistent pcLAD estimator \(\hat{\beta }={{(\hat{\beta }_{A}^{\prime },\hat{\beta }_{B}^{\prime })}^{\prime }}\) in Lemma 2 must satisfy:

  • \(\langle a\rangle \) Sparsity: \({{\hat{\beta }}_{B}}=0\).

  • \(\langle b\rangle \) Asymptotic normality: \(\sqrt{n}({{\hat{\beta }}_{A}}-{{\beta }_{A0}})\xrightarrow {L}N(0,\frac{\varSigma _{11}^{-1}}{4{{f}^{2}}(0)}(I-V)^{\prime }(I-V))\), where \({\varSigma _{11}}\) is defined in Assumption 2, \(V=C_{A_1}^{\prime }{{({{C_{A_1}}}\varSigma _{11}^{-1}C_{A_1}^{\prime })}^{-1}}{{C_{A_1}}}\varSigma _{11}^{-1}\) and \(\xrightarrow {L}\) represents convergence in distribution.

The detailed proofs of Lemmas 2 and 3 and Theorem 1 can be found in Appendix B.

Remark 1

Theorem 1 is a novel result which reflects the influence of the constraints on the asymptotic distribution of the estimates of the significant coefficients. It is easy to see that the asymptotic variance of the pcLAD estimation error is numerically smaller than that of LADlasso. This result is not surprising because prior information about the model has been used.

3.2 p is high dimensional

Since we use the ordinary \(L_1\) penalty, the high dimensional pcLAD can be rewritten as:

$$\begin{aligned} \hat{\beta } = \arg \mathop {\min }\limits _\beta \left\| {y - X\beta } \right\| _1 + n\lambda \left\| \beta \right\| _1 \ \text {subject to } C\beta = b, \end{aligned}$$
(15)

where C has full row rank and \(\text {rank}(C)=m\).

We adopt Assumptions 1 and 3 and the notation of Sect. 3.1, and further decompose the constraint matrix \(C_A\) as \(\left( {{C_{A_1}},{C_{A_2}}} \right) \). In practical applications, the equality constraint coefficients corresponding to the insignificant coefficients do not play a role and are usually set to 0, that is, \(C_B=\mathbf{0} \). This setting is not only in line with the actual situation, but also necessary. If the constraint submatrix corresponding to the insignificant coefficients satisfied \(C_B \ne \mathbf{0} \), then some insignificant estimated coefficient could be linearly expressed in terms of the significant estimated coefficients, which would cause the estimate of this insignificant coefficient to become nonzero. In this way, the variable selection accuracy of the pcLAD model would be greatly reduced. A naive way to avoid this is to increase the penalty parameter so that the insignificant coefficients (whose estimates are usually not too large) are completely shrunk to 0, but this also biases the significant estimated coefficients. Thus, if one wants to apply the \(\lambda \) suggested by Wang (2013) to pcLAD, \(C_B=\mathbf{0} \) is indispensable.

It is worth noting that there is a special case that does not meet the above setting, namely the relaxed lasso constraint (Meinshausen 2007). This constraint is defined as follows,

$$\begin{aligned} {\beta _M} = \mathbf{0} ,\,\, \text {where} \,M \subseteq B. \end{aligned}$$
(16)

Because \({C_A} = \mathbf{0} \) in this setting, the insignificant estimated coefficients are not linearly expressed by the significant estimated coefficients. In order to include this special case in pcLAD, when the constraint \({\beta _M} = \mathbf{0} \) is present, the requirement \(C_B = \mathbf{0} \) can be replaced by \(C_{B/M} = \mathbf{0} \).

Decompose \({\beta ^{\prime }} = ( \beta _{{A_1}}^{\prime }, \beta _{{A_2}}^{\prime }, \beta _{{M}}^{\prime }, \beta _{{B/M}}^{\prime })\). Using \({\beta _M} = \mathbf{0} \) and \(C_{B/M} = \mathbf{0} \), we obtain the equation:

$$\begin{aligned} {C_{A_1}}{\beta _{A_1}} + {C_{A_2}}{\beta _{A_2}} = b. \end{aligned}$$
(17)

In order to prove the near oracle property of the pcLAD estimator \(\hat{\beta }\), we need a lemma on the estimation error \(h = {\beta _0} - \hat{\beta } \). In what follows, the vector \(h_{A}\) is defined as follows: if the index i is in the index set A, the ith element of \(h_{A}\) is the same as that of h; otherwise, the ith element of \(h_{A}\) is 0. By (17) and Assumption 3, we get \({h_{A_1}} = - {\left( {{C_{A_1}}} \right) ^{ - 1}}({C_{A_2}}{h_{A_2}})\), where \(C_{A_1}\) is an \(m \times m\) matrix and \(C_{A_2}\) is an \(m \times (k-m)\) matrix. In high dimensional statistics, as described by Bühlmann and van de Geer (2011), it is generally required that \(k\log (p) \ll n\). As Gu and Zou (2020) point out, in the lasso framework the dimension of LAD estimation can reach the order \({e^{{n^\pi }}}\), where \(0<\pi <1\). Since p can be this large, we assume \(k < \infty \), that is, \(k = O(1)\), which always satisfies \(k\log (p) \ll n\). Combining this with \(m < k\), there is always a constant \(\varPhi >0 \) such that \({\left\| {{h_{A_1}}} \right\| _1} \le \varPhi {\left\| {{h_{A_2}}} \right\| _1}\) and \({\left\| {{h_{A_1}}} \right\| _2} \le \varPhi {\left\| {{h_{A_2}}} \right\| _2}\). We can then state a lemma about the cone constraint on h.

Lemma 4

Suppose \({\lambda }=c\sqrt{{{2A(\alpha )\log (p)}}/{n}}\), let \({\varDelta _{\bar{c}}} = \left\{ {\delta \in {R^p}:{{\left\| {{\delta _{A_2}}} \right\| }_1} \ge \frac{{\bar{c}}}{{1 + \varPhi }}{{\left\| {{\delta _B}} \right\| }_1}} \right\} \). Then \(h \in {\varDelta _{\bar{c}}}\), where \(\bar{c} = \frac{{(c - 1)}}{{(c + 1)}}\).

The proof of Lemma 4 can be found in Appendix A. This cone constraint is extremely important for high dimensional estimation error bounds; it appears in the classical lasso, the square root lasso, the LADlasso (quantile lasso) and the constrained lasso, see for example Bickel et al. (2009), Wang (2013), Belloni et al. (2011), James et al. (2013), and Gu and Zou (2020).

Now we introduce some restricted eigenvalue concepts for the design matrix X, based on the \(L_2\) norm, to prepare for the analysis of the near oracle property of the pcLAD estimator. Let \(\lambda _{k}^{u}\) be the smallest number such that for any k-sparse vector d:

$$\begin{aligned} \left\| {Xd} \right\| _2^2 \le n\lambda _k^u\left\| d \right\| _2^2; \end{aligned}$$
(18)

also let \(\lambda _{k}^{l}\) be the largest number such that for any k-sparse vector d:

$$\begin{aligned} \left\| {Xd} \right\| _2^2 \ge n\lambda _k^l\left\| d \right\| _2^2. \end{aligned}$$
(19)

Let \({\theta _{k_1}^{k_2}}\) be the smallest number such that, for any \(k_1\)-sparse vector \(c_1\) and \(k_2\)-sparse vector \(c_2\) with disjoint supports,

$$\begin{aligned} \left| {\left( {X{c_1},X{c_2}} \right) } \right| \le n{\theta _{k_1}^{k_2}}{\left\| {{c_1}} \right\| _2}{\left\| {{c_2}} \right\| _2}. \end{aligned}$$
(20)

The above quantities \(\lambda _{k}^{u}\) and \({\theta _{k_1}^{k_2}}\) are related to the sparse recovery conditions in compressed sensing (CS); see Wang (2013) for more details. For the pcLAD model, we define a restricted eigenvalue of the design matrix X based on the \(L_1\) norm as:

$$\begin{aligned} k_k^l(\bar{c}) = \mathop {\min }\limits _{h \in {\varDelta _{\bar{c}}}} \frac{{{{\left\| {Xh} \right\| }_1}}}{{n{{\left\| {{h_A}} \right\| }_2}}}. \end{aligned}$$
(21)

To simplify the notations, we will simply write \(k_k^l(\bar{c})\) as \(k_k^l\).

In order to formulate our main result, we also need the following condition:

$$\begin{aligned} \frac{3}{{16}}\sqrt{n} k_k^l > \lambda \left( {1 + \varPhi } \right) \sqrt{\frac{{{k-m}}}{n}} + {c_1}\sqrt{2\max (m,{k-m})\log (p)} \left( {\frac{5}{4} + \frac{{(\bar{c} + 1)(\varPhi + 1)}}{{\bar{c}}}} \right) , \end{aligned}$$
(22)

for some constant \(c_1\) such that \({c_1} > 1 + 2\sqrt{\lambda _k^u} \). This condition obviously holds as \(n \rightarrow \infty \). We then have the following theorem.

Theorem 2

Consider the pcLAD model and assume \(\varepsilon _1,\varepsilon _2, \dots , \varepsilon _n\) are \(\text {i.i.d.}\) random variables satisfying Assumption 1. Suppose Assumption 3, \(\lambda _k^l>\theta _{k}^{k}(\frac{1+\varPhi }{\bar{c}})\) and (22) hold. Then the pcLAD estimator \(\hat{\beta }\) satisfies, with probability at least \(1-2p^{-4\min (k-m,m)(c_2^2-1)+1}\),

$$\begin{aligned} {\left\| {\hat{\beta } - \beta _0 } \right\| _2}&\le \sqrt{\frac{{2\max (m,{k-m})\log (p)}}{n}} \frac{{16\left\{ {\sqrt{2} c(1 + \varPhi ) + {c_1}[\frac{5}{4} + \frac{{(\bar{c} + 1)(\varPhi + 1)}}{{\bar{c}}}]} \right\} }}{{a \eta _k^l}}\\&\quad \sqrt{1 + \frac{1}{{\bar{c}}} + \varPhi }, \end{aligned}$$

where \(c_1=1+2c_2\sqrt{\lambda _k^u}\) and \(c_2>1\) is a constant, \(\eta _k^l = \frac{{{{[\lambda _k^l - \theta _k^k(\frac{1+\varPhi }{{\bar{c}}})]}^2}}}{{\lambda _k^u(1 + \varPhi )}},\lambda = 2c\sqrt{n\log (p)}\).

From the theorem we can easily see that asymptotically, with high probability,

$$\begin{aligned} {\left\| {\hat{\beta } - \beta _0} \right\| _2} = O\left( \sqrt{{{\max (k - m,m)\log (p)}}/{n}} \right) . \end{aligned}$$
(23)

Since \(k\log (p) \ll n\), the pcLAD estimator has near oracle performance. Moreover, the bound on the pcLAD estimation error decays faster with increasing n than (4); that is, equality constraints can improve the accuracy of estimation. This is also verified in the simulation study of Sect. 5.

Unlike the \(L_2\) bound of LADlasso in (4), which depends on \(\sqrt{k}\), the \(L_2\) bound of pcLAD depends on \(\sqrt{m}\) and \(\sqrt{k-m}\). The rate \(\sqrt{{(k-m)\log (p)}/{n}}\) follows from the fact that (17) implies

$$\begin{aligned} {\beta _{A_1}} = {({C_{A_1}})^{ - 1}}({b} - {C_{A_2}}{\beta _{A_2}}). \end{aligned}$$
(24)

Hence, the m coordinates of \(\beta _{A_1}\) are completely determined by the remaining \((p-m)\) coordinates, and the problem of estimating k significant coefficients can be regarded as the problem of estimating \((k-m)\) significant coefficients. Note that the bound in Theorem 2 also depends on \(\sqrt{m}\). In fact, this term reflects the error due to model selection. To see this, note that when \(m=k\), (24) implies that we can exactly recover \(\beta _{A}\), but only if we know the locations of the nonzero entries. There are \(\left( \begin{array}{l} p\\ m \end{array} \right) \sim {p^m}\) possible locations of the nonzero entries. Since the number of possible locations of the nonzero coefficients depends on m, the bound on the estimation error is also related to m.

Next, we explain another reason why we need \(k = O(1)\): there are \(p^{k}\) hypotheses for the locations of the nonzero entries, and information theoretic arguments show that, even with \(m=k\) constraints, we still need n to be at least of order \(\log \left( \begin{array}{l} p\\ k \end{array} \right) \sim k\log (p)\) to identify the correct hypothesis. In fact, in most cases the number of equality constraints m does not equal the number of significant coefficients k. So we relax the requirement \(k\log (p) \ll n\) and only require \(k = O(1)\) to make sure the correct hypothesis can be identified.

A simple consequence of Theorem 2 is that the pcLAD estimator selects most of the significant variables with high probability. We have the following theorem.

Theorem 3

Let \(\hat{T}=\text {supp}(\hat{\beta })\) be the estimated support of the coefficients; in other words, \(\hat{T}\) is the index set of the significant coefficient estimates. Then, under the same conditions as in Theorem 2, with probability at least \(1-2p^{-4\min (k-m,m)(c_2^2-1)+1}\),

$$\begin{aligned} \left\{ {i:\left| {{\beta _i}} \right| \ge \sqrt{\frac{{2\max (m,{k-m})\log (p)}}{n}} \frac{{16\left\{ {\sqrt{2} c(1 + \varPhi ) + {c_1}[\frac{5}{4} + \frac{{(\bar{c} + 1)(\varPhi + 1)}}{{\bar{c}}}]} \right\} }}{{\alpha \eta _k^l}}} \right\} \subseteq \hat{T}, \end{aligned}$$

where \(c_1=1+2c_2\sqrt{\lambda _k^u}\) and \(c_2>1\) is a constant, \(\eta _k^l = \frac{{{{[\lambda _k^l - \theta _k^k(\frac{1+\varPhi }{{\bar{c}}})]}^2}}}{{\lambda _k^u(1 + \varPhi )}},\lambda = 2c\sqrt{n\log (p)}\).

This theorem shows that the pcLAD method selects a model containing all the variables with large coefficients. If, in model (15), all the nonzero coefficients are large enough in absolute value, then the pcLAD method selects all of them into the model.

3.3 Discussion on inequality constraints

In the previous sections we concentrated on results for the equality constrained pcLAD. Next, we briefly discuss the theoretical results for the inequality constrained pcLAD. When p is fixed, the results in the inequality setting are the same as those with equality constraints. It is easy to see that if \(\beta _0\) lies inside the constraint region, that is \(C\beta _0 < b\), then pcLAD and LADlasso should give the same result because the constraints play little role in the regression. However, if \(\beta _0\) is on the constraint boundary, then pcLAD should offer the same improvements as the equality constrained LADlasso. This method of analyzing inequality constrained regression in fixed dimension has been used in many papers, such as Liew (1976), Wang (1995), Wang (1996) and so on. Specifically, take the nonnegativity constraint \(\beta \ge \mathbf{0} \) as an example. Under the assumptions and settings of this paper, the true significant and insignificant coefficients satisfy \(\beta _{0A}>\mathbf{0} ,\beta _{0B}=\mathbf{0} \). In the theoretical analysis of the asymptotic properties, we only need to consider the equality constraints with \(C_{B}=\mathbf{I} _{p-k}\). Thus the constraint matrix C is composed of 0 and \(C_B\), the constraint vector b is 0, and this constraint does not affect the proofs of Lemmas 2 and 3, since it amounts to \(\beta _{B}=0\), which coincides with the sparsity result of Lemma 3. So \(C_{B}=\mathbf{I} _{p-k}\) does not change the result of Theorem 1, and the nonnegativity constrained LAD also enjoys the oracle property, just as the unconstrained LADlasso does.

Other examples highlighted in this paper are monotonic order LAD estimation and the fused and generalized LADlasso. For the monotonic order constraint in this paper, the insignificant coefficients satisfy \(\beta _{B}=0\) and the constraint matrix \(C_{A}\) on the significant coefficients is defined as in (7). If none of the constraints in \(C_{A}\beta _A \le b_{A}\) holds with equality at \(\beta _0\), the monotonic order constrained LADlasso gives the same result as the unconstrained LADlasso. If some of them hold with equality, the monotonic order constrained LADlasso enjoys the asymptotic normality with equality constraints as in (b) of Theorem 1. Nevertheless, for the fused and generalized LADlasso, we cannot assert this conclusion under the assumptions of this paper. The reason is that, in the proof of Lemma 1, \(\tilde{X}\) has different augmented forms under different dimensional settings, and in fixed and high dimensions \(\tilde{X}\) may not satisfy the assumptions of this paper. How to endow the fused and generalized LADlasso with the oracle property is a topic for further research requiring more technical assumptions and proof techniques.

When p is high dimensional, the \(L_2\) bound of pcLAD in the inequality setting is more complicated. To our knowledge, there are two methods to analyze it. One is similar to the fixed dimension method, which was used in He (2011). Following this idea, for the inequality constraints \({C_1}\beta \le {b_1}\), we partition \(C_1\) and \(b_1\) into block matrices as

$$\begin{aligned} {C_1}\beta _0 = \left( {\begin{array}{cc} {{C_{11}}}&{}{{C_{12}}}\\ {{C_{13}}}&{}{{C_{14}}} \end{array}} \right) \left( {\begin{array}{c} {{\beta _{0A}}}\\ {{\beta _{0B}}} \end{array}} \right) \,\, \, \text {and} \,\, \, {b_1} = \left( {\begin{array}{c} {{b_{11}}}\\ {{b_{12}}} \end{array}} \right) , \end{aligned}$$

such that \({C_{11}}{\beta _{0A}} = {b_{11}}\) and \({C_{13}}{\beta _{0A}} < {b_{12}}\). Note that \(\beta _0\) satisfies some of the constraints in \(C_{1}\beta \le b_{1}\) at the boundary (i.e., \({C_{11}}{\beta _{0A}} = {b_{11}}\)) and the others in the interior (\({C_{13}}{\beta _{0A}} < {b_{12}}\)). Moreover, if the equality constraints \({C_2}\beta = {b_2}\) exist, partition them into block matrices as

$$\begin{aligned} {C_2}\beta = \left( {\begin{array}{cc} {{C_{21}}}&{{C_{22}}} \end{array}} \right) \left( {\begin{array}{c} {{\beta _A}}\\ {{\beta _B}} \end{array}} \right) \,\,\, \text {and} \,\,\, {b_2} = {b_2}. \end{aligned}$$

Then, we can reconstruct equality constraints

$$\begin{aligned} G = \left( {\begin{array}{c} {{C_{11}}}\\ {{C_{21}}} \end{array}} \right) \,\,\, \text {and} \,\,\,g = \left( {\begin{array}{c} {{b_{11}}}\\ {{b_2}} \end{array}} \right) . \end{aligned}$$

Thus, \(G\beta = g\) is a new set of equality constraints that can be brought into the theoretical analysis. Here, we again take the nonnegativity constraint \(\beta \ge \mathbf{0} \) as an example. The constraints on the coefficients in high dimension are the same as in the fixed dimension, and the nonnegativity constraints reduce to a relaxed lasso constraint like (16). Since \(\beta _{M}=0\) does not affect the proof, the \(L_2\) bound is \(\sqrt{{k\log (p)}/{n}}\), which is the same as the result of Wang (2013). For monotonic order LAD estimation in high dimension, the \(L_2\) bound is \(\sqrt{{\max (m,k-m)\log (p)}/{n}}\), where m is \(\sum \limits _{i \ne j} {I({\beta _{0i}} = {\beta _{0j}})}\). The case of the fused and generalized LADlasso in high dimension has been discussed briefly above, so we omit it here.

Another method for analyzing inequality constrained regression is to add slack variables. As discussed by James et al. (2013), we can change inequality constraints into equality constraints by adding slack variables. However, when an added slack variable is close to 0 but not exactly 0, the coefficient sparsity assumption may no longer hold. Although Negahban et al. (2010) have discussed the lasso in this case and proved that another sparse vector can be used to approximate the not-exactly-sparse estimate, extending this to pcLAD is far beyond the scope of our paper.

4 The implementation of pcLAD

Computing \(L_1\) penalized LAD regression is a nontrivial task due to the nonsmoothness of both the LAD loss function and the \(L_1\) penalty term. Fortunately, this task has received a lot of attention and many methods have been presented for solving penalized LAD regression, such as a linear program solved by the interior point method (Koenker and Ng 2005), a solution path algorithm (Li and Zhu 2008), greedy coordinate descent algorithms (Wu and Lange 2008; Peng and Wang 2015), pADMM and scdADMM (Gu et al. 2017), QPADMM (Yu and Lin 2017), QPADMM-slack (Fan et al. 2020), and so on.

However, the linear constraints mean that none of the above methods can be directly applied to pcLAD, and several of them fail completely since the optimal solution of pcLAD is restricted to an affine set. In particular, the greedy coordinate descent algorithm for LADlasso does not work, and none of the ADMM algorithms mentioned above can be used directly. Recently, inspired by Li and Zhu (2008), Liu et al. (2020) proposed a solution path algorithm for solving generalized \(L_1\) penalized quantile regression with linear constraints. This algorithm exploits the piecewise linearity of the \(L_1\) penalized quantile regression solution path to obtain the entire solution path by solving a series of linear programming problems. It is worth noting that, compared with the approach of Li and Zhu (2008), it does not require X to have full column rank and allows more than one event to occur at a transition point. These improvements make the algorithm usable in high dimensional settings. If one wants the entire solution path, the algorithm proposed by Liu et al. (2020) is a good choice. But this algorithm also has some limitations. One is that it must compute an entire solution path over a sequence of \(\lambda \) values in \((0, + \infty )\) and then choose the best \(\lambda \) by some criterion; for optimization problems in which the specific value of \(\lambda \) can be determined in advance, this is not cost-effective. The other is that, in computing the complete solution path, every \(\lambda \) in the sequence requires two constrained linear optimizations. Unlike the path algorithm of classo (Gaines et al. 2018), it has no explicit solution depending only on the active set. When p and n are large, it incurs an expensive computational cost.

Following the discussion in Sect. 3, based on Wang et al. (2007) and Wang (2013) we determine specific values of \(\lambda \) for pcLAD in fixed and high dimensions, respectively. Although the algorithm proposed by Liu et al. (2020) can solve pcLAD, its computational burden is large. In this section, we propose some efficient algorithms for solving pcLAD at a given \(\lambda \).

4.1 Linear programming

A typical approach to solving LAD regression is to cast it as a linear program and then solve the linear program with an interior point method, so the first algorithm we present for pcLAD is a linear program. This approach has also been applied to penalized LAD regression in fixed and high dimensions, for example in Wang et al. (2007), Wu and Liu (2009), Gao and Huang (2010), Wang (2013), etc. The popular R package quantreg is based on an interior point method which can solve (penalized) LAD regression (Portnoy and Koenker 1997). The LAD regression problem is equivalent to the linear program,

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _\beta \, 1_n^{\mathrm{T}}u + 1_n^{\mathrm{T}}v\\ s.t.\,\, \,\, \, \left\{ \begin{array}{l} u - v + X\beta = y\\ u,v \in R{}_ + ^n,\beta \in {R^p}. \end{array} \right. \end{array} \end{aligned}$$
(25)

Problem (25) is often solved with the interior point method in its dual domain,

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _d \, - {y^{\mathrm{T}}}d\\ s.t.\,\, \,\, \, \left\{ \begin{array}{l} {X^{\mathrm{T}}}d = 0\\ d \in {\left[ { - 1/2,1/2} \right] ^n}. \end{array} \right. \end{array} \end{aligned}$$
(26)

To apply this method to penalized LAD regression, one simply augments X and y. Penalized LAD regression can then be computed from the following dual problem,

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _d \, - {{\tilde{y}}^{\mathrm{T}}}d\\ s.t.\,\, \,\, \, \left\{ \begin{array}{l} {{\tilde{X}}^{\mathrm{T}}}d = 0\\ d \in {\left[ { - 1/2,1/2} \right] ^{n + p}}, \end{array} \right. \end{array} \end{aligned}$$
(27)

where \(\tilde{y}=(y^T,{0_p}^{T})^{T}\) and \(\tilde{X}=[X^{T},\text {diag}(\lambda )]^{T}\). Note that when \(\lambda = {0_p}\), (27) reduces to ordinary LAD regression.

The main difficulty in solving pcLAD with a linear programming algorithm is how to bring the equality and inequality constraints into the optimization. Inspired by Koenker and Ng (2005), we can also follow Berman (1973) and consider the primal problem \(\mathop {\min }\limits _x \{ {c^{_{\mathrm{T}}}}x\left| {Ax - b \in T,x \in S} \right. \}\), where the sets \({\mathrm{T = \{ v}} \in {{\mathrm{R}}^n}{\mathrm{\} }}\) and \({\mathrm{S = \{ v}} \in {{\mathrm{R}}^{2n}} \times {{\mathrm{R}}^p}{\mathrm{\} }}\) can be arbitrary closed convex cones. This canonical problem has the dual \(\mathop {\max }\limits _y \{ {b^T}y\left| {c - {A^T}y \in {S^ * },y \in {T^ * }} \right. \} \), where \({{\mathrm{S}}^ * } = \{ y \in {R^{2n}} \times {R^p}:{x^T}y \ge 0 \ \text {for all} \ x \in S\}\) is the dual cone of S and \({{\mathrm{T}}^ * }{\mathrm{= \{ v}} \in {{\mathrm{R}}^n}{\mathrm{\} }}\).

Thus, pcLAD is equivalent to the following linear program

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _\beta \, 1_n^{\mathrm{T}}u + 1_n^{\mathrm{T}}v\\ s.t.\,\, \,\, \, \left\{ \begin{array}{l} u - v + \tilde{X}\beta = \tilde{y}\\ {C_1}\beta = {b_1}\\ {C_2}\beta \le {b_2}\\ u,v \in R{}_ + ^n,\beta \in {R^p} \end{array} \right. \end{array} \end{aligned}$$
(28)

Then, for our purposes, it suffices to consider the following special case:

$$\begin{aligned} \left\{ \begin{array}{l} {c^T} = (e_n^T,e_n^T,0_p^T)\\ {x^T} = ({u^{\mathrm{T}}},{v^T},{\beta ^T})\\ T = \{ v \in {0_{n + p + {m_1}}} \times R_ + ^{{m_2}}\} \\ S = \{ v \in R_ + ^{2n} \times {R^p}\} \end{array} \right. \text {and} \left\{ \begin{array}{l} {{\mathrm{b}}^T} = ({{\tilde{y}}^T},{{\mathrm{b}}_1}^T,{b_2}^T)\\ {y^T} = (d_1^T,d_2^T,d_3^T)\\ {T^ * } = \{ v \in {{\mathrm{R}}^{n + p}} \times {R^{{m_1}}} \times R_ - ^{{m_2}}\} \\ {S^ * } = \{ v \in R_ + ^{2n} \times {O_p}\} \end{array} \right. \end{aligned}$$
(29)

After some easy transformations, \({z_1} = \frac{{{d_1} + {e_n}}}{2},{z_2} = {d_2},{z_3} = - {d_3},z = {(z_1^T,z_2^T,z_3^T)^T}\), the dual problem in (28) can be expressed more concisely as,

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{z_1},{z_2},{z_3}} - \, ({{\tilde{y}}^T},b_1^T,b_2^T)z\\ s.t.\,\, \, \left[ {{{\tilde{X}}^T},C_1^T, - C_2^T} \right] {\mathrm{z}} = \frac{{{{\tilde{X}}^T}}}{2}{e_{n + p}},\\ \,\, \,\, \,\, \,\, \,\, \,\, \,\, {0_{n + p}} \le \, {z_1} \le {e_{n + p}},\\ \,\, \,\, \,\, \,\, \,\, \,\, \,\, {z_3} \ge {0_{{m_2}}}.\\ \end{array} \end{aligned}$$
(30)

It is noteworthy that we can also use the two inequality constraints \({C_1}\beta \ge {b_1}\) and \(-{C_1}\beta \ge -{b_1}\) instead of the equality constraint \({C_1}\beta = {b_1}\) when solving the optimization problem (30). The solution can then be found using the R package quantreg; specifically, the estimator is implemented in quantreg's functions rq.fit.fnc and rq.fit.sfnc.
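A hedged R sketch of this route is given below. It augments the data as in Sect. 4.1, rewrites the equality constraint as two inequalities, and calls quantreg's rq.fit.fnc, which fits median regression subject to \(R\beta \ge r\); the wrapper pclad_lp and its argument names are illustrative, and the returned components of rq.fit.fnc are assumed to follow quantreg's usual conventions.

```r
library(quantreg)

## Equality/inequality constrained, L1-penalized LAD via an interior point LP.
## lambda is a length-p vector; at least one constraint is assumed to be supplied.
pclad_lp <- function(X, y, lambda, C1 = NULL, b1 = NULL, C2 = NULL, b2 = NULL) {
  n <- nrow(X); p <- ncol(X)
  Xt <- rbind(X, n * diag(lambda, p))   # p pseudo-rows n*lambda_j*e_j', matching n*sum(lambda_j|beta_j|)
  yt <- c(y, rep(0, p))                 # augmented response
  R <- NULL; r <- NULL
  if (!is.null(C1)) { R <- rbind(R, C1, -C1); r <- c(r, b1, -b1) }  # C1 beta = b1 as two inequalities
  if (!is.null(C2)) { R <- rbind(R, -C2); r <- c(r, -b2) }          # C2 beta <= b2, i.e. -C2 beta >= -b2
  rq.fit.fnc(Xt, yt, R = R, r = r, tau = 0.5)$coefficients
}
```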

Besides transforming the augmented data into a constrained LAD optimization, pcLAD is also equivalent to the following linear program

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _\beta \, 1_n^{\mathrm{T}}u + 1_n^{\mathrm{T}}v + n{\lambda ^T}({\beta ^ + } + {\beta ^ - })\\ s.t.\,\, \,\, \, \left\{ \begin{array}{l} u - v + X\beta = y\\ \beta = {\beta ^ + } - {\beta ^ - }\\ {C_1}\beta = {b_1}\\ {C_2}\beta \le {b_2}\\ u,v \in R_ + ^n,\ {\beta ^ + },{\beta ^ - } \in R_ + ^p,\ \beta \in {R^p} \end{array} \right. \end{array} \end{aligned}$$
(31)

Similar to the derivation of (30), the dual of (31) is

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{z_1},{z_2},{z_3}} - \, ({y^T},0_p^T,b_1^T,b_2^T)z\\ s.t.\,\, \, \left[ {{X^T},n\,\text {diag}(\lambda ),C_1^T, - C_2^T} \right] {\mathrm{z}} = \frac{{{X^T}}}{2}{e_n} + \frac{n}{2}\lambda ,\\ \,\, \,\, \,\, \,\, \,\, \,\, \,\, {0_{n + p}} \le \,\, {z_1} \le {e_{n + p}},\\ \,\, \,\, \,\, \,\, \,\, \,\, \,\, {z_3} \ge {0_{{m_2}}},\\ \end{array} \end{aligned}$$
(32)

where \(\text {diag}(\lambda )\) denotes the diagonal matrix with the components of \(\lambda \) on its diagonal. This result is consistent with that of Gu et al. (2017). Furthermore, it is easy to verify that (30) and (32) are equivalent. As Gu et al. (2017) point out, (32) involves p equality constraints and is often solved with the interior point method; the interior point algorithm is the state-of-the-art method for fitting penalized LAD (0.5 quantile) regression in low to moderate dimensions. When p is very large, however, the interior point method is less efficient. The numerical evidence in Sect. 5.1 demonstrates this. This phenomenon motivates us to consider another efficient alternative for fitting the high dimensional pcLAD regression.

4.2 Alternating direction algorithm

The ADMM is a general convex optimization algorithm first introduced by Gabay and Mercier (1976) and Glowinski and Marrocco (1975). It has become popular recently because of its ability to solve high dimensional problems. In this subsection, we briefly review ADMM and propose a nested scaled-form ADMM for pcLAD. A comprehensive overview of ADMM can be found in Boyd et al. (2010).

In general, ADMM is an algorithm for solving problems that feature a separable objective but connecting constraints:

$$\begin{aligned} \begin{array}{l} \min \,\, f(x) + g(z)\\ s.t.\,\, \,\, Mx + Fz = c, \end{array} \end{aligned}$$
(33)

where \(f,g:{R^p} \mapsto R \cup \{ \infty \} \) are closed proper convex functions. ADMM solves problem (33) by writing it in the following equivalent form,

$$\begin{aligned} \begin{array}{l} \min \,\, \{ f(x) + g(z)\} + \frac{\tau }{2}\left\| {Mx + Fz - c} \right\| _2^2\\ s.t.\,\, \,\, Mx + Fz = c, \end{array} \end{aligned}$$
(34)

where the last term is called the augmentation, which is added for better convergence properties, and \(\tau \) is a tunable augmentation parameter. Following standard convex optimization methods, problem (34) has the following Lagrangian.

$$\begin{aligned} {L_\tau }(x,z,v) = f(x) + g(z) + {v^T}(Mx + Fz - c) + \frac{\tau }{2}\left\| {Mx + Fz - c} \right\| _2^2, \end{aligned}$$
(35)

where v is the dual variable.

The basic idea of ADMM is to apply block coordinate descent to the augmented Lagrangian function, followed by an update of the dual variable v

$$\begin{aligned} \begin{array}{l} {x^{(t + 1)}} \leftarrow \mathop {\arg \min }\limits _x {L_\tau }(x,{z^{(t)}},{v^{(t)}});\\ {z^{(t + 1)}} \leftarrow \mathop {\arg \min }\limits _z {L_\tau }({x^{(t + 1)}},{z^{(t)}},{v^{(t)}});\\ {v^{(t + 1)}} \leftarrow {v^{(t)}} + \tau (M{x^{(t + 1)}} + F{z^{(t + 1)}} - c); \end{array} \end{aligned}$$
(36)

where t is the iteration counter. It is often more convenient to work with the equivalent scaled form of ADMM, which scales the dual variable and combines the linear and quadratic terms in the update step (36). The updates become

$$\begin{aligned} \begin{array}{l} {x^{(t + 1)}} \leftarrow \mathop {\arg \min }\limits _x f(x) + \frac{\tau }{2}\left\| {Mx + F{z^{(t)}} - c + {u^{(t)}}} \right\| _2^2;\\ {z^{(t + 1)}} \leftarrow \mathop {\arg \min }\limits _z g(z) + \frac{\tau }{2}\left\| {M{x^{(t + 1)}} + Fz - c + {u^{(t)}}} \right\| _2^2;\\ {u^{(t + 1)}} \leftarrow {u^{(t)}} + M{x^{(t + 1)}} + F{z^{(t + 1)}} - c; \end{array} \end{aligned}$$
(37)

where \(u = \frac{v}{\tau }\) is the scaled dual variable. As discussed in Gaines et al. (2018), the scaled form is especially useful in the case where \({\mathrm{M}} = - F = I\) and \(c=0\), as the updates can be rewritten as

$$\begin{aligned} \begin{array}{l} {x^{(t + 1)}} \leftarrow pro{x_{\tau f}}({z^{(t)}} - {u^{(t)}});\\ {z^{(t + 1)}} \leftarrow pro{x_{\tau g}}({x^{(t + 1)}} + {u^{(t)}});\\ {u^{(t + 1)}} \leftarrow {u^{(t)}} + {x^{(t + 1)}} - {z^{(t + 1)}}; \end{array} \end{aligned}$$
(38)

where \(pro{x_{\tau f}}\) is the proximal mapping of a function f with parameter \(\tau \). Recall that the proximal mapping is defined as

$$\begin{aligned} pro{x_{\tau f}}(v) = \mathop {\arg \min }\limits _x \left( f(x) + \frac{\tau }{2}\left\| {x - v} \right\| _2^2\right) \end{aligned}$$
(39)

One benefit of using the scaled form of ADMM is that, in many situations, the proximal mappings have simple closed form solutions, resulting in straightforward ADMM updates. Gaines et al. (2018) have shown that the scaled form ADMM algorithm (hereinafter referred to as sADMM) is very effective for high dimensional constrained lasso estimation. Following this work, we show that the scaled form ADMM can also be extended to pcLAD.

Let \(f(\beta ) = \frac{1}{n}\sum \limits _{i = 1}^n {\left| {{y_i} - x_i^T\beta } \right| } + \sum \limits _{j = 1}^p {{\lambda _j}} \left| {{\beta _j}} \right| \) and \(g(z) = \chi _\mathbf{C }(z) = \left\{ \begin{array}{l} +\infty \,\, \,\, z \notin \mathbf{C} \\ 0\,\, \,\, \,\, \,\, \,\, \,\, \, z \in \mathbf{C} \end{array} \right. \), where the set \(\mathbf{C} \) is defined as \(\left\{ {z \in {R^p}:{C_1}z = {b_1},{C_2}z \le {b_2}} \right\} \). For the first update of (38), \(pro{x_{\tau f}}\) in classo can be regarded as a regular lasso problem, but in pcLAD it requires a more technical treatment. Substituting \(f(\beta )\) and g(z) into \(pro{x_{\tau f}}\), we get the following update

$$\begin{aligned} {\beta ^{(t + 1)}} = \mathop {\arg \min }\limits _\beta \frac{1}{n}\sum \limits _{i = 1}^n {\left| {{y_i} - x_i^T\beta } \right| } + \sum \limits _{j = 1}^p {{\lambda _j}} \left| {{\beta _j}} \right| + \frac{\tau }{2}\left\| {\beta + {{\mathrm{z}}^{(t)}} + {u^{(t)}}} \right\| _2^2. \end{aligned}$$
(40)

Problem (40) is an unconstrained optimization of the LAD loss function plus a penalty term, where the penalty combines a lasso penalty and a quadratic penalty. When \({z^{(t)}} + {u^{(t)}} = 0\), this combined penalty becomes an elastic net (Zou et al. 2005). Due to the combined penalty and the nonsmoothness of the LAD loss function, to the best of our knowledge there is no good method to solve (40) directly when p is of large scale. However, some recent papers have proposed algorithms for elastic net penalized quantile regression in high dimension, such as Gu et al. (2017) and Yu and Lin (2017). Inspired by these works, we can derive algorithms to compute (40).

Using the same steps as section 3.2 in Gu et al. (2017), (40) is equivalent to

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{\beta ,r} \,\, \frac{1}{n}\sum \limits _{i = 1}^n {\left| {{r_i}} \right| } + \sum \limits _{j = 1}^p {{\lambda _j}\left| {{\beta _j}} \right| } + \frac{\tau }{2}\left\| {\beta + {{\mathrm{z}}^{(t)}} + {u^{(t)}}} \right\| _2^2\\ s.t.\,\, \,\, X\beta + r = y \end{array} \end{aligned}$$
(41)

Fix \(\tilde{\tau } > 0\); the augmented Lagrangian function of (41) is

$$\begin{aligned} {L_{\tilde{\tau } }}(\beta ,{\mathrm{r}},\theta )= & {} \frac{1}{n}\sum \limits _{i = 1}^n {\left| {{r_i}} \right| } + \sum \limits _{j = 1}^p {{\lambda _j}\left| {{\beta _j}} \right| } + \frac{\tau }{2}\left\| {\beta + {{\mathrm{z}}^{(t)}} + {u^{(t)}}} \right\| _2^2 - {\theta ^T}(X\beta + r - y) \nonumber \\&+ \frac{{\tilde{\tau } }}{2}\left\| {X\beta + r - y} \right\| _2^2. \end{aligned}$$
(42)

Denote by \(({\beta ^{(t + 1,k)}},{r^{(k)}},{\theta ^{(k)}})\) the kth iterate of the algorithm for \(k \ge 0\); the next iterate is

$$\begin{aligned} \begin{aligned} {\beta ^{(t + 1,k + 1)}}&\leftarrow \mathop {\arg \min }\limits _\beta \sum \limits _{j = 1}^p {{\lambda _j}\left| {{\beta _j}} \right| } + \frac{\tau }{2}\left\| {\beta + {{\mathrm{z}}^{(t)}} + {u^{(t)}}} \right\| _2^2 - {\beta ^{\mathrm{T}}}{X^T}{\theta ^{(k)}}\\&\quad +\frac{{\tilde{\tau } }}{2}\left\| {X\beta + {r^{(k)}} - y} \right\| _2^2\\ {r^{(k + 1)}}&\leftarrow \mathop {\arg \min }\limits _r \frac{1}{n}\sum \limits _{i = 1}^n {\left| {{r_i}} \right| } - {r^T}{\theta ^{(k)}} + \frac{{\tilde{\tau } }}{2}\left\| {X{\beta ^{(t + 1,k + 1)}} + r - y} \right\| _2^2\\ {\theta ^{(k + 1)}}&\leftarrow {\theta ^{(k)}} - \tilde{\tau } (X{\beta ^{(t + 1,k + 1)}} + {r^{(k + 1)}} - y) \end{aligned} \end{aligned}$$
(43)

It is noteworthy that, although the \(\beta \) update of (43) is not the same as (40) in Gu et al. (2017), one can expand the quadratic penalty into a ridge penalty \(\left\| \beta \right\| _2^2\) plus a linear term \({\beta ^{\mathrm{T}}}({{\mathrm{z}}^{(t)}} + {u^{(t)}})\). Then,

$$\begin{aligned}&{\beta ^{(t + 1,k + 1)}} \leftarrow \mathop {\arg \min }\limits _\beta \left( \sum \limits _{j = 1}^p {{\lambda _j}\left| {{\beta _j}} \right| } + \frac{\tau }{2}\left\| \beta \right\| _2^2\right) - {\beta ^{\mathrm{T}}}[{X^T}{\theta ^{(k)}} - \tau ({{\mathrm{z}}^{(t)}} + {u^{(t)}})] \nonumber \\&\quad + \frac{{\tilde{\tau } }}{2}\left\| {X\beta + {r^{(k)}} - y} \right\| _2^2 \end{aligned}$$
(44)

This has the same form as the corresponding update in elastic net penalized quantile regression.

As in Algorithm 3 of Gu et al. (2017), when pADMM is used to solve (43), the update steps have the following closed-form expressions.

$$\begin{aligned} \begin{aligned} {\beta ^{(t + 1,k + 1)}}&\leftarrow ({(\tilde{\tau } \eta + \tau )^{ - 1}}\text {shrink}[\tilde{\tau } \eta {\beta ^{(t + 1,k)}} + X_j^T({\theta ^{(k)}} + \tilde{\tau } y - \tilde{\tau } X{\beta ^{(t + 1,k)}} - \tilde{\tau } {r^{(k)}}) \\&\quad - \tau ({z^{(t)}} + {u^{(t)}}),{\lambda _j}])_{1 \le j \le p};\\ {r^{(k + 1)}}&\leftarrow pro{x_{\tilde{\tau } {{\left\| \cdot \right\| }_1}}}(y - X{\beta ^{(t + 1,k + 1)}} + {{\tilde{\tau } }^{ - 1}}{\theta ^{(k)}});\\ {\theta ^{(k + 1)}}&\leftarrow {\theta ^{(k)}} - \tilde{\tau } \gamma (X{\beta ^{(t + 1,k + 1)}} + {r^{(k + 1)}} - y); \end{aligned} \end{aligned}$$
(45)

where \(\eta \ge {\varLambda _{\max }}({X^T}X)\) and \(\gamma \) is a constant controlling the step length for \(\theta \). \({\varLambda _{\max }}({X^T}X)\) denotes the largest eigenvalue of a real symmetric matrix, and \(\text {shrink}[x,y] = \text {sgn}(x)\max (\left| x \right| - y,0)\) denotes the soft shrinkage operator, with \(\text {sgn}\) being the sign function. These definitions are the same as those in Gu et al. (2017). Consider \(pro{x_{\tilde{\tau } {{\left\| \cdot \right\| }_1}}}(v) = \mathop {\arg \min }\limits _r (\frac{1}{n}\sum \nolimits _{i = 1}^n {\left| {{r_i}} \right| } + \frac{{\tilde{\tau } }}{2}\left\| {r - v} \right\| _2^2)\); as proved in Lemma 1 of Gu et al. (2017), it has the explicit solution

$$\begin{aligned} pro{x_{\tilde{\tau } {{\left\| \cdot \right\| }_1}}}(v) = v - \max \left( - \frac{1}{{2\tilde{\tau } }},\min \left( v,\frac{1}{{2\tilde{\tau } }}\right) \right) \end{aligned}$$
(46)

The main difference between (45) and the corresponding algorithm in Gu et al. (2017) is that the \(\beta \) update has one more constant offset term \(\tau ({z^{(t)}} + {u^{(t)}})\), arising from the linear term \(\tau {\beta ^{\mathrm{T}}}({{\mathrm{z}}^{(t)}} + {u^{(t)}})\) in the expansion of the quadratic penalty. Furthermore, we can also use scdADMM (Algorithm 4 in Gu et al. (2017)) to solve (43) by adding the same constant offset term to the corresponding position of the \(\beta \) update.
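
To make the inner iteration concrete, the following is a minimal numerical sketch of (45)–(46) in Python/NumPy. It follows the displayed formulas literally (including the threshold \(1/(2\tilde{\tau })\) in (46)); the function names (inner_padmm, shrink, prox_l1), the zero initializations, and the fixed iteration count are our own illustrative choices rather than the implementation used in the paper.

```python
import numpy as np

def shrink(x, lam):
    """Soft shrinkage operator: sgn(x) * max(|x| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l1(v, tau_t):
    """Explicit proximal map (46): v - clip(v, -1/(2*tau_t), 1/(2*tau_t))."""
    return v - np.clip(v, -1.0 / (2.0 * tau_t), 1.0 / (2.0 * tau_t))

def inner_padmm(X, y, zt, ut, lam, tau, tau_t=0.05, gamma=1.0, iters=500):
    """One outer beta-update (40), solved by the pADMM iteration (45)."""
    n, p = X.shape
    eta = np.linalg.eigvalsh(X.T @ X).max()        # eta >= Lambda_max(X'X)
    beta, r, theta = np.zeros(p), y.copy(), np.zeros(n)
    for _ in range(iters):
        # beta update: soft shrinkage with the extra constant offset tau*(z + u)
        v = tau_t * eta * beta + X.T @ (theta + tau_t * (y - X @ beta - r)) - tau * (zt + ut)
        beta = shrink(v, lam) / (tau_t * eta + tau)
        # r update: proximal map of the L1 loss, applied elementwise
        r = prox_l1(y - X @ beta + theta / tau_t, tau_t)
        # dual update with step-length factor gamma
        theta = theta - tau_t * gamma * (X @ beta + r - y)
    return beta
```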

For the second update of (38), we use the same method as Gaines et al. (2018). \(pro{x_{\tau {\mathrm{g}}}}\) is the projection onto the constraint set \(\mathbf{C} \). Projection onto convex sets is well studied, and in many applications the projection can be computed analytically (see Section 15.2 of Lange (2013) for several examples). When an explicit projection operator is not available, the projection can be obtained by using quadratic programming to solve the dual problem, which typically has a smaller number of variables.
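
For the equality-only case \(\mathbf{C} = \{ z:{C_1}z = {b_1}\} \) with \(C_1\) of full row rank, the projection has the well-known closed form \(z = v - C_1^T{({C_1}C_1^T)^{ - 1}}({C_1}v - {b_1})\); a minimal sketch is given below (the function name is ours). When inequality constraints are present, the projection is itself a small quadratic program and can be handed to a solver such as osqp, as described above.

```python
import numpy as np

def project_affine(v, C1, b1):
    """Euclidean projection of v onto {z : C1 z = b1}, with C1 of full row rank."""
    w = np.linalg.solve(C1 @ C1.T, C1 @ v - b1)   # solve (C1 C1') w = C1 v - b1
    return v - C1.T @ w
```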

Summarizing the above discussion, the nested scaled-form ADMM is described in Algorithm 1. Although Algorithm 1 contains a nested ADMM iteration, both the outer and inner iterations have explicit expressions, which makes the computation very fast in high dimensional settings. In fact, if the lasso problem in the \(\beta \) update of sADMM is solved by ADMM, sADMM also involves a nested ADMM iteration. For numerical evidence, see Sect. 5.1.
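
For orientation, the skeleton below sketches how the inner solver and the projection fit together. It is written under the standard scaled-ADMM splitting \(\beta = z\) with g the indicator of \(\mathbf{C} \) (so the sign conventions may differ from those used in (38) and (40)), it handles only equality constraints via project_affine, and it uses a simplified Boyd-type stopping rule; it is an illustrative sketch, not the exact Algorithm 1.

```python
import numpy as np

def nested_admm(X, y, C1, b1, lam, tau=None, outer_iters=200, eps=1e-3):
    """Schematic nested ADMM: outer scaled ADMM over beta = z, inner pADMM
    (inner_padmm above) for beta, projection (project_affine above) for z."""
    n, p = X.shape
    tau = 1.0 / n if tau is None else tau            # tau = 1/n, as suggested later in Sect. 5.1
    beta, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    for _ in range(outer_iters):
        # beta-update: LAD-lasso plus (tau/2)||beta - z + u||^2;
        # passing -z to inner_padmm yields the offset tau*(-z + u)
        beta = inner_padmm(X, y, -z, u, lam, tau)
        # z-update: project beta + u onto the constraint set
        z_new = project_affine(beta + u, C1, b1)
        # scaled dual update
        u = u + beta - z_new
        # simplified stopping rule on primal and dual residuals
        if (np.linalg.norm(beta - z_new) <= np.sqrt(p) * eps
                and np.linalg.norm(tau * (z_new - z)) <= np.sqrt(p) * eps):
            z = z_new
            break
        z = z_new
    return z
```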

Algorithm 1 Nested scaled-form ADMM for pcLAD

5 Simulation

In this section, we present numerical results for both the fixed dimensional setting and the high dimensional setting in which p is larger than n. All simulations were performed on an Intel E5-2650 2.0 GHz processor with 16 GB memory. In the fixed dimensional setting, we use \({\lambda _j} = \frac{{5\log (p)}}{{n\left| {{{\tilde{\beta } }_j}} \right| }}\), where \(\tilde{\beta }_j\) is the jth element of the ordinary LAD estimate. As discussed in Wang et al. (2007), these \(\lambda _j\) satisfy \(\sqrt{n} {a_n} \rightarrow 0\) and \(\sqrt{n} {b_n} \rightarrow \infty \). When p is high dimensional, we adopt \(\lambda = \sqrt{{{1.1\log p}}/{n}}\); although this is smaller than the value \(\lambda = \sqrt{{{2\log p}}/{n}}\) most commonly used, e.g. in Wang (2013), it does not affect the result of Theorem 2.
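
As a small illustration, the penalty levels described above can be computed as follows; beta_tilde denotes an initial unpenalized LAD estimate (obtained in the paper with quantreg), the constant 5 comes from the formula above, and the function name is ours.

```python
import numpy as np

def penalty_levels(beta_tilde, n, p):
    """Adaptive levels lambda_j = 5*log(p)/(n*|beta_tilde_j|) for fixed dimension,
    and the single level sqrt(1.1*log(p)/n) used when p is high dimensional."""
    lam_fixed = 5.0 * np.log(p) / (n * np.abs(beta_tilde))
    lam_high = np.sqrt(1.1 * np.log(p) / n)
    return lam_fixed, lam_high
```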

In the simulation, the model for the simulated data is \(y_{i}=x_i^{{\prime }}\beta +{\varepsilon _i}, i=1,2,\dots ,n.\) Each experiment uses three different error distributions for \(\varepsilon \), namely N(0, 1), t(2), and \(\text {Cauchy}(0,1)\). When the error term follows the normal distribution, the \(\lambda \) of classo is selected according to James et al. (2013); otherwise, we use 10-fold CV to select \(\lambda \). Moreover, each row of the design matrix X is generated from a \(N(0,\varOmega )\) distribution with Toeplitz correlation matrix \({\varOmega _{ij}} = {0.5^{\left| {i - j} \right| }}\), and X is normalized so that each column has \(L_2\) norm \(\sqrt{n}\).
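
A minimal sketch of this data-generating mechanism is given below; the function name, the seed handling, and the error labels are our own conventions for the three distributions above.

```python
import numpy as np

def simulate_data(n, p, beta0, rho=0.5, error="normal", seed=None):
    """Rows of X ~ N(0, Omega) with Toeplitz Omega_ij = rho^|i-j|,
    columns rescaled to L2 norm sqrt(n); y = X beta0 + eps."""
    rng = np.random.default_rng(seed)
    Omega = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Omega, size=n)
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)          # column L2 norm = sqrt(n)
    if error == "normal":
        eps = rng.standard_normal(n)
    elif error == "t2":
        eps = rng.standard_t(df=2, size=n)
    else:                                                 # Cauchy(0, 1)
        eps = rng.standard_cauchy(n)
    return X, X @ beta0 + eps
```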

5.1 Comparison of algorithms

In this subsection, we describe some implementation details of the algorithms and compare their running times. In all numerical experiments, we use four \(\mathbf{R} \) packages: quantreg, osqp, glmnet, and FHDQR. The first three packages can be found on the \(\mathbf{R} \) official website, https://www.r-project.org/, and the FHDQR package is available at https://users.stat.umn.edu/zouxx019/ftpdir/code/fhdqr/. LP and QP for constrained regression are implemented with quantreg and osqp, respectively. More details about QP can be found in Gaines et al. (2018). For fixed dimensional constrained regression, the initial values of ADMM are the unconstrained penalized estimates computed by quantreg, and in the high dimensional setting the corresponding initial values are computed by FHDQR. For the first update of (36) in classo, we use the glmnet package. The other iterative steps of sADMM and nADMM do not require any \(\mathbf{R} \) package since they have explicit solutions.

Fig. 1 Objective function values computed by several algorithms

As noted in Algorithm 1, nADMM involves three additional tuning parameters, \(\tau ,\tilde{\tau } ,\gamma \). We adopt \(\tau = \frac{1}{n}\) (as suggested by Gaines et al. (2018)), \(\tilde{\tau } = 0.05\), and \(\gamma = 1\) (the default value in FHDQR) in all numerical experiments. All ADMM algorithms are iterated until a stopping criterion is met; we adopt the stopping criterion of Boyd et al. (2010). Specifically, the outer iteration of nADMM is terminated either when the sequence \(\{ ({\beta ^{(t)}},{z^{(t)}},{u^{(t)}})\}\) meets the following criterion:

$$\begin{aligned} \begin{array}{l} {\left\| {{\beta ^{(t)}} - {z^{(t)}}} \right\| _2} \le \sqrt{p} {\varepsilon _1} + {\varepsilon _2}\max \{ {\left\| {{\beta ^{(t)}}} \right\| _2},{\left\| {{z^{(t)}}} \right\| _2}\} ; \\ \tau {\left\| {{z^{(t)}} - {z^{(t - 1)}}} \right\| _2} \le \sqrt{p} {\varepsilon _1} + {\varepsilon _2}\tau {\left\| {{u^{(t)}}} \right\| _2}; \end{array} \end{aligned}$$
(47)

where typical choices are \({\varepsilon _1} = {\varepsilon _2} = {10^{ - 3}}\), or when the number of nADMM iterations exceeds a given limit, say \(10^5\). The termination conditions for the inner iteration are the same as those for the outer iteration; to reduce the run time, the inner termination condition can be relaxed, for example to \({\varepsilon _1}={\varepsilon _2} = {10^{ - 2}}\). To examine the efficiency of the LP and nADMM algorithms for estimating pcLAD under different dimensions, we fix n at 100 and let \(p = (50,200,1000,2000)\). The true coefficient vector is \({\beta _0} = {( - 1, - 2, - 3,1,2,3,0_{p - 6}^T)^T}\) and \({\varepsilon _i} \sim N(0,1)\). To make this experiment representative, we use the mixed constraint set \(\mathbf{C} = \{ 1_p^T\beta = 0,{\beta _1},{\beta _2},{\beta _3} \le 0\}\). We use QP (quadratic programming) and sADMM to fit the classo regression, and LP and nADMM to fit the pcLAD regression. All simulations used 100 replicates, and the running times are recorded in Table 1.
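
For reference, this mixed constraint set can be written in the form of (2) as follows (an illustrative helper; the name and the 0-based indexing are ours).

```python
import numpy as np

def mixed_constraints(p):
    """Sum-to-zero equality plus beta_1, beta_2, beta_3 <= 0, as C1 b = b1, C2 b <= b2."""
    C1 = np.ones((1, p))                       # 1' beta = 0
    b1 = np.zeros(1)
    C2 = np.zeros((3, p))
    C2[np.arange(3), np.arange(3)] = 1.0       # beta_1, beta_2, beta_3 <= 0
    b2 = np.zeros(3)
    return C1, b1, C2, b2
```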

Table 1 Timings (in seconds) for running pcLAD and classo regression with specific \(\lambda \)

In the fixed dimensional setting, LP outperforms nADMM in running time, but as p increases the performance of LP deteriorates. In all settings, although we use the very efficient ADMM algorithm proposed by Stellato et al. (2018), QP needs more computation time than LP because the optimization form of QP is more complex. Conversely, nADMM takes more time than sADMM owing to the nonsmoothness of the LAD loss. To be specific, the main difference between nADMM and sADMM lies in the \(\beta \) step: the former is a variant of LADlasso, and the latter is a variant of lasso. One can also verify this difference in computation time by using the glmnet (lasso) and FHDQR (LADlasso) packages to fit the same high dimensional regression data. Note also that, for a meaningful timing comparison, we need to check the objective function values of pcLAD and classo at the optimal solutions computed by the different algorithms. To make sure the different algorithms yield comparable objective function values, it is sufficient to compare the optimal objective function value in (2), even though this is unfair to classo. The results are illustrated in Fig. 1, which shows that the objective function values of LP and nADMM are almost the same, and those of QP and sADMM are almost the same as well.

5.2 Sum to zero constraints

The first simulation involves a sum-to-zero constraint on the true parameter vector, \(\sum \nolimits _j {{\beta _j}} = 0\). Recently, this type of constraint on the lasso has seen increased interest, as it arises in the analysis of compositional data and in analyses of biological measurements taken relative to a reference point (Lin et al. 2014; Shi et al. 2016; Altenbuchinger et al. 2017). Written in the pcLAD formulation (2), this corresponds to \({C_1} = 1_p^\prime \) and \(b_{1}=0\). For this simulation, in order to distinguish the estimation error \(||\hat{\beta } - {\beta _0}||_2^2\) and the prediction error \(||X\hat{\beta }-X\beta _0||_2^2/n\) across cases, the true parameter vector was defined as \(\beta _{0} = (10,10,10, - 10,10, - 10,0, \ldots , 0)\). The true parameter satisfies the sum-to-zero constraint, so the constraint can be imposed on the estimators. The main results of the simulation are given in Tables 2 and 3.

Table 2 Sum to zero constraint in fixed p

In Tables 2 and 3, we set \(n=200\) and \(p=50, 400\), respectively. Each entry in the tables is the mean over 100 repetitions, and the number in brackets is its variance. We can see that classo has the best performance when \(\varepsilon \) follows the normal distribution. When \(\varepsilon \) follows t(2), which does not have bounded variance, classo does not perform as well as LADlasso and pcLAD. When \(\varepsilon \) follows \(\text {Cauchy}(0,1)\), classo no longer makes sense because the estimation and prediction errors are intolerable. Note that when an entry exceeds \(10 ^ 5\), to facilitate recording we report only its order of magnitude; for example, 1234567 is recorded as \(10^7\). Incidentally, when \(\varepsilon \) follows the Cauchy distribution, the medians of the prediction and estimation errors are not as large as the means, lying roughly between 10 and 20, which reflects that the classo estimator fluctuates greatly under the Cauchy distribution.

Table 3 Sum to zero constraint in high dimensional p

In all dimensional settings, the unconstrained LADlasso works normally under the three kinds of error terms. Because the constrained estimator exploits prior information, the estimation and prediction performance of LADlasso should be worse than that of the equality constrained LADlasso. In Table 2 this expected conclusion holds. However, a surprising result appears in Table 3: the estimation error of pcLAD is not as good as that of LADlasso in the three cases of \(\varepsilon \), while its prediction error is better.

At first this puzzled us, but repeated experiments gave the same results. We then noticed that this equality constraint involves all coefficients, whereas in the high dimensional setting of Sect. 3.2 the equality constraints are imposed only on the significant coefficients. To verify the conclusion of Sect. 3.2, we checked the nonzero coefficients selected by pcLAD, and the results confirmed our conjecture: many zero coefficients were mistakenly selected as nonzero. We therefore ran another group of experiments with the same settings as the previous high dimensional experiments, the only difference being that the constraints are imposed only on the significant coefficients. The constraint matrix is as follows:

$$\begin{aligned} \left( {\begin{array}{llllllll} 1&{}\quad 0&{}\quad 0&{}\quad { - 1}&{}\quad 0&{}\quad \cdots &{}\quad {}&{}\quad {}\\ 0&{}\quad 1&{}\quad 0&{}\quad 0&{}\quad { - 1}&{}\quad 0&{}\quad \cdots &{}\quad {}\\ 0&{}\quad 0&{}\quad 1&{}\quad 0&{}\quad 0&{}\quad { - 1}&{}\quad 0&{}\quad \cdots \end{array}} \right) , \end{aligned}$$

and \(b_{1}=(0,0,0)^{\prime }\).

Table 4 Three sum to zero constraints in high dimensional p

The main results of this experiment are shown in Table 4. From Table 4, we can see that the results of LADlasso naturally change little because it has no constraints, whereas the results of pcLAD are greatly improved. This shows that when p is high dimensional, constraints should be placed on the significant coefficients rather than on every coefficient; otherwise too many nonzero coefficients are selected. One may ask why such a restriction is not imposed on the sparse model with fixed dimension in this paper. The reason is that in the fixed dimensional pcLAD we use the adaptive \(L_1\) penalty, which imposes a very large penalty on the insignificant coefficients and forces the true zero coefficients to be estimated as 0.

5.3 Non negativity constraints

In this simulation, we choose \(n = 100\), \(p = 50\) for the fixed dimensional setting and \(n = 100\), \(p = 200\) for the high dimensional setting. This choice of n and p differs from the previous experiments in order to verify whether the performance of the model is consistent under different n and p. In this experiment, we also report the type 1 and type 2 errors. The average type 1 error is the average number of significant variables that are not selected over 100 runs, and the average type 2 error is the average number of insignificant variables that are selected over 100 runs. Because of the nonnegativity constraint, the true coefficient vector is chosen as \((1,2,3,4,5,6,0,\dots ).\)

The main results are shown in Tables 5 and 6. As in Sect. 5.2, classo does a better job in estimation and prediction errors under normal errors. However, it also has a disadvantage: its type 2 error is relatively large. The reason is that the \(\lambda \) chosen by CV tends to select more variables; for more details, see Leng et al. (2004) and Wang (2013). The estimation and prediction errors of classo are worse than those of pcLAD under t(2), but still acceptable. Under the \(\text {Cauchy}(0,1)\) distribution, however, the classo results are unreliable.

Table 5 Non-negativity constraints in fixed p
Table 6 Non-negativity constraints in high dimensional p

The above results show that the inequality constrained pcLAD has better estimation and prediction performance than the inequality constrained classo for non-normal data. Moreover, in terms of variable selection, pcLAD is better than classo in every case.

5.4 Complex constraints

In the above two simulations, we considered equality constraints and inequality constraints separately. In this subsection, we consider the case where equality and inequality constraints exist simultaneously, and two models are considered.

The first model we consider is the complex constrained ordinary LADlasso with \(n=200\) and \(p=50, 400\). In fact, this complex constrained lasso has already appeared in Hu et al. (2015b). In order to compare the estimation errors better, the coefficients of Hu et al. (2015b) are multiplied by 10. The true parameter vector is defined as \(\beta _{0}=(10,5,-10,0,\dots ,0,10,5,-10,0,\dots ,0)^{\prime }\), so only its 1st, 2nd, 3rd, 11th, 12th, and 13th elements are nonzero. The pcLAD estimator is computed subject to the constraints:

$$\begin{aligned} \begin{array}{l} {\beta _1} + {\beta _2} + {\beta _3} \ge 0, \ {\beta _1} + {\beta _3} + {\beta _{11}} + {\beta _{13}} = 0,\\ {\beta _2} + {\beta _5} + {\beta _{11}} \ge 10, \ {\beta _2} + {\beta _8} + {\beta _{12}} = 10. \end{array} \end{aligned}$$
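
In the notation of (2), these constraints can be encoded as below; the "\(\ge \)" inequalities are negated to fit \({C_2}\beta \le {b_2}\), and the helper name and 0-based indexing are our own.

```python
import numpy as np

def complex_constraints(p):
    """Constraints of the first model in Sect. 5.4 as C1 beta = b1, C2 beta <= b2."""
    C1 = np.zeros((2, p))
    C1[0, [0, 2, 10, 12]] = 1.0        # beta_1 + beta_3 + beta_11 + beta_13 = 0
    C1[1, [1, 7, 11]] = 1.0            # beta_2 + beta_8 + beta_12 = 10
    b1 = np.array([0.0, 10.0])
    C2 = np.zeros((2, p))
    C2[0, [0, 1, 2]] = -1.0            # beta_1 + beta_2 + beta_3 >= 0
    C2[1, [1, 4, 10]] = -1.0           # beta_2 + beta_5 + beta_11 >= 10
    b2 = np.array([0.0, -10.0])
    return C1, b1, C2, b2
```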

The main results are shown in Tables 7 and 8; the presentation of the results is the same as in Sect. 5.3, so we omit a detailed analysis here.

Table 7 Complex constraints in fixed dimensional p
Table 8 Complex constraints in high dimensional p

The second model we consider is the complex constrained LAD fused lasso and the complex constrained fused lasso with \(n=200\) and \(p=100, 1000\). Indeed, this complex constrained quantile lasso has already appeared in Liu et al. (2020). We adopt the true parameter vector used in Liu et al. (2020), that is, \(\beta _{0}=(-1,-1,1,1,0,\dots ,0)^{\prime }\). The linear constraints are:

$$\begin{aligned} \begin{array}{l} {\beta _1} - 2{\beta _2} - {\beta _3} + 2{\beta _4} \ge 1,\,\, \,\, \,\, \,\, 3{\beta _1} - 2{\beta _2} + {\beta _3} + {\beta _4} \ge 0,\\ {\beta _1} - {\beta _2} + 2{\beta _3} + 5{\beta _4} = 7,\,\, \,\, \,\, \,\, - 3{\beta _1} + {\beta _2} - 6{\beta _3} - {\beta _4} = - 5. \end{array} \end{aligned}$$

The main results for the complex constrained LAD fused lasso on this synthetic data are shown in Tables 9 and 10. Note that our theoretical analysis does not cover the fused LADlasso, so we use CV to select all penalty parameters \(\lambda \).

Table 9 Complex constrained fused models in fixed dimensional p
Table 10 Complex constrained fused models in high dimensional p

In Sect. 3.3, we clarified that the LAD fused lasso and the constrained LAD fused lasso may not have the Oracle property under the assumptions of this paper. However, Tables 9 and 10 show that the constrained LAD fused lasso has good estimation and prediction performance. This numerical result suggests that the constrained LAD fused lasso may also enjoy the Oracle property under new technical assumptions and methods of proof.

An interesting phenomenon is that the estimation error and the prediction error are of the same order in all cases. We have not proved this in theory, but we believe the theoretical results should be close to the numerical ones.

6 Real data applications

In this section, we apply pcLAD to three different real data sets and compare it with classo.

Fig. 2 Global warming data

6.1 Global warming data

For our first application of pcLAD to a real data set, we revisit the global temperature data provided by Jones et al. (2016). The data set consists of annual temperature anomalies from 1850 to 2015. As mentioned, there appears to be a monotone trend in the data over time, so it is natural to incorporate this information when the trend is estimated.

Wu et al. (2001) and Gaines et al. (2018) achieved this by using isotonic regression. The LAD version of isotonic regression has been described in Sect. 2.2, so we do not repeat it here. Because we want fitted temperatures for every year, the penalty term is unnecessary, so \(\lambda = 0\). In this experiment the sample size n and the dimension p are equal and not large, so we use LP and QP to solve pcLAD and classo, respectively. Notably, the design matrix X is the p-dimensional identity matrix, so the optimal solution is the fitted value of y. The fits of pcLAD and classo are shown in Fig. 2. It is hard to see visually which method fits better, so we compute \(\left\| {y - \hat{y}} \right\| _1\) for classo and pcLAD. The values are 12.31 and 12.14, respectively, showing that the pcLAD fit is closer to the observed values under this criterion.
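
For completeness, the monotonicity constraints can be written in the inequality form of (2) as \({\beta _j} - {\beta _{j + 1}} \le 0\) for an increasing trend; a small sketch follows (assuming the increasing-trend convention, with the exact formulation as in Sect. 2.2; the helper name is ours).

```python
import numpy as np

def monotone_constraint(p):
    """Inequality matrix encoding beta_1 <= beta_2 <= ... <= beta_p."""
    C2 = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    C2[idx, idx] = 1.0                 # beta_j
    C2[idx, idx + 1] = -1.0            # - beta_{j+1} <= 0
    b2 = np.zeros(p - 1)
    return C2, b2
```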

6.2 Brain tumor data

Our second application of pcLAD uses a version of the comparative genomic hybridization (CGH) data from Bredel et al. (2005), which was modified and studied by Tibshirani and Wang (2008) and Gaines et al. (2018). This version of the dataset is available in the cghFLasso R package. The dataset includes CGH measurements from 2 glioblastoma multiforme (GBM) brain tumors. CGH array experiments are often used to estimate each gene's DNA copy number by obtaining the \(\log _2\) ratio of the number of DNA copies of the gene in the tumor cells relative to the number of DNA copies in the reference cells. Mutations in cancerous cells result in amplifications or deletions of a gene from the chromosome, so the purpose of the analysis is to identify these gains or losses in the DNA copies of that gene (Michels et al. 2007). For a more detailed description of this data, see Bredel et al. (2005) and Michels et al. (2007). The form of pcLAD applied to this data set is as follows:

$$\begin{aligned} \mathop {\mathrm{minimize}}\limits _\beta \left\| {y - \beta } \right\| _1 + n{\lambda _1}\left\| \beta \right\| _1 + n{\lambda _2}\sum \limits _{j = 2}^p {\left| {{\beta _j} - {\beta _{j - 1}}} \right| }. \end{aligned}$$
(48)
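
As a check on the formulation, the objective (48) can be evaluated directly as below (an illustrative helper; the name is ours, and since the design matrix is the identity, \(\beta \) itself is the fitted sequence).

```python
import numpy as np

def sparse_fused_lad_objective(beta, y, lam1, lam2):
    """Objective (48): LAD fit to y plus sparsity and fusion penalties."""
    n = len(y)
    return (np.sum(np.abs(y - beta))
            + n * lam1 * np.sum(np.abs(beta))
            + n * lam2 * np.sum(np.abs(np.diff(beta))))
```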

From (48), we know the optimal solution is a sparse sequence fitting y, which is the \(\log _2\) ratio mentioned above. In this experiment, the sample size n and the dimension p are equal and large, so we use nADMM and sADMM to solve pcLAD and classo, respectively. The penalty parameters \(\lambda _1\) and \(\lambda _2\) are selected by 10-fold CV because our theoretical analysis does not cover the fused LADlasso. Moreover, (48) can also be solved by the genlasso R package. We do not show the results of the genlasso package, because Gaines et al. (2018) already used it in this real data application and showed that the results of sADMM and genlasso coincide. We compare the fits of this form of pcLAD and classo (sparse fused lasso, Tibshirani et al. (2005)) in Fig. 3. From Fig. 3, we can see that the pcLAD fit is better, especially for some large values. In numerical terms, \(\left\| {y - \hat{y}} \right\| _1\) for the sparse fused lasso and pcLAD is 357.1 and 244.7, respectively. The difference between the two values shows that pcLAD fits this dataset better than the sparse fused lasso.

Fig. 3 Brain tumor data

6.3 Stock index data

The last real data application concerns the Shanghai Stock Exchange 50 stock index (SSE 50) and the Shanghai Shenzhen 300 stock index (CSI 300). The SSE 50 Index is composed of 50 representative stocks with large scale and good liquidity in the Shanghai securities market. The CSI 300 Index is made up of 300 A-shares from the Shanghai and Shenzhen stock markets.

First, we give a brief introduction to stock indices and index tracking. A stock index summarizes and predicts the trend of the stock market through a set of representative stocks; for example, the SSE 50 contains 50 component stocks, and many other well-known indices are named after their exchanges, such as the S&P 500 and the FTSE 100. Index tracking is a popular problem whose main idea is to select a few representative stocks to reproduce the whole index. The contribution of each component stock to the index must be positive, so nonnegativity constraints need to be imposed on the regression coefficients. Because index tracking requires both sparsity and nonnegativity, many nonnegative constrained penalized estimators have been proposed recently, including Wu et al. (2014), Yang and Wu (2016), Wu and Yang (2014), Li et al. (2019), etc. Following these works, the form of pcLAD applied to index tracking in this section is the nonnegative LADlasso (nLADlasso) mentioned in Sect. 2.4, which is the LAD version of the nonnegative lasso (nlasso) of Wu et al. (2014). It is worth noting that the contribution of every component stock to the index is positive, that is, the true model is not sparse and every true coefficient is positive. The sparsity assumption is therefore not tenable, and the selection criterion for \(\lambda \) is meaningless. Following Wu et al. (2014), we choose a grid of penalty parameters from 0 to a sufficiently large positive number that shrinks all coefficients to 0, with equal spacing between consecutive values.

Because the two most recent component adjustments of the CSI 300 Index and the SSE 50 Index took place on December 16, 2019 and June 15, 2020, we selected SSE 50 and CSI 300 data from January 2 to June 12, 2020. Note that some component stocks of the SSE 50 and CSI 300 were suspended from trading for short periods; we fill the prices at these times with the average price of the stock over its trading days. The data are divided into two time windows: the first 80 days are used for modeling and the next 20 days for forecasting. In tracking the SSE 50 Index, the sample size is \(n = 80\) and the dimension is \(p = 50\), a fixed dimensional problem, whereas for the CSI 300 Index, \(n = 80\) and \(p = 300\), a high dimensional problem. Therefore, we use the fixed dimensional pcLAD and classo for SSE 50 tracking and the high dimensional versions for the CSI 300. For these data, \(\lambda =100\) is large enough to shrink all coefficients to 0, so we take 1000 penalty parameters from 0 to 100 with spacing 0.1.

Let \(x_{t,j}\) and \(y_{t}\) represent the returns of the jth constituent stock and the index respectively, \(j = 1,2,\dots ,50(300)\). Then we can describe the relationship between \(x_{t,j}\) and \(y_{t}\) by a linear regression model:

$$\begin{aligned} {y_t} = \sum \limits _{j = 1}^p {{\beta _j}{x_{t,j}}} + {\varepsilon _t},\quad t = 1,2, \ldots ,T, \end{aligned}$$
(49)

where \(\beta _j\) is the weight of the jth chosen stock and \(\varepsilon _t\) is the error term. In practice, the optimal estimate of \(\beta \) gives the proportion of each stock: for example, if \(\hat{\beta _1}=1\) and \(\hat{\beta _2}=2\), then when tracking the index one should hold 2 units of stock 2 for each unit of stock 1.

The bias measure for tracking, called Annual Tracking Error (ATE), is defined by

$$\begin{aligned} Tracking~Erro{r_{Year}} = \sqrt{252} \times \sqrt{\frac{{\sum \nolimits _{t = 1}^T {{{(er{r_t} - \mathrm{mean}(err))}^2}} }}{{T - 1}}}, \end{aligned}$$
(50)

where \(err_{t}=\hat{y}_{t}-y_{t}\) and \(\hat{y}_{t}\) is the fitted or predicted value of \(y_t\), for \(t=1,2,\dots ,T\).
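
For concreteness, the ATE in (50) can be computed as below (an illustrative helper; the name is ours, and 252 is the usual number of trading days per year).

```python
import numpy as np

def annual_tracking_error(y_hat, y):
    """Annual Tracking Error (50): sqrt(252) times the sample std of the tracking errors."""
    err = y_hat - y
    return np.sqrt(252.0) * np.sqrt(np.sum((err - err.mean()) ** 2) / (len(err) - 1))
```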

The results of pcLAD (nLADlasso) and classo (nlasso) are shown in Table 11. For SSE 50 index tracking we select 5, 10, and 20 component stocks, and for the CSI 300 index we select 25, 30, and 40 component stocks. For a given number of nonzero coefficients, we record only the model with the smallest training ATE. From Table 11, in all tracking experiments the ATE of the pcLAD method is better than that of classo, both over the 80 modeling days and over the 20 forecasting days.

Table 11 SSE 50 and CSI 300 index tracking data

We are not surprised by this tracking result. First, financial market data rarely satisfy the Gaussian assumption. Second, the outbreak of the novel coronavirus in 2020 led to more turbulence and uncertainty in the stock indices. Hence the reliability of the usual OLS-based estimation and model selection methods is severely challenged, whereas LAD-based methods become more attractive. In addition, both ATE values decrease as the number of nonzero coefficients increases. The reason is that the true coefficients are all positive: the more nonzero coefficients are selected, the smaller the ATE. However, in financial markets holding more stocks means higher costs, so sparsity is indispensable in index tracking. Finally, Fig. 4 shows the fitted and predicted results of classo and pcLAD when 30 stocks are selected to track the CSI 300.

Fig. 4 The fitted and predicted results of tracking the CSI 300 index

7 Discussion

When the noise is not Gaussian or near Gaussian, pcLAD is an effective alternative to the classo method. In this paper, we prove that effective constraints can improve the accuracy of the pcLAD estimator: in fixed dimension the constraints reduce the variance of the estimator, and in high dimension they reduce the upper bound on the estimation bias. Furthermore, two algorithms, linear programming and nested ADMM, are proposed to solve pcLAD efficiently in fixed and high dimensions, respectively.

However, many directions remain for further study. In this paper, we assume that k is finite when p is of order \({e^{{n^\pi }}}\), where \(0<\pi <1\). This assumption is stricter than in the case \(p<n\) and limits the number of equality constraints m. How to generalize the theoretical results to the case where both k and m tend to infinity with n is a challenging problem. The upper bound (23) in high dimension can also be improved, because as the number of equality constraints m increases the bound does not continue to decrease; in particular, when \(m = k\) the upper bound equals that of the unconstrained case.

The theoretical derivations and algorithms of this paper can be applied to more general models such as constrained Huber estimation and quantile and composite quantile estimation (Gu and Zou 2020). For constrained regression with other penalty terms, such as the elastic net, SCAD, and MCP (Zhang 2010), the idea of nested ADMM is also available, but the theoretical analysis requires more technical tools. In both fixed and high dimensions, although the two proposed algorithms can be applied to the generalized lasso and the constrained generalized lasso, their theoretical analysis cannot be included in the framework of pcLAD. Obtaining Oracle or near-Oracle properties for the generalized lasso and the constrained generalized lasso under other assumptions is also a challenge.

Recently, parallel algorithms have been applied to large scale penalized regression, for example by Liqun et al. (2017) and Fan et al. (2020), and have achieved good performance in numerical experiments. Extending the nested ADMM to a parallel framework is potentially valuable work for big data.