Abstract
The Lasso approach is widely adopted for screening and estimating active effects in sparse linear models with quantitative factors. Many design schemes based on different criteria have been proposed to make the Lasso estimator more accurate. This article applies \(\varPhi _l\)-optimality to the asymptotic covariance matrix of the Lasso estimator, achieving smaller mean squared error and higher power in tests of significance. A provably convergent algorithm is given for searching for \(\varPhi _l\)-optimal designs, and it is modified by intermittent diffusion to avoid local solutions. Simulations are given to support the theoretical results.
Acknowledgements
Yimin Huang and Xiangshun Kong contributed equally as first authors, and Mingyao Ai is the corresponding author. The authors sincerely thank the editor, the associate editor, and two referees for their valuable comments and insightful suggestions, which led to further improvement of this article. The work is supported by NSFC Grants 11671019 and 11801033, LMEQF, and the Beijing Institute of Technology Research Fund Program for Young Scholars.
Appendices
Appendix A: Proofs of Theorems
We first introduce Lemma 1, which lays the theoretical foundation for calculating \(d(\cdot ,\xi )\) in Theorem 1.
Lemma 1
Let \(\varvec{\theta }(\varepsilon )\) be the solution of
where \(\varvec{\theta }\) is a p-dimensional row vector, j is a given integer in \(\{1,\ldots ,p\}\), A is a \(p\times p\) symmetric positive definite matrix, B is a \(p\times p\) symmetric matrix, \(\varvec{e}_j\) is the p-dimensional vector with 1 in the jth entry and 0 elsewhere, and \(\mu >0\) is a constant. Then it follows that \(\varvec{\theta }(\varepsilon )\) is continuous and differentiable at \(\varepsilon =0\).
Proof
(a) First, we prove the continuity of \(\varvec{\theta }(\varepsilon )\) at \(\varepsilon =0\). Note that the objective function in (6) is strictly convex near \(\varepsilon =0\), so the solution of the quadratic optimization (6) is unique. Moreover, the feasible region is a convex polytope.
Geometrically, place the feasible region below the graph of the objective function in a higher-dimensional space and raise it until the two sets intersect for the first time; the point of first contact is the solution.
Since both the feasible region and the graph of the objective function are continuous with respect to \(\varepsilon \), their first point of contact, \(\varvec{\theta }(\varepsilon )\), is also continuous in \(\varepsilon \).
(b) Second, we prove the differentiability of \(\varvec{\theta }(\varepsilon )\) at \(\varepsilon =0\). Let \(c=\min _{\varvec{\theta }}\varvec{\theta } (A+\varepsilon B)\varvec{\theta }^T\). Note that \(\varvec{\theta }(A+\varepsilon B)\varvec{\theta }^T=c\) is an ellipsoid, and the feasible region is a convex polyhedron \(\mathcal {P}\). Hence \(\varvec{\theta }(\varepsilon )\) is the unique intersection of the ellipsoid with one hyperplane, namely the i-th face of the convex polyhedron \(\mathcal {P}\). To be precise, \(\varvec{\theta }(\varepsilon )\) is the unique solution of
where \((A+\varepsilon B)_i\) is the i-th row of the matrix \(A+\varepsilon B\), j is the integer in (6) and \(\mathbf{1 }_{\{i=j\}}\) is the indicator function.
The value of c changes when \(\varepsilon \) changes slightly, but the index i does not: by part (a), both the convex polyhedron and \(\varvec{\theta }(\varepsilon )\) are continuous with respect to \(\varepsilon \), so the solution remains on the i-th face. If \(\varvec{\theta }(\varepsilon )\) lies on both the i-th and the \(i'\)-th faces of the polyhedron, the solution remains on one of them after \(\varepsilon \) changes. In other words, there exists an i such that Eq. (7) holds for any sufficiently small \(\varepsilon \). The same conclusion holds when \(\varvec{\theta }(\varepsilon )\) lies on more faces (the intersection of more than two hyperplanes).
Since the solution is unique, there exists a set of pairs \((A_k, c_k)\), \(k=1,\ldots ,p\), such that (7) is equivalent to \(\sum _{k=1}^p(A_k\varvec{\theta }^T+c_k)^2=0\). Note that the solution \(\varvec{\theta }(\varepsilon )\) does not depend on the parameter c in (7), since c only serves to match the constant terms \(c_k\). Thus the solution depends only on \(\varepsilon \). The differentiability of \(\varvec{\theta }(\varepsilon )\) follows because the solution of this system of linear equations is an algebraic expression in \(\varepsilon \). \(\square \)
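Lemma 1 can be illustrated numerically. Since display (6) is not reproduced in this excerpt, the sketch below assumes a CLIME-style feasible region \(\Vert (A+\varepsilon B)\varvec{\theta }^T-\varvec{e}_j\Vert _\infty \le \mu \); the instance \((A, B, \mu , j)\) and the solver settings are illustrative only, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical instance of the optimization in Lemma 1.  The exact feasible
# region (display (6)) is not reproduced here; we ASSUME a CLIME-style set
# ||(A + eps*B) theta^T - e_j||_inf <= mu for illustration.
rng = np.random.default_rng(0)
p, j, mu = 4, 0, 0.1
X = rng.standard_normal((20, p))
A = X.T @ X / 20 + 0.5 * np.eye(p)                   # symmetric positive definite
B = rng.standard_normal((p, p)); B = (B + B.T) / 2   # symmetric perturbation

e_j = np.zeros(p); e_j[j] = 1.0

def theta(eps):
    """Solve min_theta theta (A + eps B) theta^T over the assumed feasible region."""
    M = A + eps * B
    # |M theta - e_j| <= mu componentwise, written as 2p smooth linear inequalities
    cons = {"type": "ineq",
            "fun": lambda t: np.concatenate([mu - (M @ t - e_j), mu + (M @ t - e_j)])}
    res = minimize(lambda t: t @ M @ t, x0=np.linalg.solve(M, e_j),
                   constraints=[cons], method="SLSQP",
                   options={"ftol": 1e-12, "maxiter": 200})
    return res.x

t0 = theta(0.0)
for eps in (1e-2, 1e-3):
    print(eps, np.linalg.norm(theta(eps) - t0))      # gaps are small (continuity)
```

The gap \(\Vert \varvec{\theta }(\varepsilon )-\varvec{\theta }(0)\Vert \) stays small as \(\varepsilon \rightarrow 0\), consistent with the continuity established in part (a).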
Let \({\dot{\varTheta }}=\lim _{\varepsilon \rightarrow 0+}\varepsilon ^{-1}(\varTheta (\varepsilon )-\varTheta (0)),\) in which \(\varTheta (\varepsilon )\) is the solution of
The existence of \({\dot{\varTheta }}\) can be verified by substituting \(A={\hat{\varSigma }}_\xi \) and \(B=\varvec{x}\varvec{x}^T-{\hat{\varSigma }}_\xi \) into Lemma 1.
Proof of Theorem 1
The approximate design \(\xi \) is a minimizer or a stationary point in (3) if and only if the Fréchet derivative
is non-negative for any \(\varvec{x}\in {\mathcal {X}}\), where \(\delta _{\varvec{x}}\) denotes a one-point design on \(\varvec{x}\). To get the Fréchet derivative, we can calculate the Gâteaux derivative of \(\log (\varPhi _l(M(\cdot )))\) at \(\xi \) in the direction \(\delta _{\varvec{x}}\), i.e., \(G(\xi ,\delta _{\varvec{x}})=\lim _{\varepsilon \rightarrow 0+}\varepsilon ^{-1}[\log \varPhi _l(M(\xi +\varepsilon \delta _{\varvec{x}}))-\log \varPhi _l(M(\xi ))].\) Note that \(\log \varPhi _l(M(\cdot ))=\log [\mathrm{tr}(M(\cdot )^l)]^{1/l}=l^{-1}\log \mathrm{tr}(M(\cdot )^l).\) We have
where \(M_1=M(\xi +\varepsilon \delta _{\varvec{x}})\) and \(M_2=M(\xi )\).
The definition of \(M_1\) gives that \(M_1=\varTheta _1({\hat{\varSigma }}_\xi +\varepsilon \varvec{x}\varvec{x}^T)\varTheta _1^T,\) where \(\varTheta _1\) is the solution of
Similarly, we have \(M_2=\varTheta _2{\hat{\varSigma }}_\xi \varTheta _2^T,\) where \(\varTheta _2\) is the solution of
From Lemma 1, the relation between \(\varTheta _1\) and \(\varTheta _2\) can be represented as
Denote \(\dot{\varTheta }_1(0)\) by \(\dot{\varTheta }_1\) for convenience. Then, it follows that
Substituting the above decomposition of \(\mathrm{tr}(M_1^l)\) into (8), we have
Therefore, the Gâteaux derivative can be calculated by
The Fréchet derivative of \(\log (\varPhi _l(M(\cdot )))\) at \(\xi \) in the direction of \(\delta _{\varvec{x}}\) is obtained by \(d(\varvec{x},\xi )=G(\xi ,\delta _{\varvec{x}}-\xi ).\) We need only replace the matrix \(\varvec{x}\varvec{x}^T\) of the design \(\delta _{\varvec{x}}\) in the Gâteaux derivative of \(\log (\varPhi _l(M(\cdot )))\) with \(\varvec{x}\varvec{x}^T-{\hat{\varSigma }}_\xi \). A simple calculation shows that
\(\square \)
To implement the proposed algorithm, an approximation of \({\dot{\varTheta }}\) must first be given. When \(\mu =0\), which can happen if \(n\ge p\), we have \(\varTheta (\varepsilon )[{\hat{\varSigma }}_\xi +\varepsilon (\varvec{x}\varvec{x}^T-{\hat{\varSigma }}_\xi )]= I.\) Taking the derivative of both sides at \(\varepsilon =0\) yields
Right multiplying both sides of the above equation by \(\varTheta ^T\) gives
Substituting the above equation into d, the approximate form of d is given by
In the other cases, the inverse of \({\hat{\varSigma }}\) does not exist, but the same form is still used, with \(\varTheta \) restricted as in (3). Analogously, we use it to approximate the asymptotic covariance matrix. Therefore, in the simulations, we still use this approximation of d even in the case \(n<p\).
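As a sanity check on this approximation, consider the \(\mu =0\) case, where \(\varTheta ={\hat{\varSigma }}_\xi ^{-1}\) and hence \(M(\xi )={\hat{\varSigma }}_\xi ^{-1}\). A short calculation from the relation above (sketched here, not the paper's displayed formula) gives \(d(\varvec{x},\xi )=1-\varvec{x}^T{\hat{\varSigma }}_\xi ^{-(l+1)}\varvec{x}/\mathrm{tr}({\hat{\varSigma }}_\xi ^{-l})\), which the snippet below verifies against a central finite difference of \(\log \varPhi _l\); the design and candidate point are made up for illustration.

```python
import numpy as np

# Illustrative check of the mu = 0 approximation: Theta = Sigma^{-1}, so
# M(xi) = Sigma^{-1}, and the directional derivative of log Phi_l(M(.)) at
# xi toward delta_x works out to
#   d(x, xi) = 1 - x^T Sigma^{-(l+1)} x / tr(Sigma^{-l}).
# The design and candidate point below are a made-up instance.
rng = np.random.default_rng(2)
p, n, l = 4, 30, 2
D = rng.uniform(-1, 1, (n, p))               # an arbitrary n-point design
Sigma = D.T @ D / n                          # hat-Sigma_xi (invertible here)
x = rng.uniform(-1, 1, p)                    # candidate point in X

Si = np.linalg.inv(Sigma)
tr_l = np.trace(np.linalg.matrix_power(Si, l))
d_closed = 1.0 - x @ np.linalg.matrix_power(Si, l + 1) @ x / tr_l

def log_phi(eps):
    # log Phi_l(M(xi + eps(delta_x - xi))) with M the inverse of the
    # perturbed covariance matrix (mu = 0 case)
    M = np.linalg.inv(Sigma + eps * (np.outer(x, x) - Sigma))
    return np.log(np.trace(np.linalg.matrix_power(M, l))) / l

h = 1e-6
d_numeric = (log_phi(h) - log_phi(-h)) / (2 * h)   # central finite difference
print(d_closed, d_numeric)                         # the two values agree closely
```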
Proof of Theorem 2
Note that in Algorithm 1, the function \(\varPhi _l(M(\xi _t))>0\) decreases with respect to t. Hence there exists a non-negative real number \(\varPhi _l^*\) such that \(\lim _{t\rightarrow \infty }\varPhi _l(M(\xi _t))=\varPhi _l^*.\)
Without loss of generality, we assume that there exists a design \(\xi ^{**}\) such that \(\varPhi _l(M(\xi ^{**}))=\varPhi _l^*.\) We argue by contradiction. If \(\xi ^{**}\) is not the optimal design, i.e., \(\varPhi _l(M(\xi ^*))<\varPhi _l^*\), then by Theorem 1 there exists an \(\varvec{x}^*\) such that \(d(\varvec{x}^*,\xi ^{**})<0\), say \(\inf _{\varvec{x}\in {\mathcal {X}}}d(\varvec{x},\xi ^{**})=-2\gamma <0\). Hence, for any sufficiently large t, we have \(\inf _{\varvec{x}\in {\mathcal {X}}}d(\varvec{x},\xi _t)\le -\gamma .\) Let \({\widetilde{\xi }}_{t+1}(\alpha )=(1-\alpha )\xi _t+\alpha \delta _{\varvec{x}_t},\) where \(\varvec{x}_t=\arg \min _{\varvec{x}\in {\mathcal {X}}}d(\varvec{x},\xi _t)\), and let \(\alpha _t=\arg \min _{\alpha \in [0,1]}\varPhi _l(M({\widetilde{\xi }}_{t+1}(\alpha ))).\) A Taylor expansion of \(\varPhi _l[M({\widetilde{\xi }}_{t+1}(\alpha ))]\) at \(\alpha =0\) gives
where U is the upper bound of the second order derivative of \(\varPhi _l[M({\widetilde{\xi }}_{t+1}(\alpha ))]\) over [0, 1].
Note that in the proof of Theorem 1, we have
By substituting (11) into (10), it follows that
Therefore, for \(t>N\), we have that
As \(t\rightarrow \infty \), the left-hand side tends to a finite number while the right-hand side tends to \(-\infty \), a contradiction. \(\square \)
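To make the convergence argument concrete, here is a minimal sketch of a vertex-direction scheme of the kind analyzed above, specialized to the \(\mu =0\) case where \(d(\varvec{x},\xi )=1-\varvec{x}^T{\hat{\varSigma }}_\xi ^{-(l+1)}\varvec{x}/\mathrm{tr}({\hat{\varSigma }}_\xi ^{-l})\). The finite candidate set, the grid search for \(\alpha _t\), and the stopping tolerance are illustrative choices, not the paper's Algorithm 1.

```python
import numpy as np

# A minimal vertex-direction sketch in the spirit of the iteration above,
# specialized to mu = 0 so that M(xi) = Sigma_xi^{-1}.  All numerical
# choices (candidate set, alpha grid, tolerance) are illustrative.
rng = np.random.default_rng(3)
p, l = 3, 2
cands = rng.uniform(-1, 1, (200, p))             # finite candidate set for X

def sigma(w):
    return (cands * w[:, None]).T @ cands         # Sigma_xi = sum_i w_i x_i x_i^T

def phi_l(w):
    M = np.linalg.inv(sigma(w))                   # M(xi) in the mu = 0 case
    return np.trace(np.linalg.matrix_power(M, l)) ** (1.0 / l)

def d_all(w):
    # d(x, xi) = 1 - x^T Sigma^{-(l+1)} x / tr(Sigma^{-l}) for every candidate x
    Si = np.linalg.inv(sigma(w))
    Sl1 = np.linalg.matrix_power(Si, l + 1)
    tr_l = np.trace(np.linalg.matrix_power(Si, l))
    return 1.0 - np.einsum("ij,jk,ik->i", cands, Sl1, cands) / tr_l

w = np.full(len(cands), 1.0 / len(cands))         # xi_0: uniform design
vals = [phi_l(w)]
for t in range(100):
    d = d_all(w)
    i = int(np.argmin(d))                         # x_t = argmin_x d(x, xi_t)
    if d[i] > -1e-8:                              # equivalence-theorem stop rule
        break
    def mixed(a):                                 # (1 - alpha) xi_t + alpha delta_x
        w2 = (1 - a) * w
        w2[i] += a
        return w2
    # alpha_t by grid search; cap at 0.9 so Sigma stays nonsingular
    a_t = min(np.linspace(0.0, 0.9, 91), key=lambda a: phi_l(mixed(a)))
    w = mixed(a_t)
    vals.append(phi_l(w))
print(vals[0], vals[-1])                          # Phi_l(M(xi_t)) is nonincreasing
```

Because the grid for \(\alpha _t\) includes \(\alpha =0\), the sequence \(\varPhi _l(M(\xi _t))\) is nonincreasing by construction, mirroring the monotonicity used at the start of this proof.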
Proof of Theorem 3
Analogous to the proof of Theorem 2, here we need only to prove that there exists a real number \(\gamma \ge 0\) such that
It should be noted that if the left-hand side were greater than or equal to zero, \(\alpha =0\) would be a minimizer, which contradicts the assumption that \(\alpha =0\) is not a minimizer. \(\square \)
Appendix B: The four designs in Example 2
Huang, Y., Kong, X. & Ai, M. Optimal designs in sparse linear models. Metrika 83, 255–273 (2020). https://doi.org/10.1007/s00184-019-00722-9