Abstract
The experimental design problem concerns the selection of k points from a potentially large design pool of p-dimensional vectors, so as to maximize the statistical efficiency of the regression model estimated from the k selected design points. Statistical efficiency is measured by optimality criteria, including A(verage), D(eterminant), T(race), E(igen), V(ariance) and G-optimality. Except for T-optimality, exact optimization is challenging, and for certain instances of D/E-optimality exact or even approximate optimization is provably NP-hard. We propose a polynomial-time regret minimization framework that achieves a \((1+\varepsilon )\) approximation with only \(O(p/\varepsilon ^2)\) design points, for all the optimality criteria above. In contrast, to the best of our knowledge, no polynomial-time algorithm prior to our work achieves a \((1+\varepsilon )\) approximation for D/E/G-optimality, and the best polynomial-time algorithm achieving a \((1+\varepsilon )\) approximation for A/V-optimality requires \(k=\varOmega (p^2/\varepsilon )\) design points.
Notes
In the worst case, even the exact minimum \(\min _{|S|\le k} f( X_S^\top X_S)\) can indeed be O(n/k) times larger than \(f( X^\top X)\) [8]. This worst-case scenario may not always happen, but to the best of our knowledge, their analysis is tight in this worst case.
Indeed, the maximum possible value for \(c_t\) is \(c_u = \sqrt{p}\), because \(\mathrm {tr}[(\sqrt{p} I+\alpha Z)^{-2}] \le \mathrm {tr}[(\sqrt{p} I)^{-2}] = 1\); and \(c_t\) must be strictly above \(c_\ell = -\alpha \lambda _{\min }(Z)\), because \(\mathrm {tr}[(c I+\alpha Z)^{-2}]\rightarrow +\infty \) as \(c\rightarrow c_\ell ^+\). One can show (see the proof of Claim 2.4) that the value \(\mathrm {tr}[(c I+\alpha Z)^{-2}]\) is a strictly decreasing function of \(c \in (c_\ell , c_u]\).
A careful stability analysis for the \(\ell _{1/2}\) strategy has already appeared in the appendix of [4].
\((A+UCV)^{-1} = A^{-1} - A^{-1} U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}\), provided that all inverses exist.
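As a quick sanity check (ours, not part of the paper), the identity can be verified numerically on random well-conditioned inputs; the sketch below assumes NumPy.

```python
import numpy as np

# Numerical sanity check of the Woodbury identity
# (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}.
rng = np.random.default_rng(0)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # invertible n x n
C = np.diag(rng.uniform(1.0, 2.0, size=k))   # invertible k x k
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, n))

lhs = np.linalg.inv(A + U @ C @ V)
A_inv = np.linalg.inv(A)
rhs = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv
assert np.allclose(lhs, rhs)                 # agrees up to round-off
```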
\(|x_1|+\cdots +|x_d|\le \sqrt{d}\cdot \sqrt{x_1^2+\cdots +x_d^2}\) for any d real numbers \(x_1,\dots ,x_d\).
The ellipsoid method runs in polynomial time as long as (1) the domain is polynomially bounded (which is the case, since the probability simplex is bounded), and (2) the separation oracle can be implemented in polynomial time (which is the case for all the optimality criteria that we study in this paper).
A similar landscape also appears in stochastic gradient methods. For instance, for the ridge regression problem, although the interior-point or ellipsoid method gives a polynomial-time algorithm, in practice one still prefers stochastic gradient methods such as SDCA [41], whose performance depends on the properties of the covariance matrix.
Of course, if the current solution \({\hat{s}}\) leads to \(\sum _{i}{\hat{s}}_ix_ix_i^\top \) that is not full rank, then we always continue to the next iteration.
In backtracking line search, for every iteration t a preliminary step size of \(\eta _t=1\) is used and the step size is repeatedly halved until the Armijo-Goldstein condition \(f(\omega ^{(t+1)})\le f(\omega ^{(t)}) + 0.5\langle g^{(t)}, \omega ^{(t+1)}-\omega ^{(t)}\rangle \) is satisfied, where \(\omega ^{(t+1)}\) is the (projected) next step under step size \(\eta _t\).
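For concreteness, a minimal sketch of this backtracking rule is given below (our illustration, not the authors' implementation; the objective `f`, its gradient `grad_f`, and the projection `project` onto the feasible set are assumed to be supplied by the caller).

```python
import numpy as np

def backtracking_step(f, grad_f, project, w, eta0=1.0, c=0.5, max_halvings=50):
    """One projected descent step with the halving rule described above."""
    g = grad_f(w)
    eta = eta0
    for _ in range(max_halvings):
        w_next = project(w - eta * g)
        # Armijo-Goldstein condition: f(w+) <= f(w) + 0.5 * <g, w+ - w>.
        if f(w_next) <= f(w) + c * np.dot(g, w_next - w):
            return w_next, eta
        eta *= 0.5
    return w, 0.0  # no acceptable step size found; stay put
```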
References
Ageev, A.A., Sviridenko, M.I.: Pipage rounding: a new method of constructing algorithms with proven performance guarantee. J. Comb. Optim. 8(3), 307–328 (2004)
Allen-Zhu, Z., Li, Y.: Follow the compressed leader: faster online learning of eigenvectors and faster MMWU. In: Proceedings of the International Conference on Machine Learning (ICML). Full version available at arXiv:1701.01722 (2017)
Allen-Zhu, Z., Li, Y., Singh, A., Wang, Y.: Near-optimal design of experiments via regret minimization. In: Proceedings of the International Conference on Machine Learning (ICML), (2017)
Allen-Zhu, Z., Liao, Z., Orecchia, L.: Spectral sparsification and regret minimization beyond matrix multiplicative updates. In: Proceedings of Annual Symposium on the Theory of Computing (STOC), (2015)
Angluin, D.: Queries and concept learning. Mach. Learn. 2(4), 319–342 (1988)
Arora, S., Kale, S.: A combinatorial, primal-dual approach to semidefinite programs. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (2007)
Audibert, J.-Y., Bubeck, S., Lugosi, G.: Minimax policies for combinatorial prediction games. In: Proceedings of Conference on Learning Theory (COLT), (2011)
Avron, H., Boutsidis, C.: Faster subset selection for matrices and applications. SIAM J. Matrix Anal. Appl. 34(4), 1464–1499 (2013)
Bach, F.: Submodular functions: from discrete to continuous domains. arXiv preprint arXiv:1511.00394 (2015)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)
Bhatia, R.: Matrix Analysis, volume 169 of Graduate Texts in Mathematics. Springer, New York, NY (1997)
Bian, A.A., Buhmann, J.M., Krause, A., Tschiatschek, S.: Guarantees for greedy maximization of non-submodular functions with applications. In: Proceedings of International Conference on Machine Learning (ICML), (2017)
Bouhtou, M., Gaubert, S., Sagnol, G.: Submodularity and randomized rounding techniques for optimal experimental design. Electron. Notes Discret. Math. 36, 679–686 (2010)
Boutsidis, C., Woodruff, D.P.: Optimal CUR matrix decompositions. In: Proceedings of Annual Symposium on the Theory of Computing (STOC), (2014)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Černỳ, M., Hladík, M.: Two complexity results on C-optimality in experimental design. Comput. Optim. Appl. 51(3), 1397–1408 (2012)
Chaloner, K., Verdinelli, I.: Bayesian experimental design: a review. Stat. Sci. 10(3), 273–304 (1995)
Chamon, L.F.O., Ribeiro, A.: Greedy sampling of graph signals. arXiv preprint arXiv:1704.01223, (2017)
Chaudhuri, K., Kakade, S., Netrapalli, P., Sanghavi, S.: Convergence rates of active learning for maximum likelihood estimation. In: Proceedings of Advances in Neural Information Processing Systems (NIPS), (2015)
Chen, S., Sandryhaila, A., Moura, J.M.F., Kovačević, J.: Signal recovery on graphs: Variation minimization. IEEE Trans. Signal Process. 63(17), 4609–4624 (2015)
Chen, S., Varma, R., Singh, A., Kovačević, J.: Signal representations on graphs: Tools and applications. arXiv preprint arXiv:1512.05406, (2015)
Chen, S., Varma, R., Singh, A., Kovačević, J.: Signal recovery on graphs: fundamental limits of sampling strategies. IEEE Trans. Signal Inf. Process. Over Netw. 2(4), 539–554 (2016)
Çivril, A., Magdon-Ismail, M.: On selecting a maximum volume sub-matrix of a matrix and related problems. Theoret. Comput. Sci. 410(47–49), 4801–4811 (2009)
Condat, L.: Fast projection onto the simplex and the L1-ball. Math. Program. 158, 575–585 (2015)
Dereziński, M., Warmuth, M.K.: Reverse iterative volume sampling for linear regression. J. Mach. Learn. Res. 19(1), 853–891 (2018)
Dhillon, P., Lu, Y., Foster, D.P., Ungar, L.: New subsampling algorithms for fast least squares regression. In: Proceedings of Advances in Neural Information Processing Systems (NIPS) (2013)
Drineas, P., Mahoney, M.W.: On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res. 6(12), 2153–2175 (2005)
Drineas, P., Mahoney, M.W., Muthukrishnan, S.: Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl. 30(2), 844–881 (2008)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the L1-ball for learning in high dimensions. In: Proceedings of International Conference on Machine learning (ICML) (2008)
Fedorov, V.V.: Theory of Optimal Experiments. Elsevier, Amsterdam (1972)
Joshi, S., Boyd, S.: Sensor selection via convex optimization. IEEE Trans. Signal Process. 57(2), 451–462 (2009)
Li, C., Jegelka, S., Sra, S.: Polynomial time algorithms for dual volume sampling. In: Proceedings of Advances in Neural Information Processing Systems (NIPS) (2017)
Lieb, E.H.: Convex trace functions and the Wigner-Yanase-Dyson conjecture. Adv. Math. 11(3), 267–288 (1973)
McMahan, H.B.: Follow-the-regularized-leader and mirror descent: equivalence theorems and L1 regularization. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) (2011)
Miller, A., Nguyen, N.-K.: A Fedorov exchange algorithm for D-optimal design. J. Roy. Stat. Soc.: Ser. C (Appl. Stat.) 43(4), 669–677 (1994)
Nikolov, A.: Randomized rounding for the largest simplex problem. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (2015)
Nikolov, A., Singh, M.: Maximizing determinants under partition constraints. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (2016)
Nikolov, A., Singh, M., Tantipongpipat, U.T.: Proportional volume sampling and approximation algorithms for A-optimal design. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2019)
Pukelsheim, F.: Optimal Design of Experiments. SIAM, Philadelphia (2006)
Rakhlin, A.: Lecture notes on online learning. Draft (2009). http://www-stat.wharton.upenn.edu/~rakhlin/courses/stat991/papers/lecture_notes.pdf
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
Singh, M., Xie, W.: Approximate positive correlated distributions and approximation algorithms for D-optimal design. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2017)
Spielman, D.A., Srivastava, N.: Graph sparsification by effective resistances. SIAM J. Comput. 40(6), 1913–1926 (2011)
Summa, M.D., Eisenbrand, F., Faenza, Y., Moldenhauer, C.: On largest volume simplices and sub-determinants. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2015)
Van der Vaart, A.W.: Asymptotic Statistics, vol. 3. Cambridge University Press, Cambridge (2000)
Wang, S., Zhang, Z.: Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. J. Mach. Learn. Res. 14(1), 2729–2769 (2013)
Wang, Y., Singh, A.: Provably correct active sampling algorithms for matrix column subset selection with missing data. J. Mach. Learn. Res. 18(156), 1–42 (2018)
Wang, Y., Wei Adams, Y., Singh, A.: On computationally tractable selection of experiments in regression models. J. Mach. Learn. Res. 18(143), 1–41 (2017)
Welch, W.J.: Algorithmic complexity: three NP-hard problems in computational statistics. J. Stat. Comput. Simul. 15(1), 17–25 (1982)
Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the International Conference on Machine Learning (ICML) (2003)
Acknowledgements
We thank Adams Wei Yu for helpful discussions regarding the implementation of the entropic mirror descent solver for the continuous (convex) relaxation problem, and Aleksandar Nikolov, Shayan Oveis Gharan, and Mohit Singh for discussions on the references. This work is supported by NSF CCF-1563918, NSF CAREER IIS-1252412 and AFRL FA87501720212.
Additional information
A preliminary 8-page extended abstract of this paper, with weaker results, appeared at the ICML 2017 conference [3]. Author names are listed in alphabetical order.
Appendix A: Missing proofs
A.1 Proof of Claim 2.4
Claim 2.4
(closed form \(\ell _{1/2}\) strategy) Assume without loss of generality that \( A_0=(c_0 I+\alpha Z_0)^{-2}\) for some \(c_0\in {\mathbb {R}}\), \(\alpha >0\) and positive semi-definite matrix \( Z_0\) such that \(c_0 I+\alpha Z_0\succ 0\). Then, for every \(t\ge 1\),
\[ A_t=\Big (c_t I+\alpha Z_0+\alpha \sum _{\ell =0}^{t-1}{ F_\ell }\Big )^{-2},\]
where \(c_t\in {\mathbb {R}}\) is the unique constant so that \(c_t I+\alpha Z_0 + \alpha \sum _{\ell =0}^{t-1}{ F_\ell } \succ 0\) and \(\mathrm {tr}( A_t)=1\).
Proof of Claim 2.4
We first show that for any symmetric matrix \(Z\in {\mathbb {R}}^{p\times p}\), there exists unique \(c\in {\mathbb {R}}\) such that \(\alpha Z+c I\succ 0\) and \(\mathrm {tr}[(\alpha Z+cI)^{-2}]=1\). By simple asymptotic analysis, \(\lim _{c\rightarrow (-\alpha \lambda _{\min }(Z))^+}\mathrm {tr}[(\alpha Z+cI)^{-2}]=+\infty \) and \(\lim _{c\rightarrow +\infty }\mathrm {tr}[(\alpha Z+cI)^{-2}]=0\). Because \(\mathrm {tr}[(\alpha Z+cI)^{-2}]\) is a continuous and strictly decreasing function in c on the open interval \((-\alpha \lambda _{\min }(Z), +\infty )\), we conclude that there must exist a unique c such that \(\mathrm {tr}[(\alpha Z+cI)^{-2}]=1\). The range of c also ensures that \(\alpha Z+cI\succ 0\).
We now use induction to prove this claim. For \(t=0\) the claim is obviously correct. We shall now assume that the claim holds true for \( A_{t-1}\) (i.e., \( A_{t-1}=(c_{t-1} I+\alpha Z_0+\alpha \sum _{\ell =0}^{t-2}{ F_\ell })^{-2}\)) for some \(t\ge 1\), and prove the same for \( A_t\).
The KKT conditions of the optimization problem, together with the gradient of the Bregman divergence \(\varDelta _\psi \), yield
where the \(d_t I\) term arises from the Lagrangian multiplier, and \(d_t\in {\mathbb {R}}\) is the unique number that makes \(-\nabla \psi ( A_{t-1})+\alpha F_{t-1}-d_t I\preceq 0\) (because \(\nabla \psi ( A_t)\succeq 0\)) and \(\mathrm {tr}( A_t)=\mathrm {tr}((\nabla \psi )^{-1}(\nabla \psi ( A_{t-1})+d_t I-\alpha F_{t-1}))=1\). Re-organizing terms in Eq. (A.1) and invoking the induction hypothesis, we have
Because \(d_t\) is the unique number that ensures \( A_t\succeq 0\) and \(\mathrm {tr}( A_t)=1\), and \( Z_0+\sum _{\ell =0}^{t-1}{ F_\ell }\succeq 0\), it must hold that \(c_{t-1}-d_t=c_t\). Subsequently, \(\nabla \psi ( A_t)=- A_t^{-1/2}=-c_t I-\alpha Z_0-\alpha \sum _{\ell =0}^{t-1}{ F_\ell }\). The claim is thus proved by raising both sides of the identity to the power of \(-2\). \(\square \)
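To make the closed form concrete, the following small numerical sketch (ours, not part of the paper) recovers \(c_t\) by bisection, using the monotonicity of \(c\mapsto \mathrm {tr}[(cI+\alpha M)^{-2}]\) established above; here M plays the role of \( Z_0+\sum _{\ell }{ F_\ell }\), and the bracketing interval follows footnote 2.

```python
import numpy as np

def find_c(M, alpha, tol=1e-10):
    """Return the unique c with c*I + alpha*M > 0 and tr[(c*I + alpha*M)^{-2}] = 1.

    M is symmetric PSD; the trace is strictly decreasing in c on
    (-alpha * lambda_min(M), +infty), so bisection applies."""
    p = M.shape[0]
    lam = np.linalg.eigvalsh(M)                    # eigenvalues, ascending
    trace = lambda c: np.sum((c + alpha * lam) ** -2.0)
    lo = -alpha * lam[0] + 1e-12                   # trace blows up near this end
    hi = np.sqrt(p)                                # trace <= 1 at c = sqrt(p)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if trace(mid) > 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

# Example: A = (c*I + alpha*M)^{-2} then has unit trace.
alpha, M = 0.1, np.eye(3) + np.ones((3, 3))
c = find_c(M, alpha)
A = np.linalg.matrix_power(np.linalg.inv(c * np.eye(3) + alpha * M), 2)
assert abs(np.trace(A) - 1.0) < 1e-6
```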
A.2 Proof of Claim 2.13
Claim 2.13
Suppose \(P_t^\top A_t^{1/2}P_t=[b\;\; d; d \;\;c]\in {\mathbb {R}}^{2\times 2}\) and \(2\alpha \langle A_t^{1/2}, v_tv_t^\top \rangle < 1\). Then
Proof of Claim 2.13
Define \(R=\left[ \begin{array}{cc} b&{} -d\\ -d&{} c\end{array}\right] \). Because \(P_t^\top A_t^{1/2}P_t = \left[ \begin{array}{cc} b&{} d\\ d&{} c\end{array}\right] \) is positive semi-definite, we conclude that R is also positive semi-definite and hence can be written as \(R=QQ^\top \). To prove Claim 2.13, we only need to establish the positive semi-definiteness of the following difference matrix:
Here in the last equality we again use the Woodbury matrix identity. It is clear that to prove the positive semi-definiteness of the right-hand side of the above equality, it suffices to show \(Q^\top (J+\mathrm {diag}(2b,2c))^{-1}Q\prec I\). By standard matrix analysis and the fact that \(J=\mathrm {diag}(1,-1)\),
Some steps in the above derivation require additional explanation. In (a), we use the fact that \(2c=2\alpha \langle A_t^{1/2},v_tv_t^\top \rangle <1\), and hence \((1-2c)^{-1}>0\); in (b), we use the fact that \(QQ^\top = \left[ \begin{array}{cc} b&{} -d\\ -d&{} c\end{array}\right] \preceq \left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] \); finally, (c) holds because \(b=2\alpha \langle A_t^{1/2},u_tu_t^\top \rangle \ge 0\), \(\frac{2b}{1+2b}<1\) and \(2c<1\). The proof of Claim 2.13 is thus completed. \(\square \)
A.3 Proof of Claim 2.14
Claim 2.14
Suppose \(Z\succeq 0\) is a p-dimensional PSD matrix with \(\lambda _{\min }(Z)\le 1\). Let \(A=(\alpha Z+cI)^{-2}\), where \(c\in {\mathbb {R}}\) is the unique real number such that \(A\succeq 0\) and \(\mathrm {tr}(A)=1\). Then
1. \(\alpha \langle A^{1/2}, Z\rangle \le p+\alpha \sqrt{p}\);
2. \(\langle A, Z\rangle \le \sqrt{p}/\alpha + \lambda _{\min }(Z)\).
Proof of Claim 2.14
For any orthogonal matrix U, the transform \(Z\mapsto UZU^\top \) leads to \(A\mapsto U AU^\top \) and \(A^{1/2}\mapsto UA^{1/2}U^\top \); thus both inner products are invariant to orthogonal transforms of Z. Therefore, we may assume without loss of generality that \(Z=\mathrm {diag}(\lambda _1,\dots ,\lambda _p)\) with \(\lambda _1\ge \cdots \ge \lambda _p\ge 0\), because \(Z\succeq 0\). Subsequently,
If \(c\ge 0\), then \(\alpha \langle A^{1/2},Z\rangle \le p\) and the first property is clearly true. For the case of \(c<0\), note that c must be strictly larger than \(-\alpha \lambda _p\), as we established in Claim 2.4. Subsequently, by the Cauchy-Schwarz inequality,
Because \(\lambda _p=\lambda _{\min }(Z)\le 1\) and \(\mathrm {tr}(A)=\mathrm {tr}[(\alpha Z+cI)^{-2}]=1\), we have that \(c\ge -\alpha \) and \(\sqrt{\sum _{i=1}^p{(\alpha \lambda _i+c)^{-2}}} = 1\). Therefore, \(\alpha \langle A^{1/2},Z\rangle \le p+\alpha \sqrt{p}\), which establishes the first property in Claim 2.14.
We next turn to the second property. Using similar analysis, we have
Property 2 is then proved by noting that \(c>-\alpha \lambda _{\min }(Z)\). \(\square \)
A.4 Proof of correctness of Algorithm 4
Recall the KL divergence function \(\mathrm {KL}(y\Vert \omega ):= \sum _i y_i \log \frac{y_i}{\omega _i}\).
Claim A.1
The output of Algorithm 4 is exactly the projection \(\omega '=\arg \min _{y\in \varDelta _n,y_i\le b} \mathrm {KL}(y\Vert \omega )\) in KL divergence, provided that the input \(\omega \) itself lies in the probability simplex \(\varDelta _n\).
Proof
We first show that if \(\omega _i>\omega _j\) then \(\omega _i'\ge \omega _j'\). Assume on the contrary that \(\omega _i>\omega _j\) and \(\omega _i'<\omega _j'\). Define \(\omega ''=\omega '\) except that the \(i\hbox {th}\) and \(j\hbox {th}\) components are swapped; that is, \(\omega _i''=\omega _j'\) and \(\omega _j''=\omega _i'\). It is clear that \(\omega ''\in \varDelta _n\) and satisfies the box constraint \(\Vert \omega ''\Vert _{\infty }\le b\). In addition,
which violates the optimality of \(\omega '\).
We next consider the Lagrangian of the constrained optimization problem
By the KKT conditions, we have \(\partial {\mathcal {L}}(\omega ')/\partial y_i = \log (y_i/\omega _i) + (1+\mu +\lambda _i) = 0\). By complementary slackness, if \(\omega _i'<b\) then \(\lambda _i=0\), and hence there exists a unique constant \(C>0\) such that \(\omega _i'=C\cdot \omega _i\) holds for all i with \(\omega _i'<b\). Combining this fact with the monotone correspondence between \(\omega '\) and \(\omega \), one only needs to search for the exact number of components in \(\omega '\) that are equal to b, and compute the unique constant C and the remaining coordinates. Since there are at most n such possibilities, a linear scan over the choices lets us pick the solution that attains the minimum KL divergence. This is exactly what Algorithm 4 computes: an O(n)-time linear scan over the parameter q (meaning that exactly \(q-1\) components are equal to b), preceded by an \(O(n \log n)\)-time pre-processing step that sorts the coordinates of \(\omega \). \(\square \)
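A compact sketch of this procedure is given below (our reconstruction of the scan described above, not the verbatim Algorithm 4; it assumes \(nb\ge 1\) so that the constraint set is non-empty).

```python
import numpy as np

def kl_project_capped_simplex(omega, b):
    """Project omega (in the probability simplex) onto {y : sum(y) = 1, 0 <= y_i <= b}
    in KL divergence, via the linear scan described in the proof of Claim A.1."""
    omega = np.asarray(omega, dtype=float)
    n = omega.size
    order = np.argsort(-omega)                 # coordinates in decreasing order
    w = omega[order]
    best, best_kl = None, np.inf
    for m in range(n):                         # m = number of coordinates clamped at b
        rest = w[m:].sum()
        mass = 1.0 - m * b                     # mass left for the unclamped coordinates
        if mass < 0 or rest <= 0:
            break
        C = mass / rest                        # common scaling factor for unclamped coords
        if C * w[m] > b + 1e-12:               # largest unclamped coordinate must stay <= b
            continue
        y = np.concatenate([np.full(m, b), C * w[m:]])
        kl = np.sum(y * np.log(np.maximum(y, 1e-300) / np.maximum(w, 1e-300)))
        if kl < best_kl:
            best, best_kl = y, kl
    out = np.empty(n)
    out[order] = best                          # undo the sorting
    return out

# Example: cap each coordinate of a skewed distribution at 0.4.
print(kl_project_capped_simplex([0.7, 0.2, 0.1], b=0.4))   # -> [0.4, 0.4, 0.2]
```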
Cite this article
Allen-Zhu, Z., Li, Y., Singh, A. et al. Near-optimal discrete optimization for experimental design: a regret minimization approach. Math. Program. 186, 439–478 (2021). https://doi.org/10.1007/s10107-019-01464-2