
Near-optimal discrete optimization for experimental design: a regret minimization approach

  • Full Length Paper
  • Series A
  • Mathematical Programming

Abstract

The experimental design problem concerns the selection of k points from a potentially large design pool of p-dimensional vectors, so as to maximize the statistical efficiency of regression on the selected k design points. Statistical efficiency is measured by optimality criteria, including A(verage), D(eterminant), T(race), E(igen), V(ariance) and G-optimality. Except for T-optimality, exact optimization is challenging, and for certain instances of D/E-optimality exact or even approximate optimization is provably NP-hard. We propose a polynomial-time regret minimization framework that achieves a \((1+\varepsilon )\) approximation with only \(O(p/\varepsilon ^2)\) design points, for all the optimality criteria above. In contrast, to the best of our knowledge, no previous polynomial-time algorithm achieves a \((1+\varepsilon )\) approximation for D/E/G-optimality, and the best polynomial-time algorithm achieving a \((1+\varepsilon )\) approximation for A/V-optimality requires \(k=\varOmega (p^2/\varepsilon )\) design points.


Notes

  1. Very often, an algorithm for Problem (1.1) can be directly generalized to solve Problem (1.2) without blowing up the problem size. The algorithm proposed in this paper is one such example.

  2. In the worst case, even the exact minimum \(\min _{|S|\le k} f( X_S^\top X_S)\) can indeed be O(n/k) times larger than \(f( X^\top X)\) [8]. This worst-case scenario does not always occur, but to the best of our knowledge the analysis in [8] is tight in this worst case.

  3. Indeed, the maximum possible value of \(c_t\) is \(c_u = \sqrt{p}\), because \(\mathrm {tr}[(\sqrt{p} I+\alpha Z)^{-2}] \le \mathrm {tr}[(\sqrt{p} I)^{-2}] = 1\) when \(Z\succeq 0\); and the minimum possible value of \(c_t\) must be above \(c_\ell = -\alpha \lambda _{\min }(Z)\), because \(\mathrm {tr}[(c I+\alpha Z)^{-2}] \rightarrow +\infty > 1\) as \(c\rightarrow c_\ell ^+\). One can show (see the proof of Claim 2.4) that the value \(\mathrm {tr}[(c I+\alpha Z)^{-2}]\) is monotonically decreasing in \(c \in (c_\ell , c_u]\).

  4. A careful stability analysis for the \(\ell _{1/2}\) strategy has already appeared in the appendix of [4].

  5. The convexity of this objective follows from Lieb’s concavity theorem [11, 33], and is already a known fact in the matrix regret minimization literature [4].

  6. (Woodbury matrix identity) \((A+UCV)^{-1} = A^{-1} - A^{-1} U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}\), provided that all inverses exist.

  7. \(|x_1|+\cdots +|x_d|\le \sqrt{d}\cdot \sqrt{x_1^2+\cdots +x_d^2}\) for any d real numbers \(x_1,\dots ,x_d\) (a consequence of the Cauchy-Schwarz inequality).

  8. The ellipsoid method runs in polynomial time as long as (1) the domain is polynomially bounded (which is the case, since the probability simplex is bounded), and (2) the separation oracle can be implemented in polynomial time (which is the case for all the optimality criteria studied in this paper).

  9. A similar landscape also appears in stochastic gradient methods. For instance, for the ridge regression problem, although the interior-point or ellipsoid method gives a polynomial-time algorithm, in practice one still prefers stochastic gradient methods such as SDCA [41], whose performance depends on properties of the covariance matrix.

  10. Of course, if the current solution \({\hat{s}}\) leads to a matrix \(\sum _{i}{\hat{s}}_ix_ix_i^\top \) that is not full rank, then we simply continue to the next iteration.

  11. In backtracking line search, for every iteration t a preliminary step size of \(\eta _t=1\) is used, and the step size is repeatedly halved until the Armijo-Goldstein condition \(f(\omega ^{(t+1)})\le f(\omega ^{(t)}) + 0.5\langle g^{(t)}, \omega ^{(t+1)}-\omega ^{(t)}\rangle \) is satisfied, where \(\omega ^{(t+1)}\) is the (projected) next iterate under step size \(\eta _t\) (a minimal implementation sketch follows these notes).

  12. For instance, in Tables 3, 4 and 5, it can be observed that Weighted and Swapping have very similar running time. This means the computational overhead (i.e., the rounding process) introduced by the swapping algorithm is negligible.
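The backtracking rule of Note 11 is straightforward to implement. Below is a minimal Python sketch, not the paper's code: the objective f, its gradient grad_f, and the projection proj onto the feasible set are hypothetical placeholders supplied by the caller. For concreteness the candidate step is a plain projected-gradient step; the paper's solver is an entropic mirror descent variant, so the candidate update would differ, but the step-size halving rule is the same.

```python
import numpy as np

def backtracking_step(f, grad_f, proj, omega, eta0=1.0, shrink=0.5, c=0.5, max_halvings=50):
    """One projected step with backtracking line search.

    The step size starts at eta0 and is halved until the Armijo-Goldstein condition
        f(omega_next) <= f(omega) + c * <g, omega_next - omega>
    holds, matching Note 11 (where c = 0.5).
    """
    g = grad_f(omega)
    f0 = f(omega)
    eta = eta0
    for _ in range(max_halvings):
        omega_next = proj(omega - eta * g)          # (projected) candidate step
        if f(omega_next) <= f0 + c * np.dot(g, omega_next - omega):
            return omega_next, eta                  # condition satisfied
        eta *= shrink                               # halve the step size and retry
    return omega, 0.0                               # fall back to no movement
```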

References

  1. Ageev, A.A., Sviridenko, M.I.: Pipage rounding: a new method of constructing algorithms with proven performance guarantee. J. Comb. Optim. 8(3), 307–328 (2004)

  2. Allen-Zhu, Z., Li, Y.: Follow the compressed leader: faster online learning of eigenvectors and faster MMWU. In: Proceedings of the International Conference on Machine Learning (ICML). Full version available at arXiv:1701.01722 (2017)

  3. Allen-Zhu, Z., Li, Y., Singh, A., Wang, Y.: Near-optimal design of experiments via regret minimization. In: Proceedings of the International Conference on Machine Learning (ICML) (2017)

  4. Allen-Zhu, Z., Liao, Z., Orecchia, L.: Spectral sparsification and regret minimization beyond matrix multiplicative updates. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (2015)

  5. Angluin, D.: Queries and concept learning. Mach. Learn. 2(4), 319–342 (1988)

  6. Arora, S., Kale, S.: A combinatorial, primal-dual approach to semidefinite programs. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (2007)

  7. Audibert, J.-Y., Bubeck, S., Lugosi, G.: Minimax policies for combinatorial prediction games. In: Proceedings of the Conference on Learning Theory (COLT) (2011)

  8. Avron, H., Boutsidis, C.: Faster subset selection for matrices and applications. SIAM J. Matrix Anal. Appl. 34(4), 1464–1499 (2013)

  9. Bach, F.: Submodular functions: from discrete to continuous domains. arXiv preprint arXiv:1511.00394 (2015)

  10. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)

  11. Bhatia, R.: Matrix Analysis. Graduate Texts in Mathematics, vol. 169. Springer, New York (1997)

  12. Bian, A.A., Buhmann, J.M., Krause, A., Tschiatschek, S.: Guarantees for greedy maximization of non-submodular functions with applications. In: Proceedings of the International Conference on Machine Learning (ICML) (2017)

  13. Bouhtou, M., Gaubert, S., Sagnol, G.: Submodularity and randomized rounding techniques for optimal experimental design. Electron. Notes Discret. Math. 36, 679–686 (2010)

  14. Boutsidis, C., Woodruff, D.P.: Optimal CUR matrix decompositions. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (2014)

  15. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  16. Černý, M., Hladík, M.: Two complexity results on C-optimality in experimental design. Comput. Optim. Appl. 51(3), 1397–1408 (2012)

  17. Chaloner, K., Verdinelli, I.: Bayesian experimental design: a review. Stat. Sci. 10(3), 273–304 (1995)

  18. Chamon, L.F.O., Ribeiro, A.: Greedy sampling of graph signals. arXiv preprint arXiv:1704.01223 (2017)

  19. Chaudhuri, K., Kakade, S., Netrapalli, P., Sanghavi, S.: Convergence rates of active learning for maximum likelihood estimation. In: Proceedings of Advances in Neural Information Processing Systems (NIPS) (2015)

  20. Chen, S., Sandryhaila, A., Moura, J.M.F., Kovačević, J.: Signal recovery on graphs: variation minimization. IEEE Trans. Signal Process. 63(17), 4609–4624 (2015)

  21. Chen, S., Varma, R., Singh, A., Kovačević, J.: Signal representations on graphs: tools and applications. arXiv preprint arXiv:1512.05406 (2015)

  22. Chen, S., Varma, R., Singh, A., Kovačević, J.: Signal recovery on graphs: fundamental limits of sampling strategies. IEEE Trans. Signal Inf. Process. Over Netw. 2(4), 539–554 (2016)

  23. Çivril, A., Magdon-Ismail, M.: On selecting a maximum volume sub-matrix of a matrix and related problems. Theoret. Comput. Sci. 410(47–49), 4801–4811 (2009)

  24. Condat, L.: Fast projection onto the simplex and the L1-ball. Math. Program. 158, 575–585 (2015)

  25. Dereziński, M., Warmuth, M.K.: Reverse iterative volume sampling for linear regression. J. Mach. Learn. Res. 19(1), 853–891 (2018)

  26. Dhillon, P., Lu, Y., Foster, D.P., Ungar, L.: New subsampling algorithms for fast least squares regression. In: Proceedings of Advances in Neural Information Processing Systems (NIPS) (2013)

  27. Drineas, P., Mahoney, M.W.: On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res. 6(12), 2153–2175 (2005)

  28. Drineas, P., Mahoney, M.W., Muthukrishnan, S.: Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl. 30(2), 844–881 (2008)

  29. Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the L1-ball for learning in high dimensions. In: Proceedings of the International Conference on Machine Learning (ICML) (2008)

  30. Fedorov, V.V.: Theory of Optimal Experiments. Elsevier, Amsterdam (1972)

  31. Joshi, S., Boyd, S.: Sensor selection via convex optimization. IEEE Trans. Signal Process. 57(2), 451–462 (2009)

  32. Li, C., Jegelka, S., Sra, S.: Polynomial time algorithms for dual volume sampling. In: Proceedings of Advances in Neural Information Processing Systems (NIPS) (2017)

  33. Lieb, E.H.: Convex trace functions and the Wigner-Yanase-Dyson conjecture. Adv. Math. 11(3), 267–288 (1973)

  34. McMahan, H.B.: Follow-the-regularized-leader and mirror descent: equivalence theorems and L1 regularization. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) (2011)

  35. Miller, A., Nguyen, N.-K.: A Fedorov exchange algorithm for D-optimal design. J. Roy. Stat. Soc. Ser. C (Appl. Stat.) 43(4), 669–677 (1994)

  36. Nikolov, A.: Randomized rounding for the largest simplex problem. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (2015)

  37. Nikolov, A., Singh, M.: Maximizing determinants under partition constraints. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (2016)

  38. Nikolov, A., Singh, M., Tantipongpipat, U.T.: Proportional volume sampling and approximation algorithms for A-optimal design. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2019)

  39. Pukelsheim, F.: Optimal Design of Experiments. SIAM, Philadelphia (2006)

  40. Rakhlin, A.: Lecture notes on online learning. Draft (2009). http://www-stat.wharton.upenn.edu/~rakhlin/courses/stat991/papers/lecture_notes.pdf

  41. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  42. Singh, M., Xie, W.: Approximate positive correlated distributions and approximation algorithms for D-optimal design. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2017)

  43. Spielman, D.A., Srivastava, N.: Graph sparsification by effective resistances. SIAM J. Comput. 40(6), 1913–1926 (2011)

  44. Summa, M.D., Eisenbrand, F., Faenza, Y., Moldenhauer, C.: On largest volume simplices and sub-determinants. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2015)

  45. Van der Vaart, A.W.: Asymptotic Statistics, vol. 3. Cambridge University Press, Cambridge (2000)

  46. Wang, S., Zhang, Z.: Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. J. Mach. Learn. Res. 14(1), 2729–2769 (2013)

  47. Wang, Y., Singh, A.: Provably correct active sampling algorithms for matrix column subset selection with missing data. J. Mach. Learn. Res. 18(156), 1–42 (2018)

  48. Wang, Y., Yu, A.W., Singh, A.: On computationally tractable selection of experiments in regression models. J. Mach. Learn. Res. 18(143), 1–41 (2017)

  49. Welch, W.J.: Algorithmic complexity: three NP-hard problems in computational statistics. J. Stat. Comput. Simul. 15(1), 17–25 (1982)

  50. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the International Conference on Machine Learning (ICML) (2003)


Acknowledgements

We thank Adams Wei Yu for helpful discussions regarding the implementation of the entropic mirror descent solver for the continuous (convex) relaxation problem, and Aleksandar Nikolov, Shayan Oveis Gharan, and Mohit Singh for discussions on the references. This work is supported by NSF CCF-1563918, NSF CAREER IIS-1252412 and AFRL FA87501720212.

Author information


Corresponding author

Correspondence to Yining Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary 8-page extended abstract of this paper, with weaker results, appeared at the ICML 2017 conference [3]. Author names are listed in alphabetical order.

Appendix A: Missing proofs


A.1 Proof of Claim 2.4

Claim 2.4

(closed form \(\ell _{1/2}\) strategy) Assume without loss of generality that \( A_0=(c_0 I+\alpha Z_0)^{-2}\) for some \(c_0\in {\mathbb {R}}\), \(\alpha >0\) and positive semi-definite matrix \( Z_0\) such that \(c_0 I+\alpha Z_0\succ 0\). Then,

$$\begin{aligned} A_t = \Big ( c_t I+\alpha Z_0 + \alpha \sum _{\ell =0}^{t-1}{ F_\ell } \Big )^{-2}, \end{aligned}$$

where \(c_t\in {\mathbb {R}}\) is the unique constant so that \(c_t I+\alpha Z_0 + \alpha \sum _{\ell =0}^{t-1}{ F_\ell } \succ 0\) and \(\mathrm {tr}( A_t)=1\).

Proof of Claim 2.4

We first show that for any symmetric matrix \(Z\in {\mathbb {R}}^{p\times p}\), there exists unique \(c\in {\mathbb {R}}\) such that \(\alpha Z+c I\succ 0\) and \(\mathrm {tr}[(\alpha Z+cI)^{-2}]=1\). By simple asymptotic analysis, \(\lim _{c\rightarrow (-\alpha \lambda _{\min }(Z))^+}\mathrm {tr}[(\alpha Z+cI)^{-2}]=+\infty \) and \(\lim _{c\rightarrow +\infty }\mathrm {tr}[(\alpha Z+cI)^{-2}]=0\). Because \(\mathrm {tr}[(\alpha Z+cI)^{-2}]\) is a continuous and strictly decreasing function in c on the open interval \((-\alpha \lambda _{\min }(Z), +\infty )\), we conclude that there must exist a unique c such that \(\mathrm {tr}[(\alpha Z+cI)^{-2}]=1\). The range of c also ensures that \(\alpha Z+cI\succ 0\).

We now use induction to prove this proposition. For \(t=0\) the proposition is obviously correct. We shall now assume that the proposition holds true for \( A_{t-1}\) (i.e., \( A_{t-1}=(c_{t-1} I+\alpha Z_0+\sum _{\ell =0}^{t-2}{ F_\ell })^{-2}\)) for some \(t\ge 1\), and try to prove the same for \( A_t\).

The KKT conditions of the optimization problem, together with the gradient of the Bregman divergence \(\varDelta _\psi \), yield

$$\begin{aligned} \nabla \psi ( A_t)-\nabla \psi ( A_{t-1})+\alpha F_{t-1}-d_t I= 0, \end{aligned}$$
(A.1)

where the \(d_t I\) term arises from the Lagrange multiplier and \(d_t\in {\mathbb {R}}\) is the unique number that makes \(-\nabla \psi ( A_{t-1})+\alpha F_{t-1}-d_t I\preceq 0\) (because \(\nabla \psi ( A_t)\succeq 0\)) and \(\mathrm {tr}( A_t)=\mathrm {tr}((\nabla \psi )^{-1}(\nabla \psi ( A_{t-1})+d_t I-\alpha F_{t-1}))=1\). Reorganizing terms in Eq. (A.1) and invoking the induction hypothesis, we have

$$\begin{aligned} A_t&= (\nabla \psi )^{-1}\left( \nabla \psi ( A_{t-1})+d_t I-\alpha F_{t-1}\right) \\&= (\nabla \psi )^{-1}\left( -c_{t-1} I-\alpha Z_0-\alpha \sum _{\ell =0}^{t-1}{ F_\ell }+d_t I\right) . \end{aligned}$$

Because \(d_t\) is the unique number that ensures \( A_t\succeq 0\) and \(\mathrm {tr}( A_t)=1\), and \( Z_0+\sum _{\ell =0}^{t-1}{ F_\ell }\succeq 0\), it must hold that \(c_{t-1}-d_t=c_t\). Subsequently, \(\nabla \psi ( A_t)=-( A_t^{-1/2})=-c_t I-\alpha Z_0-\alpha \sum _{\ell =0}^{t-1}{ F_\ell }\). The claim is thus proved by raising both sides of the identity to the power of \(-2\). \(\square \)
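The existence-and-uniqueness argument at the start of this proof also gives a practical recipe for computing \(c_t\): since \(\mathrm {tr}[(\alpha Z+c I)^{-2}]\) is continuous and strictly decreasing on \((-\alpha \lambda _{\min }(Z),+\infty )\), the unique root of \(\mathrm {tr}[(\alpha Z+c I)^{-2}]=1\) can be found by bisection. A minimal Python sketch follows; it is an illustration only (not the paper's implementation), assuming Z is symmetric positive semi-definite, with the function name and tolerance chosen by us.

```python
import numpy as np

def solve_c(Z, alpha, tol=1e-10):
    """Find the unique c with alpha*Z + c*I > 0 and tr[(alpha*Z + c*I)^{-2}] = 1.

    Bisection works because the trace is continuous and strictly decreasing in c
    on (-alpha * lambda_min(Z), +infinity), diverging at the left endpoint and
    tending to 0 as c -> +infinity."""
    lam = np.linalg.eigvalsh(Z)                      # eigenvalues of the symmetric matrix Z
    p = lam.size
    trace = lambda c: np.sum(1.0 / (alpha * lam + c) ** 2)
    lo = -alpha * lam.min() + 1e-12                  # just inside the open interval
    hi = np.sqrt(p)                                  # c_u = sqrt(p) when Z is PSD (cf. Note 3)
    while trace(hi) > 1.0:                           # safety guard; not triggered for PSD Z
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if trace(mid) > 1.0:
            lo = mid                                 # trace too large: increase c
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, with \(Z=0\) the routine returns \(c\approx \sqrt{p}\), matching the upper bound \(c_u=\sqrt{p}\) discussed in Note 3.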

A.2 Proof of Claim 2.13

Claim 2.13

Suppose \(P_t^\top A_t^{1/2}P_t=[b\;\; d; d \;\;c]\in {\mathbb {R}}^{2\times 2}\) and \(2\alpha \langle A_t^{1/2}, v_tv_t^\top \rangle < 1\). Then

$$\begin{aligned} (J+P_t^\top A_t^{1/2}P_t)^{-1} = \left( J+\left[ \begin{array}{cc} b&{} d\\ d&{} c\end{array}\right] \right) ^{-1} \succeq \left( J+\left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] \right) ^{-1}. \end{aligned}$$

Proof of Claim 2.13

Define \(R=\left[ \begin{array}{cc} b&{} -d\\ -d&{} c\end{array}\right] \). Because \(P_t^\top A_t^{1/2}P_t = \left[ \begin{array}{cc} b&{} d\\ d&{} c\end{array}\right] \) is positive semi-definite, we conclude that R is also positive semi-definite and hence can be written as \(R=QQ^\top \). To prove Claim 2.13, we only need to establish the positive semi-definiteness of the following difference matrix:

$$\begin{aligned}&\left( J+\left[ \begin{array}{cc} b&{} d\\ d&{} c\end{array}\right] \right) ^{-1} - \left( J+\left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] \right) ^{-1}\\&\quad = \left( J+\left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] - \left[ \begin{array}{cc} b&{} -d\\ -d&{} c\end{array}\right] \right) ^{-1} - \left( J+\left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] \right) ^{-1}\\&\quad = \left( J+\left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] \right) ^{-1}Q\left( I-Q^\top \left( J+\left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] \right) ^{-1}Q\right) ^{-1}Q^\top \left( J+\left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] \right) ^{-1}. \end{aligned}$$

Here in the last equality we again use the Woodbury matrix identity. It is clear that to prove the positive semi-definiteness of the right-hand side of the above equality, it suffices to show \(Q^\top (J+\mathrm {diag}(2b,2c))^{-1}Q\prec I\). By standard matrix analysis and the fact that \(J=\mathrm {diag}(1,-1)\),

$$\begin{aligned} Q^\top \left( J+\left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] \right) ^{-1}Q&= Q^\top \left[ \begin{array}{cc} (1+2b)^{-1}&{} 0\\ 0&{} -(1-2c)^{-1}\end{array}\right] Q\\&\overset{(a)}{\preceq } Q^\top \left[ \begin{array}{cc} (1+2b)^{-1}&{} 0\\ 0&{} 0\end{array}\right] Q \preceq \frac{\Vert QQ^\top \Vert _{\mathrm {op}}}{1+2b}\cdot I\\&\overset{(b)}{\preceq } \frac{\max \{2b, 2c\}}{1+2b}\cdot I \overset{(c)}{\prec } I. \end{aligned}$$

Some steps in the above derivation require additional explanation. In (a), we use the fact that \(2c=2\alpha \langle A_t^{1/2},v_tv_t^\top \rangle <1\), and hence \((1-2c)^{-1}>0\); in (b), we use the fact that \(QQ^\top = \left[ \begin{array}{cc} b&{} -d\\ -d&{} c\end{array}\right] \preceq \left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] \); finally, (c) holds because \(b=2\alpha \langle A_t^{1/2},u_tu_t^\top \rangle \ge 0\), \(\frac{2b}{1+2b}<1\) and \(2c<1\). The proof of Claim 2.13 is thus completed. \(\square \)
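Because Claim 2.13 is stated purely in terms of the scalars b, c, d, it is easy to spot-check numerically. The following randomized sanity check is our own illustration (not part of the paper): it samples positive semi-definite \(2\times 2\) matrices \([b\; d; d\; c]\) satisfying the stated hypothesis \(2c<1\) and verifies that the difference of inverses is positive semi-definite.

```python
import numpy as np

# Randomized spot check of Claim 2.13: for PSD [[b, d], [d, c]] with 2c < 1,
#   (J + [[b, d], [d, c]])^{-1}  >=  (J + [[2b, 0], [0, 2c]])^{-1}   in the PSD order,
# where J = diag(1, -1).
rng = np.random.default_rng(0)
J = np.diag([1.0, -1.0])
for _ in range(10_000):
    b = rng.uniform(0.0, 5.0)
    c = rng.uniform(0.0, 0.499)                      # enforces 2c < 1
    d = rng.uniform(-1.0, 1.0) * np.sqrt(b * c)      # |d| <= sqrt(bc) keeps the matrix PSD
    lhs = np.linalg.inv(J + np.array([[b, d], [d, c]]))
    rhs = np.linalg.inv(J + np.diag([2.0 * b, 2.0 * c]))
    assert np.linalg.eigvalsh(lhs - rhs).min() >= -1e-8
print("Claim 2.13 inequality verified on 10,000 random instances")
```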

A.3 Proof of Claim 2.14

Claim 2.14

Suppose \(Z\succeq 0\) is a p-dimensional PSD matrix with \(\lambda _{\min }(Z)\le 1\). Let \(A=(\alpha Z+cI)^{-2}\), where \(c\in {\mathbb {R}}\) is the unique real number such that \(A\succeq 0\) and \(\mathrm {tr}(A)=1\). Then

  1. \(\alpha \langle A^{1/2}, Z\rangle \le p+\alpha \sqrt{p}\);

  2. \(\langle A, Z\rangle \le \sqrt{p}/\alpha + \lambda _{\min }(Z)\).

Proof of Claim 2.14

For any orthogonal matrix U, the transform \(Z\mapsto UZU^\top \) leads to \(A\mapsto U AU^\top \) and \(A^{1/2}\mapsto UA^{1/2}U^\top \); thus both inner products are invariant to orthogonal transforms of Z. Therefore, we may assume without loss of generality that \(Z=\mathrm {diag}(\lambda _1,\dots ,\lambda _p)\) with \(\lambda _1\ge \cdots \ge \lambda _p\ge 0\), because \(Z\succeq 0\). Subsequently,

$$\begin{aligned} \alpha \langle A^{1/2},Z\rangle = \sum _{i=1}^p{\frac{\alpha \lambda _i}{\alpha \lambda _i+c}} = p - c\cdot \sum _{i=1}^p{\frac{1}{\alpha \lambda _i + c}}. \end{aligned}$$

If \(c\ge 0\), then \(\alpha \langle A^{1/2},Z\rangle \le p\) and the first property is clearly true. For the case of \(c<0\), note that c must be strictly larger than \(-\alpha \lambda _p\), as we established in Claim 2.4. Subsequently, by the Cauchy-Schwarz inequality,

$$\begin{aligned} \alpha \langle A^{1/2},Z\rangle = p - c\cdot \sum _{i=1}^p{\frac{1}{\alpha \lambda _i + c}} \le p-c\cdot \sqrt{p}\cdot \sqrt{\sum _{i=1}^p{\frac{1}{(\alpha \lambda _i+c)^2}}}. \end{aligned}$$

Because \(\lambda _p=\lambda _{\min }(Z)\le 1\) and \(\mathrm {tr}(A)=\mathrm {tr}[(\alpha Z+cI)^{-2}]=1\), we have that \(c\ge -\alpha \) and \(\sqrt{\sum _{i=1}^p{(\alpha \lambda _i+c)^{-2}}} = 1\). Therefore, \(\alpha \langle A^{1/2},Z\rangle \le p+\alpha \sqrt{p}\), which establishes the first property in Claim 2.14.

We next turn to the second property. Using similar analysis, we have

$$\begin{aligned} \alpha \langle Z,A\rangle&= \sum _{i=1}^p{\frac{\alpha \lambda _i}{(\alpha \lambda _i+c)^2}} = \sum _{i=1}^p{\frac{1}{\alpha \lambda _i+c}} - c\cdot \sum _{i=1}^p{\frac{1}{(\alpha \lambda _i+c)^2}}\\&\le \sqrt{p}\cdot \sqrt{\sum _{i=1}^p{\frac{1}{(\alpha \lambda _i+c)^2}}} - c\cdot \sum _{i=1}^p{\frac{1}{(\alpha \lambda _i+c)^2}} \le \sqrt{p}-c. \end{aligned}$$

Property 2 is then proved by noting that \(c>-\alpha \lambda _{\min }(Z)\). \(\square \)

A.4 Proof of correctness of Algorithm 4

Recall the KL divergence function \(\mathrm {KL}(y\Vert \omega ):= \sum _i y_i \log \frac{y_i}{\omega _i}\).

Claim A.1

The output of Algorithm 4 is exactly the projection \(\omega '=\arg \min _{y\in \varDelta _n,y_i\le b} \mathrm {KL}(y\Vert \omega )\) in KL divergence, provided that the input \(\omega \) itself lies in the probability simplex \(\varDelta _n\).

Proof

We first show that if \(\omega _i>\omega _j\) then \(\omega _i'\ge \omega _j'\). Assume for contradiction that \(\omega _i>\omega _j\) and \(\omega _i'<\omega _j'\). Define \(\omega ''=\omega '\) except that the \(i\)th and \(j\)th components are swapped; that is, \(\omega _i''=\omega _j'\) and \(\omega _j''=\omega _i'\). It is clear that \(\omega ''\in \varDelta _n\) and satisfies the box constraint \(\Vert \omega ''\Vert _{\infty }\le b\). In addition,

$$\begin{aligned} \mathrm {KL}(\omega ''\Vert \omega )-\mathrm {KL}(\omega '\Vert \omega ) = \omega _i'\log \frac{\omega _i}{\omega _j} + \omega _j'\log \frac{\omega _j}{\omega _i} = (\omega _i'-\omega _j')\log \frac{\omega _i}{\omega _j} < 0, \end{aligned}$$

which violates the optimality of \(\omega '\).

We next consider the Lagrangian multiplier of the constrained optimization problem

$$\begin{aligned} {\mathcal {L}}(y;\mu ,\lambda ) = \sum _{i=1}^n{y_i\log \frac{y_i}{\omega _i}} + \mu \left( \sum _{i=1}^n{y_i}-1\right) + \sum _{i=1}^n{\lambda _i(y_i-b)}. \end{aligned}$$

By the KKT conditions, we have \(\partial {\mathcal {L}}(\omega ')/\partial y_i = \log (\omega _i'/\omega _i) + (1+\mu +\lambda _i) = 0\). By complementary slackness, if \(\omega _i'<b\) then \(\lambda _i=0\), and hence there exists a unique constant \(C>0\) such that \(\omega _i'=C\cdot \omega _i\) holds for all components with \(\omega _i'<b\). Combining this fact with the monotone correspondence between \(\omega '\) and \(\omega \), one only needs to determine the number of components in \(\omega '\) that are equal to b, and then compute the unique constant C and the remaining coordinates. Since there are at most n such possibilities, a linear scan over the choices selects the solution that attains the minimum KL divergence. This is exactly what Algorithm 4 computes: an O(n)-time linear scan over the parameter q (meaning that exactly \(q-1\) components are equal to b), preceded by an \(O(n \log n)\)-time pre-processing step that sorts the coordinates of \(\omega \). \(\square \)
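The characterization in this proof translates directly into code. Below is a minimal Python sketch of the KL projection onto the capped simplex \(\{y\in \varDelta _n: y_i\le b\}\); it is our own illustration and not the paper's Algorithm 4 verbatim. The loop scans over the number m of coordinates clipped at b (so m plays the role of \(q-1\) in the proof), rescales the remaining coordinates by a common constant C, and keeps the feasible candidate with the smallest KL divergence. It assumes \(\omega \) has strictly positive entries and \(nb\ge 1\), so that the constraint set is nonempty.

```python
import numpy as np

def kl_project_capped_simplex(omega, b):
    """Return argmin_{y in simplex, y_i <= b} KL(y || omega).

    Structure from Claim A.1: the largest coordinates of omega are clipped at b,
    and the remaining coordinates are rescaled by a common constant C."""
    omega = np.asarray(omega, dtype=float)
    n = omega.size
    order = np.argsort(-omega)              # indices of omega in decreasing order
    w = omega[order]
    best_y, best_kl = None, np.inf
    for m in range(n):                      # m coordinates clipped at b (m = q - 1)
        mass = 1.0 - m * b                  # probability mass left for the unclipped part
        if mass < 0.0:
            break                           # clipping more coordinates is infeasible
        C = mass / w[m:].sum()
        if C * w[m:].max() > b + 1e-12:     # a rescaled coordinate would exceed the cap
            continue
        y = np.concatenate([np.full(m, b), C * w[m:]])
        with np.errstate(divide="ignore", invalid="ignore"):
            kl = np.sum(np.where(y > 0.0, y * np.log(y / w), 0.0))
        if kl < best_kl:
            best_y, best_kl = y, kl
    out = np.empty(n)
    out[order] = best_y                     # undo the sorting permutation
    return out
```

For instance, kl_project_capped_simplex(np.array([0.7, 0.1, 0.1, 0.1]), b=0.4) clips the first coordinate to 0.4 and rescales the rest to [0.2, 0.2, 0.2], which among the candidate configurations attains the smallest KL divergence to the input.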

About this article

Cite this article

Allen-Zhu, Z., Li, Y., Singh, A. et al. Near-optimal discrete optimization for experimental design: a regret minimization approach. Math. Program. 186, 439–478 (2021). https://doi.org/10.1007/s10107-019-01464-2


  • DOI: https://doi.org/10.1007/s10107-019-01464-2
