The experimental design problem concerns the selection of k points from a potentially large design pool of p-dimensional vectors, so as to maximize the statistical efficiency regressed on the selected k design points. Statistical efficiency is measured by optimality criteria, including A(verage), D(eterminant), T(race), E(igen), V(ariance) and G-optimality. Except for the T-optimality, exact optimization is challenging, and for certain instances of D/E-optimality exact or even approximate optimization is proven to be NP-hard. We propose a polynomial-time regret minimization framework to achieve a \((1+\varepsilon )\) approximation with only \(O(p/\varepsilon ^2)\) design points, for all the optimality criteria above. In contrast, to the best of our knowledge, before our work, no polynomial-time algorithm achieves \((1+\varepsilon )\) approximations for D/E/G-optimality, and the best poly-time algorithm achieving \((1+\varepsilon )\)-approximation for A/V-optimality requires \(k=\varOmega (p^2/\varepsilon )\) design points.
In the worst case, even the exact minimum \(\min _{|S|\le k} f( X_S^\top X_S)\) can indeed be O(n/k) times larger than \(f( X^\top X)\) [8]. This worst-case scenario may not always happen, but to the best of our knowledge, their proof is tight in this worst case.
Indeed, \(c_t\) is at most \(c_u = \sqrt{p}\) because \(\mathrm {tr}[(\sqrt{p} I+\alpha Z)^{-2}] \le \mathrm {tr}[(\sqrt{p} I)^{-2}] = 1\) for \(Z\succeq 0\), and \(c_t\) must lie strictly above \(c_\ell = -\alpha \lambda _{\min }(Z)\) because \(\mathrm {tr}[(c I+\alpha Z)^{-2}]\rightarrow +\infty \) as \(c\rightarrow c_\ell ^+\). One can show (see the proof of Claim 2.4) that the value \(\mathrm {tr}[(c I+\alpha Z)^{-2}]\) is a continuous and strictly decreasing function of \(c \in (c_\ell , c_u]\).
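Since the trace is monotone in \(c\) on \((c_\ell , c_u]\), the constant can be located by bisection. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the eigenvalues of \(Z\) are already available as a list `eigs`, and the names `trace_inv_sq` and `find_c` are hypothetical.

```python
def trace_inv_sq(c, alpha, eigs):
    """tr[(c I + alpha Z)^{-2}], computed from the eigenvalues of Z."""
    return sum((c + alpha * lam) ** -2 for lam in eigs)


def find_c(alpha, eigs, tol=1e-12, max_iter=200):
    """Bisection for the c in (c_lo, c_hi] with trace_inv_sq(c) = 1.

    Assumes Z is PSD (all eigs >= 0), so that the trace at c_hi = sqrt(p)
    is at most 1, while it blows up as c approaches c_lo from above.
    """
    p = len(eigs)
    lo = -alpha * min(eigs)  # c_ell
    hi = p ** 0.5            # c_u
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if trace_inv_sq(mid, alpha, eigs) > 1.0:
            lo = mid  # trace too large, so c must increase
        else:
            hi = mid  # trace at most 1, so c can decrease
        if hi - lo < tol:
            break
    return hi


# Toy spectrum, chosen only for illustration.
alpha = 1.0
eigs = [0.0, 0.5, 2.0]
c = find_c(alpha, eigs)
```

Each bisection step halves the search interval, so the number of trace evaluations grows only logarithmically in the target accuracy.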
A careful stability analysis for the \(\ell _{1/2}\) strategy has already appeared in the appendix of [4].
\((A+UCV)^{-1} = A^{-1} - A^{-1} U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}\), provided that all inverses exist.
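As a sanity check, the identity can be verified numerically on a tiny instance. The sketch below is illustrative only: it uses a \(2\times 2\) matrix \(A\) with a rank-one update (in which case the identity reduces to the Sherman-Morrison formula), and the helpers `matmul`, `add` and `inv` are hypothetical pure-Python utilities written for this example.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def add(X, Y, sign=1.0):
    """Entrywise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def inv(X):
    """Inverse of a 1x1 or 2x2 matrix (enough for this toy check)."""
    if len(X) == 1:
        return [[1.0 / X[0][0]]]
    (a, b), (c, d) = X
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Toy data, chosen so that every inverse below exists.
A = [[4.0, 1.0], [1.0, 3.0]]
U = [[1.0], [2.0]]
C = [[0.5]]
V = [[1.0, -1.0]]

# Left-hand side: (A + U C V)^{-1}.
lhs = inv(add(A, matmul(matmul(U, C), V)))

# Right-hand side: A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}.
A_inv = inv(A)
core = inv(add(inv(C), matmul(matmul(V, A_inv), U)))
correction = matmul(matmul(matmul(matmul(A_inv, U), core), V), A_inv)
rhs = add(A_inv, correction, sign=-1.0)

max_gap = max(abs(l - r) for rl, rr in zip(lhs, rhs) for l, r in zip(rl, rr))
```

The two sides should agree up to floating-point rounding, so `max_gap` is expected to be tiny.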
\(|x_1|+\cdots +|x_d|\le \sqrt{d}\cdot \sqrt{x_1^2+\cdots +x_d^2}\) for any sequence of d real numbers \(x_1,\dots ,x_d\).
The ellipsoid method runs in polynomial time as long as (1) the domain is polynomially bounded (which is the case since the probability simplex is bounded) and (2) the separation oracle can be implemented in polynomial time (which is the case for all the optimality criteria that we study in this paper).
A similar landscape also appears in stochastic gradient methods. For instance, for the ridge regression problem, although interior-point or ellipsoid methods give polynomial-time algorithms, in practice one still prefers stochastic gradient methods such as SDCA [41], which depend on the properties of the covariance matrix.
Of course, if the current solution \({\hat{s}}\) leads to \(\sum _{i}{\hat{s}}_ix_ix_i^\top \) that is not full rank, then we always continue to the next iteration.
In backtracking line search, for every iteration t a preliminary step size of \(\eta _t=1\) is used and the step size is repeatedly halved until the Armijo-Goldstein condition \(f(\omega ^{(t+1)})\le f(\omega ^{(t)}) + 0.5\langle g^{(t)}, \omega ^{(t+1)}-\omega ^{(t)}\rangle \) is satisfied, where \(\omega ^{(t+1)}\) is the (projected) next step under step size \(\eta _t\).
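The halving rule can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: the objective, its gradient and the projection are passed in as callables, the toy problem (a quadratic over the box \([0,1]^2\) with a clipping projection) is hypothetical, and the name `backtracking_step` is introduced only for this example.

```python
def backtracking_step(f, grad, project, w, eta0=1.0, c=0.5, max_halvings=60):
    """One projected-gradient step with step-size halving (Armijo-type test)."""
    g = grad(w)
    f_w = f(w)
    eta = eta0
    for _ in range(max_halvings):
        w_next = project([wi - eta * gi for wi, gi in zip(w, g)])
        # <g, w_next - w>: the predicted first-order change in f.
        predicted = sum(gi * (ni - wi) for gi, ni, wi in zip(g, w_next, w))
        if f(w_next) <= f_w + c * predicted:
            return w_next, eta
        eta *= 0.5
    return w, 0.0  # no acceptable step found; stay at the current point


# Hypothetical toy problem: a quadratic over the box [0, 1]^2.
target = [0.3, 0.7]
f = lambda w: (w[0] - target[0]) ** 2 + 10.0 * (w[1] - target[1]) ** 2
grad = lambda w: [2.0 * (w[0] - target[0]), 20.0 * (w[1] - target[1])]
project = lambda w: [min(1.0, max(0.0, wi)) for wi in w]

w = [1.0, 0.0]
f_start = f(w)
w, first_eta = backtracking_step(f, grad, project, w)
f_after_one = f(w)
for _ in range(200):
    w, eta = backtracking_step(f, grad, project, w)
f_final = f(w)
```

Because every accepted step satisfies the sufficient-decrease test, the objective value never increases from one iteration to the next.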
Appendix A: Missing proofs
A.1 Proof of Claim 2.4
Claim 2.4
(closed form \(\ell _{1/2}\) strategy) Assume without loss of generality that \( A_0=(c_0 I+\alpha Z_0)^{-2}\) for some \(c_0\in {\mathbb {R}}\), \(\alpha >0\) and positive semi-definite matrix \( Z_0\) such that \(c_0 I+\alpha Z_0\succ 0\). Then, \( A_t=(c_t I+\alpha Z_0+\alpha \sum _{\ell =0}^{t-1}{ F_\ell })^{-2}\),
where \(c_t\in {\mathbb {R}}\) is the unique constant so that \(c_t I+\alpha Z_0 + \alpha \sum _{\ell =0}^{t-1}{ F_\ell } \succ 0\) and \(\mathrm {tr}( A_t)=1\).
Proof of Claim 2.4
We first show that for any symmetric matrix \(Z\in {\mathbb {R}}^{p\times p}\), there exists unique \(c\in {\mathbb {R}}\) such that \(\alpha Z+c I\succ 0\) and \(\mathrm {tr}[(\alpha Z+cI)^{-2}]=1\). By simple asymptotic analysis, \(\lim _{c\rightarrow (-\alpha \lambda _{\min }(Z))^+}\mathrm {tr}[(\alpha Z+cI)^{-2}]=+\infty \) and \(\lim _{c\rightarrow +\infty }\mathrm {tr}[(\alpha Z+cI)^{-2}]=0\). Because \(\mathrm {tr}[(\alpha Z+cI)^{-2}]\) is a continuous and strictly decreasing function in c on the open interval \((-\alpha \lambda _{\min }(Z), +\infty )\), we conclude that there must exist a unique c such that \(\mathrm {tr}[(\alpha Z+cI)^{-2}]=1\). The range of c also ensures that \(\alpha Z+cI\succ 0\).
We now prove the claim by induction. For \(t=0\) the claim holds by the assumption on \( A_0\). We now assume that the claim holds for \( A_{t-1}\) (i.e., \( A_{t-1}=(c_{t-1} I+\alpha Z_0+\alpha \sum _{\ell =0}^{t-2}{ F_\ell })^{-2}\)) for some \(t\ge 1\), and prove the same for \( A_t\).
The KKT conditions of the optimization problem and the gradient of the Bregman divergence \(\varDelta _\psi \) yield Eq. (A.1), namely \(\nabla \psi ( A_t)=\nabla \psi ( A_{t-1})+d_t I-\alpha F_{t-1}\),
where the \(d_t I\) term arises from the Lagrangian multiplier and \(d_t\in {\mathbb {R}}\) is the unique number that makes \(-\nabla \psi ( A_{t-1})+\alpha F_{t-1}-d_t I\succeq 0\) (because \(\nabla \psi ( A_t)\preceq 0\)) and \(\mathrm {tr}( A_t)=\mathrm {tr}((\nabla \psi )^{-1}(\nabla \psi ( A_{t-1})+d_t I-\alpha F_{t-1}))=1\). Re-organizing terms in Eq. (A.1) and invoking the induction hypothesis we have \(\nabla \psi ( A_t)=-(c_{t-1}-d_t) I-\alpha Z_0-\alpha \sum _{\ell =0}^{t-1}{ F_\ell }\).
Because \(d_t\) is the unique number that ensures \( A_t\succeq 0\) and \(\mathrm {tr}( A_t)=1\), and \( Z_0+\sum _{\ell =0}^{t-1}{ F_\ell }\succeq 0\), it must hold that \(-c_{t-1}+d_t=-c_t\). Subsequently, \(\nabla \psi ( A_t)=-A_t^{-1/2}=-c_t I-\alpha Z_0-\alpha \sum _{\ell =0}^{t-1}{ F_\ell }\). The claim is thus proved by raising both sides of the identity to the power of \(-2\). \(\square \)
A.2 Proof of Claim 2.13
Claim 2.13
Suppose \(P_t^\top A_t^{1/2}P_t=[b\;\; d; d \;\;c]\in {\mathbb {R}}^{2\times 2}\) and \(2\alpha \langle A_t^{1/2}, v_tv_t^\top \rangle < 1\). Then
Proof of Claim 2.13
Define \(R=\left[ \begin{array}{cc} b&{} -d\\ -d&{} c\end{array}\right] \). Because \(P_t^\top A_t^{1/2}P_t = \left[ \begin{array}{cc} b&{} d\\ d&{} c\end{array}\right] \) is positive semi-definite, we conclude that R is also positive semi-definite and hence can be written as \(R=QQ^\top \). To prove Claim 2.13, we only need to establish the positive semi-definiteness of the following difference matrix:
Here in the last equality we again use the Woodbury matrix identity. It is clear that to prove the positive semi-definiteness of the right-hand side of the above equality, it suffices to show \(Q^\top (J+\mathrm {diag}(2b,2c))^{-1}Q\prec I\). By standard matrix analysis and the fact that \(J=\mathrm {diag}(1,-1)\),
Some steps in the above derivation require additional explanation. In (a), we use the fact that \(2c=2\alpha \langle A_t^{1/2},v_tv_t^\top \rangle <1\), and hence \((1-2c)^{-1}>0\); in (b), we use the fact that \(QQ^\top = \left[ \begin{array}{cc} b&{} -d\\ -d&{} c\end{array}\right] \preceq \left[ \begin{array}{cc} 2b&{} 0\\ 0&{} 2c\end{array}\right] \); finally, (c) holds because \(b=\alpha \langle A_t^{1/2},u_tu_t^\top \rangle \ge 0\), \(\frac{2b}{1+2b}<1\) and \(2c<1\). The proof of Claim 2.13 is thus completed. \(\square \)
A.3 Proof of Claim 2.14
Claim 2.14
Suppose \(Z\succeq 0\) is a p-dimensional PSD matrix with \(\lambda _{\min }(Z)\le 1\). Let \(A=(\alpha Z+cI)^{-2}\), where \(c\in {\mathbb {R}}\) is the unique real number such that \(A\succeq 0\) and \(\mathrm {tr}(A)=1\). Then
\(\alpha \langle A^{1/2}, Z\rangle \le p+\alpha \sqrt{p}\);
\(\langle A, Z\rangle \le \sqrt{p}/\alpha + \lambda _{\min }(Z)\).
Proof of Claim 2.14
For any orthogonal matrix U, the transform \(Z\mapsto UZU^\top \) leads to \(A\mapsto U AU^\top \) and \(A^{1/2}\mapsto UA^{1/2}U^\top \); thus both inner products are invariant under orthogonal transformations of Z. Therefore, we may assume without loss of generality that \(Z=\mathrm {diag}(\lambda _1,\dots ,\lambda _p)\) with \(\lambda _1\ge \cdots \ge \lambda _p\ge 0\), because \(Z\succeq 0\). Subsequently,
If \(c\ge 0\), then \(\alpha \langle A^{1/2},Z\rangle \le p\) and the first property is clearly true. For the case of \(c<0\), note that c must be strictly larger than \(-\alpha \lambda _p\), as we established in Claim 2.4. Subsequently, by the Cauchy-Schwarz inequality,
Because \(\lambda _p=\lambda _{\min }(Z)\le 1\) and \(\mathrm {tr}(A)=\mathrm {tr}[(\alpha Z+cI)^{-2}]=1\), we have that \(c\ge -\alpha \) and \(\sqrt{\sum _{i=1}^p{(\alpha \lambda _i+c)^{-2}}} = 1\). Therefore, \(\alpha \langle A^{1/2},Z\rangle \le p+\alpha \sqrt{p}\), which establishes the first property in Claim 2.14.
We next turn to the second property. Using similar analysis, we have
Property 2 is then proved by noting that \(c>-\alpha \lambda _{\min }(Z)\). \(\square \)
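One way to carry out the computation behind the second property is sketched below. It is a reconstruction from the surrounding definitions (diagonal \(Z\) with eigenvalues \(\lambda _i\), \(\mathrm {tr}(A)=1\), the Cauchy-Schwarz inequality, and \(c>-\alpha \lambda _{\min }(Z)\) from the proof of Claim 2.4), and it may differ in form from the original display.

```latex
\begin{align*}
\langle A, Z\rangle
  &= \sum_{i=1}^p \frac{\lambda_i}{(\alpha\lambda_i+c)^2}
   = \frac{1}{\alpha}\sum_{i=1}^p \frac{(\alpha\lambda_i+c)-c}{(\alpha\lambda_i+c)^2} \\
  &= \frac{1}{\alpha}\Bigl(\sum_{i=1}^p \frac{1}{\alpha\lambda_i+c} - c\,\mathrm{tr}(A)\Bigr)
   \le \frac{1}{\alpha}\Bigl(\sqrt{p}\,\sqrt{\sum_{i=1}^p (\alpha\lambda_i+c)^{-2}} - c\Bigr) \\
  &= \frac{\sqrt{p}-c}{\alpha}
   < \frac{\sqrt{p}}{\alpha} + \lambda_{\min}(Z).
\end{align*}
```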
A.4 Proof of correctness of Algorithm 4
Recall the KL divergence function \(\mathrm {KL}(y\Vert \omega ):= \sum _i y_i \log \frac{y_i}{\omega _i}\).
Claim A.1
The output of Algorithm 4 is exactly the projection \(\omega '=\arg \min _{y\in \varDelta _n,y_i\le b} \mathrm {KL}(y\Vert \omega )\) in KL divergence, provided that the input \(\omega \) itself is in the probability simplex \(\varDelta _n\).
We first show that if \(\omega _i>\omega _j\) then \(\omega _i'\ge \omega _j'\). Assume to the contrary that \(\omega _i>\omega _j\) and \(\omega _i'<\omega _j'\). Define \(\omega ''=\omega '\) except that the \(i\hbox {th}\) and \(j\hbox {th}\) components are swapped; that is, \(\omega _i''=\omega _j'\) and \(\omega _j''=\omega _i'\). It is clear that \(\omega ''\in \varDelta _n\) and satisfies the box constraint \(\Vert \omega ''\Vert _{\infty }\le b\). In addition, \(\mathrm {KL}(\omega ''\Vert \omega )-\mathrm {KL}(\omega '\Vert \omega )=(\omega _i'-\omega _j')\log (\omega _i/\omega _j)<0\),
which violates the optimality of \(\omega '\).
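We next consider the Lagrangian of the constrained optimization problem, \({\mathcal {L}}(y;\mu ,\lambda )=\sum _i y_i\log \frac{y_i}{\omega _i}+\mu \left( \sum _i y_i-1\right) +\sum _i\lambda _i(y_i-b)\), with multipliers \(\mu \in {\mathbb {R}}\) and \(\lambda _i\ge 0\).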
By the KKT conditions, we have that \(\partial {\mathcal {L}}(\omega ')/\partial y_i = \log (\omega _i'/\omega _i) + (1+\mu +\lambda _i) = 0\). By complementary slackness, if \(\omega _i'<b\) then \(\lambda _i=0\) and hence there exists a unique \(C>0\) such that \(\omega _i'=C\cdot \omega _i\) holds for all i with \(\omega _i'<b\). Combining this fact with the monotonic correspondence between \(\omega '\) and \(\omega \), one only needs to search for the exact number of components in \(\omega '\) that are equal to b, and compute the unique constant C and the remaining coordinates. Since there are at most n such possibilities, by a linear scan over the choices we can choose the solution that gives rise to the minimum KL divergence. This is exactly what is computed in Algorithm 4: we have an O(n)-time linear scan over the parameter q (meaning that there are exactly \(q-1\) components that are equal to b), preceded by an \(O(n \log n)\)-time pre-processing step to sort the coordinates of \(\omega \). \(\square \)
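The sort-then-scan structure described in this proof can be sketched as follows. This is an illustrative reconstruction from the argument above, not a transcription of Algorithm 4: the function name `kl_project_capped_simplex` is hypothetical, the variable `m` plays the role of \(q-1\), and the sketch assumes that every \(\omega _i\) is strictly positive, that \(\omega \) sums to one, and that \(nb\ge 1\).

```python
import math


def kl_project_capped_simplex(w, b):
    """KL projection of w onto {y in simplex : y_i <= b} by sort and scan."""
    n = len(w)
    order = sorted(range(n), key=lambda i: -w[i])  # indices, largest first
    ws = [w[i] for i in order]

    # suffix[m] = sum of ws[m:], prefix_log[m] = sum of log(ws[:m]).
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + ws[i]
    prefix_log = [0.0] * (n + 1)
    for i in range(n):
        prefix_log[i + 1] = prefix_log[i] + math.log(ws[i])

    best = None
    for m in range(n):                # m = number of coordinates capped at b
        rest = 1.0 - m * b            # mass left for the uncapped coordinates
        if rest < 0:
            break
        scale = rest / suffix[m]      # the constant C for this candidate
        if scale * ws[m] > b * (1 + 1e-12):
            continue                  # largest uncapped coordinate exceeds b
        # KL value of the candidate, evaluated in O(1) from the running sums.
        kl = m * b * math.log(b) - b * prefix_log[m]
        if rest > 0:
            kl += rest * math.log(scale)
        if best is None or kl < best[0]:
            best = (kl, m, scale)
    if best is None:
        raise ValueError("no feasible candidate; check that n * b >= 1")

    _, m, scale = best
    y = [0.0] * n
    for rank, i in enumerate(order):
        y[i] = b if rank < m else scale * w[i]
    return y


# Toy input, chosen only for illustration.
y = kl_project_capped_simplex([0.6, 0.3, 0.1], b=0.5)
```

In this toy input the largest coordinate is capped at \(b\) and the other two are rescaled by a common constant, matching the structure \(\omega _i'=C\cdot \omega _i\) derived above.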
