## 1 Introduction

Over the past decade, considerable attention has been devoted to finding high-quality solutions to low-rank optimization problems, resulting in theoretically and practically efficient algorithms for problems as disparate as matrix completion, reduced-rank regression, and computer vision. In spite of this progress, strong lower bounds for low-rank problems remain largely undeveloped. Accordingly, this paper proposes a procedure for obtaining novel and strong lower bounds.

We consider the following low-rank optimization problem:

\begin{aligned} \min _{\varvec{X} \in {\mathcal {S}}^n_+} \quad&\langle \varvec{C}, \varvec{X} \rangle +\Omega (\varvec{X})+\mu \cdot \textrm{Rank}(\varvec{X})\\ \text {s.t.} \quad&\langle \varvec{A}_i, \varvec{X}\rangle =b_i \ \forall i \in [m], \quad \varvec{X} \in {\mathcal {K}},\quad \textrm{Rank}(\varvec{X}) \le k, \end{aligned}
(1)

where $$\varvec{C}, \varvec{A}_{1}, \ldots , \varvec{A}_m \in {\mathcal {S}}^n$$ are $$n \times n$$ symmetric matrices, $$b_1, \ldots , b_m \in \mathbb {R}$$ are scalars, [n] denotes the set of running indices $$\{1, \ldots , n\}$$, $${\mathcal {S}}^n_+$$ denotes the $$n \times n$$ positive semidefinite cone, and $$\mu \in \mathbb {R}_+, k \in \mathbb {N}$$ are parameters which control the complexity of $$\varvec{X}$$ by respectively penalizing and constraining its rank. The set $${\mathcal {K}}$$ is a proper (i.e., closed, convex, solid, and pointed) cone (cf. [15, Section 2.4.1]), and $$\Omega (\varvec{X})=\textrm{tr}(f(\varvec{X}))$$ for some matrix convex function f; see formal definitions and assumptions in Sect. 3.

For optimization problems with logical constraints, strong relaxations can be obtained by formulating them as mixed-integer optimization (MIO) problems and applying the so-called perspective reformulation technique (see [37, 42]). In this paper, we develop a matrix analog of the perspective reformulation technique to obtain strong yet computationally tractable relaxations of low-rank optimization problems of the form (1).

### 1.1 Motivating example

In this section, we illustrate our results on a statistical learning example. To emphasize the analogy with the perspective reformulation technique in MIO, we first consider the best subset selection problem and review its perspective relaxations. We then consider a reduced-rank regression problem—the rank-analog of best subset selection—and provide new relaxations that naturally arise from our Matrix Perspective Reformulation Technique (MPRT).

Best Subset Selection: Given a data matrix $$\varvec{X} \in \mathbb {R}^{n \times p}$$ and a response vector $$\varvec{y} \in \mathbb {R}^{n}$$, the $$\ell _0-\ell _2$$ regularized best subset selection problem is to solve (c.f. [3, 6, 7, 9, 65, 78]):

\begin{aligned} \min _{\varvec{w} \in \mathbb {R}^p} \quad&\frac{1}{2n}\Vert \varvec{y}-\varvec{X}\varvec{w}\Vert _2^2 +\frac{1}{2\gamma } \Vert \varvec{w}\Vert _2^2 +\mu \Vert \varvec{w}\Vert _0, \end{aligned}
(2)

where $$\mu , \gamma >0$$ are parameters which control $$\varvec{w}$$’s sparsity and sensitivity to noise respectively.

Early attempts at solving Problem (2) exactly relied upon weak implicit or big-M formulations of logical constraints which supply low-quality relaxations and therefore do not scale well (see [14, 44], for discussions). However, very similar algorithms now solve these problems to certifiable optimality with millions of features. Perhaps the key ingredient in modernizing these (previously inefficient) algorithms was invoking the perspective reformulation technique—a technique for obtaining high-quality convex relaxations of non-convex sets—first stated in Stubbs’ PhD thesis [73] (see also [21, 74]) and popularized by [1, 37, 42] among others.

Relaxation via the Perspective Reformulation Technique: By applying the perspective reformulation technique [1, 37, 42] to the term $$\mu \Vert \varvec{w}\Vert _0+\frac{1}{2\gamma } \Vert \varvec{w}\Vert _2^2$$, we obtain the following reformulation:

\begin{aligned} \min _{\varvec{w}, \varvec{\rho }\in \mathbb {R}^p, \varvec{z} \in \{0, 1\}^p} \quad&\frac{1}{2n}\Vert \varvec{y}-\varvec{X}\varvec{w}\Vert _2^2 +\frac{1}{2\gamma } \varvec{e}^\top \varvec{\rho } +\mu \cdot \varvec{e}^\top \varvec{z} \quad \text {s.t.} \quad z_i \rho _i \ge w_i^2 \quad \forall i \in [p], \end{aligned}
(3)

where $$\varvec{e}$$ denotes a vector of all ones of appropriate dimension.
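To build intuition for why (3) is equivalent to (2) when $$\varvec{z}$$ is binary, and why its continuous relaxation is strong, note that minimizing over $$\rho _i$$ subject to $$z_i \rho _i \ge w_i^2$$ yields the perspective value $$w_i^2/z_i$$. A minimal numpy sketch of this behavior (the helper name is ours, for illustration only):

```python
import numpy as np

def perspective_term(w, z):
    """min over rho of rho subject to z * rho >= w**2:
    equals w**2 / z for z > 0, 0 if w = z = 0, +inf otherwise."""
    if z > 0:
        return w**2 / z
    return 0.0 if w == 0 else np.inf

# For binary z, the perspective term reproduces the logical constraint:
assert perspective_term(3.0, 1) == 9.0     # z = 1: plain quadratic
assert perspective_term(0.0, 0) == 0.0     # z = 0 forces w = 0
assert perspective_term(3.0, 0) == np.inf  # z = 0, w != 0: infeasible

# For fractional z, w**2 / z >= w**2, so the continuous relaxation of (3)
# is tighter than the relaxation that simply drops the binary variables.
assert perspective_term(3.0, 0.5) == 18.0
```

In particular, relaxing $$\varvec{z} \in \{0,1\}^p$$ to $$\varvec{z} \in [0,1]^p$$ in (3) still penalizes fractional support indicators, which is the source of the relaxation's strength.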

Interestingly, this formulation can be represented using second-order cones [42, 65] and optimized over efficiently using projected subgradient descent [9]. Moreover, it reliably supplies near-exact relaxations for most practically relevant cases of best subset selection [6, 65]. In instances where it is not already tight, one can apply a refinement of the perspective reformulation technique to the term $$\Vert \varvec{y}-\varvec{X}\varvec{w}\Vert _2^2$$ and thereby obtain the following (tighter yet more expensive) relaxation [26]:

\begin{aligned} \min _{\varvec{w}\in \mathbb {R}^p, \varvec{z} \in [0, 1]^p, \varvec{W} \in {\mathcal {S}}^p_+} \quad&\frac{1}{2n}\Vert \varvec{y}\Vert _2^2-\frac{1}{n}\langle \varvec{y}, \varvec{X}\varvec{w}\rangle +\frac{1}{2}\left\langle \varvec{W}, \frac{1}{\gamma } \mathbb {I}+\frac{1}{n}\varvec{X}^\top \varvec{X}\right\rangle +\mu \cdot \varvec{e}^\top \varvec{z} \\ \text {s.t.} \quad&\varvec{W} \succeq \varvec{w}\varvec{w}^\top , \ z_i W_{i,i} \ge w_i^2 \ \forall i \in [p]. \end{aligned}
(4)

Recently, a class of even tighter relaxations was developed by Atamtürk and Gómez [3, 39, 43]. As these relaxations were derived by considering multiple binary variables simultaneously, they do not, to our knowledge, generalize readily to the low-rank case (where we often have a single low-rank matrix), and we therefore do not discuss (nor generalize) them here.

Reduced Rank Regression: Given m observations of a response vector $$\varvec{Y}_j \in \mathbb {R}^n$$ and a predictor $$\varvec{X}_j \in \mathbb {R}^p$$, an important problem in high-dimensional statistics is to recover a low-complexity model which relates $$\varvec{X}$$ and $$\varvec{Y}$$. A popular choice for doing so is to assume that $$\varvec{X}, \varvec{Y}$$ are related via $$\varvec{Y}=\varvec{X}\varvec{\beta }+\varvec{E}$$, where $$\varvec{\beta } \in \mathbb {R}^{p \times n}$$ is a coefficient matrix which we require to be low-rank in order that the linear model is parsimonious [57], and $$\varvec{E}$$ is a matrix of noise. Introducing Frobenius regularization gives rise to the problem:

\begin{aligned} \min _{\varvec{\beta } \in \mathbb {R}^{p \times n}} \quad \frac{1}{2m}\Vert \varvec{Y}-\varvec{X}\varvec{\beta }\Vert _F^2+\frac{1}{2\gamma }\Vert \varvec{\beta }\Vert _F^2+\mu \cdot \textrm{Rank}( \varvec{\beta }), \end{aligned}
(5)

where $$\gamma , \mu >0$$ control the robustness to noise and the complexity of the estimator respectively, and we normalize the ordinary least squares loss by dividing by m, the number of observations.

Existing attempts at solving this problem generally replace the low-rank term with a nuclear norm penalty [57], which succeeds under strong assumptions on the problem data but not in general. Recently, we proposed a new framework to model rank constraints, using orthogonal projection matrices which satisfy $$\varvec{Y}^2=\varvec{Y}$$ instead of binary variables which satisfy $$z^2=z$$ [11]. Building on this work, we propose in this paper a generalization of the perspective function to matrix-valued functions with positive semidefinite arguments, and develop a matrix analog of the perspective reformulation technique from MIO which uses projection matrices instead of binary variables.

Relaxations via the Matrix Perspective Reformulation Technique: By applying the matrix perspective reformulation technique (Theorem 1) to the term $${\frac{1}{2\gamma }}\Vert \varvec{\beta }\Vert _F^2+\mu \cdot \textrm{Rank}(\varvec{\beta })$$, we will prove that the following problem is a valid—and numerically high-quality—relaxation of (5):

\begin{aligned} \min _{\varvec{\beta } \in \mathbb {R}^{p \times n}, \varvec{W} \in {\mathcal {S}}^n_+, \varvec{\theta } \in {\mathcal {S}}^p_+} \quad&\frac{1}{2m}\Vert \varvec{Y}-\varvec{X}\varvec{\beta }\Vert _F^2+\frac{1}{2\gamma } \textrm{tr}(\varvec{\theta })+\mu \cdot \textrm{tr}(\varvec{W})\\ \text {s.t.} \quad&\varvec{W} \preceq \mathbb {I}, \quad \left( \begin{array}{ll} \varvec{\theta } &{} \varvec{\beta } \\ \varvec{\beta }^\top &{} \varvec{W}\end{array}\right) \succeq \varvec{0}. \end{aligned}
(6)

The analogy between Problems (2) and (5) and their respective relaxations (3) and (6) is striking. The goal of the present paper is to develop the corresponding theory to support and derive the relaxation (6). Interestingly, the main argument that led [26] to the improved relaxation (4) of (2) extends to reduced-rank regression. Combined with our MPRT, it leads to the relaxation:

\begin{aligned} \min _{\varvec{\beta } \in \mathbb {R}^{p \times n}, \varvec{B} \in {\mathcal {S}}^p_+, \varvec{W} \in {\mathcal {S}}^n_+} \quad&\frac{1}{2m}\Vert \varvec{Y}\Vert _F^2-\frac{1}{m}\langle \varvec{Y}, \varvec{X}\varvec{\beta }\rangle +\frac{1}{2}\left\langle \varvec{B}, \frac{1}{\gamma }\mathbb {I}+\frac{1}{m}\varvec{X}^\top \varvec{X}\right\rangle +\mu \cdot \textrm{tr}(\varvec{W})\\ \text {s.t.} \quad&\left( \begin{array}{ll} \varvec{B} &{} \varvec{\beta }^\top \\ \varvec{\beta } &{} \varvec{W}\end{array}\right) \succeq \varvec{0}, \quad \varvec{W} \preceq \mathbb {I}. \end{aligned}
(7)

It is not too hard to see that this is a valid semidefinite relaxation: if $$\varvec{W}$$ is a rank-k projection matrix then, by the Schur complement lemma [16, Equation 2.41], $$\varvec{\beta }=\varvec{\beta }\varvec{W}$$, and thus the rank of $$\varvec{\beta }$$ is at most k. Moreover, if we let $$\varvec{B}=\varvec{\beta }\varvec{\beta }^\top$$ in a solution, we recover a low-rank solution to the original problem. Indeed, as we show in Sect. 3.3, a similar technique can be applied to any instance of Problem (1), for which the applications beyond matrix regression are legion.
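The feasibility argument behind relaxation (6) can be spot-checked numerically. The following numpy sketch (with arbitrary random data of our choosing) builds a rank-k matrix $$\varvec{\beta }$$, takes $$\varvec{W}$$ to be the orthogonal projection onto its row space and $$\varvec{\theta }=\varvec{\beta }\varvec{\beta }^\top$$, and verifies that this point is feasible for (6) while the terms $$\textrm{tr}(\varvec{\theta })$$ and $$\textrm{tr}(\varvec{W})$$ recover the original penalty terms:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 5, 4, 2

# A random p x n matrix of rank k.
beta = rng.standard_normal((p, k)) @ rng.standard_normal((k, n))

# W: orthogonal projection onto the row space of beta (an n x n matrix).
_, _, Vt = np.linalg.svd(beta)
W = Vt[:k].T @ Vt[:k]

# With theta = beta @ beta.T, the block matrix in (6) is PSD and the
# relaxation's objective terms match the original penalty terms.
theta = beta @ beta.T
M = np.block([[theta, beta], [beta.T, W]])

assert np.allclose(beta @ W, beta)               # beta = beta W
assert np.linalg.eigvalsh(M).min() >= -1e-8      # block matrix is PSD
assert np.linalg.eigvalsh(np.eye(n) - W).min() >= -1e-8  # W <= I
assert np.isclose(np.trace(W), k)                # tr(W) = Rank(beta)
assert np.isclose(np.trace(theta), np.linalg.norm(beta, 'fro') ** 2)
```

This is only a feasibility check for one structured point, not a proof that the relaxation is tight; the supporting theory is developed in Sect. 3.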

### 1.2 Literature review

Three classes of approaches have been proposed for solving Problem (1): (a) heuristics, which prioritize computational efficiency and typically obtain high-quality solutions to low-rank problems without optimality guarantees (see [59] for a review); (b) relax-and-round approaches, which balance computational efficiency and accuracy by relaxing the rank constraint and rounding a solution to the relaxation to obtain a provably near-optimal low-rank matrix [11, Section 1.2.1]; and (c) exact approaches, which prioritize accuracy over computational efficiency and solve Problem (1) exactly in exponential time [11, Section 1.2.1].

Of the three classes of approaches, heuristics currently dominate the literature, because their superior runtime and memory usage allows them to address larger-scale problems. However, recent advances in algorithmic theory and computational power have drastically improved the scalability of exact and approximate methods, to the point where they can now solve moderately sized problems which are relevant in practice [11]. Moreover, tight relaxations of strong exact formulations often give rise to very efficient heuristics which outperform existing ones. This suggests that heuristic approaches may not maintain their dominance going forward, and motivates the exploration of tight yet affordable relaxations of low-rank problems.

### 1.3 Contributions and structure

The main contributions of this paper are twofold. First, we propose a general reformulation technique for obtaining high-quality relaxations of low-rank optimization problems: introducing an orthogonal projection matrix to model a low-rank constraint, and strengthening the formulation by taking the matrix perspective of an appropriate substructure of the problem. This technique can be viewed as a generalization of the perspective reformulation technique for obtaining strong relaxations of sparse or logically constrained problems [10, 37, 42, 43]. Second, by applying this technique, we obtain explicit characterizations of convex hulls of low-rank sets which frequently arise in low-rank problems. As the interplay between convex hulls of indicator sets and perspective functions has enabled algorithms which outperform state-of-the-art heuristics in sparse linear regression [6, 44] and sparse portfolio selection [10, 80], we hope that this work will empower similar developments for low-rank problems.

The rest of the paper is structured as follows: In Sect. 2 we supply some background on perspective functions and review their role in developing tight formulations of mixed-integer problems. In Sect. 3, we introduce the matrix perspective function and its properties, extend the function’s definition to allow semidefinite in addition to positive definite arguments, and propose a matrix perspective reformulation technique (MPRT) which successfully obtains high-quality relaxations for low-rank problems which commonly arise in the literature. We also connect the matrix perspective function to the convex hulls of epigraphs of simple matrix convex functions under rank constraints. In Sect. 4, we illustrate the utility of this connection by deriving tighter relaxations of several low-rank problems than are currently available in the literature. Finally, in Sect. 5, we numerically verify the utility of our approach on reduced rank regression, D-optimal design and non-negative matrix factorization problems.

Notation: We let non-boldface characters such as b denote scalars, lowercase boldfaced characters such as $$\varvec{x}$$ denote vectors, uppercase boldfaced characters such as $$\varvec{X}$$ denote matrices, and calligraphic uppercase characters such as $${\mathcal {Z}}$$ denote sets. We let $$\mathbb {N}$$ denote the set of positive integers. We let $$\textbf{e}$$ denote a vector of all 1’s, $$\varvec{0}$$ denote a vector of all 0’s, and $$\mathbb {I}$$ denote the identity matrix. We let $${\mathcal {S}}^n_{+} \cap \mathbb {R}^{n \times n}_{+}$$ denote the cone of $$n \times n$$ doubly non-negative matrices, and $${\mathcal {C}}^n_+:=\{\varvec{U}\varvec{U}^\top : \varvec{U} \in \mathbb {R}^{n \times n}_{+}\}$$ denote the cone of $$n \times n$$ completely positive matrices. Finally, we let $$\varvec{X}^\dag$$ denote the Moore-Penrose pseudoinverse of a matrix $$\varvec{X}$$; see [13, 47] for general theories of matrix operators. Less common matrix operators will be defined as they are needed.

## 2 Background on Perspective Functions

In this section, we review perspective functions and their interplay with tight formulations of logically constrained problems. This prepares the ground for and motivates our study of matrix perspective functions and their interplay with tight formulations of low-rank problems. Many of our subsequent results can be viewed as (nontrivial) generalizations of the results in this section, since a rank constraint is a cardinality constraint on the singular values.

### 2.1 Preliminaries

Consider a proper closed convex function $$f : {\mathcal {X}} \rightarrow \mathbb {R}$$, where $${\mathcal {X}}$$ is a convex subset of $$\mathbb {R}^n$$. The perspective function of f is commonly defined for any $$\varvec{x} \in \mathbb {R}^n$$ and any $$t > 0$$ as $$(\varvec{x}, t) \mapsto t f(\varvec{x}/t)$$. Its closure is defined by continuity for $$t=0$$ and is equal to (cf. [46, Proposition IV.2.2.2]):

\begin{aligned} g_f(\varvec{x},t) = {\left\{ \begin{array}{ll} t f(\varvec{x} / t) &{} \text{ if } t > 0, \varvec{x}/t \in {\mathcal {X}}, \\ 0 &{} \text{ if } t = 0, \varvec{x} = 0, \\ f_\infty (\varvec{x}) &{} \text{ if } t = 0, \varvec{x} \ne 0, \\ +\infty &{} \text{ otherwise, } \end{array}\right. } \end{aligned}

where $$f_\infty$$ is the recession function of f, as originally stated in [68, p. 67], which is given by

\begin{aligned} f_\infty (\varvec{x}) = \lim _{t \rightarrow 0} t f\left( \varvec{x}_0 - \varvec{x} + \dfrac{\varvec{x}}{t}\right) = \lim _{t \rightarrow +\infty } \dfrac{f(\varvec{x}_0 + t \varvec{x}) - f(\varvec{x}_0)}{t}, \end{aligned}

for any $$\varvec{x}_0$$ in the domain of f. That is, $$f_\infty (\varvec{x})$$ is the asymptotic slope of f in the direction of $$\varvec{x}$$.
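The second limit above can be estimated numerically to build intuition for the recession function. The sketch below (the helper name and test functions are ours) shows that the asymptotic slope diverges for a superlinear function such as $$x^2$$, while it is finite for a function with linear growth such as $$\vert x \vert$$:

```python
import numpy as np

def asymptotic_slope(f, x, x0=0.0, t=1e8):
    """Numerical estimate of f_inf(x) = lim_{t -> inf} (f(x0 + t*x) - f(x0)) / t."""
    return (f(x0 + t * x) - f(x0)) / t

# f(x) = x**2 grows superlinearly: the slope estimate diverges as t grows.
assert asymptotic_slope(lambda x: x**2, 1.0) > 1e6

# f(x) = |x| has asymptotic slope |x| in the direction x.
assert np.isclose(asymptotic_slope(np.abs, -3.0), 3.0)
```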

The perspective function was first investigated by [68], who made the important observation that f is convex in $$\varvec{x}$$ if and only if $$g_f$$ is convex in $$(\varvec{x},t)$$. Among other properties, we have that, for any $$t>0$$, $$(\varvec{x},t,s) \in \textrm{epi}(g_f)$$ if and only if $$(\varvec{x}/t, s/t) \in \textrm{epi}(f)$$ [46, Proposition IV.2.2.1]. We refer to the review by [23] for further properties of perspective functions.

Throughout this work, we refer to $$g_f$$ as the perspective function of f, although it is technically the closure of the perspective. We also consider a family of convex functions f which satisfy:

### Assumption 1

The function $$f : {\mathcal {X}} \rightarrow \mathbb {R}$$ is proper, closed, and convex, $$\varvec{0} \in {\mathcal {X}}$$, and for any $$\varvec{x} \ne \varvec{0}$$, $$f_\infty (\varvec{x}) = + \infty$$.

The condition $$f_\infty (\varvec{x}) = +\infty , \forall \varvec{x} \ne \varvec{0}$$ is equivalent to $$\lim _{\Vert \varvec{x}\Vert \rightarrow \infty } {f(\varvec{x})}/{\Vert \varvec{x}\Vert } = +\infty ,$$ and means that, asymptotically, f increases to infinity faster than any affine function. In particular, it is satisfied if the domain of f is bounded or if f is strongly convex. Under Assumption 1, the definition of the perspective function of f simplifies to

\begin{aligned} g_f(\varvec{x},t) = {\left\{ \begin{array}{ll} t f(\varvec{x} / t) &{} \text{ if } t > 0, \\ 0 &{} \text{ if } t = 0, \varvec{x} = 0, \\ +\infty &{} \text{ otherwise. } \end{array}\right. } \end{aligned}
(8)
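Equation (8) is straightforward to implement for a scalar function f satisfying Assumption 1. The sketch below (with f the square function as a running example) also spot-checks the joint convexity of $$g_f$$ at one midpoint, consistent with the observation of [68]:

```python
import numpy as np

def perspective(f, x, t):
    """Closed perspective g_f(x, t) from (8), for f satisfying Assumption 1."""
    if t > 0:
        return t * f(x / t)
    if t == 0 and x == 0:
        return 0.0
    return np.inf

square = lambda x: x**2
assert perspective(square, 4.0, 2.0) == 8.0   # t * (x/t)**2 = x**2 / t
assert perspective(square, 0.0, 0.0) == 0.0
assert perspective(square, 1.0, 0.0) == np.inf

# Joint convexity in (x, t): spot check at the midpoint of two points.
a, b = (1.0, 1.0), (3.0, 0.5)
mid = perspective(square, 2.0, 0.75)
assert mid <= 0.5 * perspective(square, *a) + 0.5 * perspective(square, *b)
```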

### 2.2 The perspective reformulation technique

Since [21], building upon [68, Theorem 9.8], derived the convex hull of a disjunction of convex constraints, a number of authors have observed that optimization problems over binary and continuous variables admit tight reformulations involving perspective functions of appropriate substructures of the problem. To motivate our study of the matrix perspective function in the sequel, we now demonstrate that a class of logically-constrained problems admit reformulations in terms of perspective functions. We remark that this development bears resemblance to other works on perspective reformulations, including [10, 39, 43].

Consider a logically-constrained problem of the form

\begin{aligned} \min _{\varvec{z} \in {\mathcal {Z}}, {\varvec{x} \in \mathbb {R}^n}} \quad&\varvec{c}^\top \varvec{z} + f(\varvec{x}) + \Omega (\varvec{x}) \quad \text {s.t.} \quad x_i =0\ \text {if} \ z_i=0 \quad \forall i \in [n], \end{aligned}
(9)

where $${\mathcal {Z}} \subseteq \{0,1\}^n$$, $$\varvec{c} \in \mathbb {R}^n$$ is a cost vector, $$f(\cdot )$$ is a generic convex function which possibly models convex constraints $$\varvec{x} \in {\mathcal {X}}$$ for a convex set $${\mathcal {X}} \subseteq \mathbb {R}^n$$ implicitly (by requiring that $$f(\varvec{x})=+\infty$$ if $$\varvec{x} \notin {\mathcal {X}}$$), and $$\Omega (\cdot )$$ is a regularization function which satisfies the following assumption:

### Assumption 2

(Separability) $$\Omega (\varvec{x})=\sum _{i \in [n]} \Omega _i(x_i)$$, where each $$\Omega _i$$ satisfies Assumption 1.

Since $$z_i$$ is binary, imposing the logical constraint “$$x_i=0$$ if $$z_i=0$$” plus the term $$\Omega _i(x_i)$$ in the objective is equivalent to the term $$g_{\Omega _i}(x_i, z_i)+(1-z_i)\Omega _i(0)$$ in the objective, where $$g_{\Omega _i}$$ is the perspective function of $$\Omega _i$$, and thus Problem (9) is equivalent to:

\begin{aligned} \min _{\varvec{z} \in {\mathcal {Z}}, {\varvec{x} \in \mathbb {R}^n}} \quad&\varvec{c}^\top \varvec{z} + f(\varvec{x})+\sum _{i=1}^n \bigg (g_{\Omega _i}(x_i, z_i)+(1-z_i)\Omega _i(0)\bigg ). \end{aligned}
(10)

Notably, while Problems (9)–(10) have the same feasible regions, (10) often has substantially stronger relaxations, as frequently noted in the perspective reformulation literature [10, 36, 37, 42].

For completeness, we provide a formal proof of equivalence between (9) and (10); note that a related (although dual, and weaker as it requires $$\Omega (\varvec{0})=0$$) result can be found in [10, Theorem 2.5]:

### Lemma 1

Suppose (9) attains a finite optimal value. Then, (10) attains the same value.

### Proof

It suffices to establish that the following equality holds:

\begin{aligned} g_{\Omega _i}(x_i, z_i)+(1-z_i)\Omega _i(0)=\Omega _i(x_i)+{\left\{ \begin{array}{ll}0 &{} \text {if} \ x_i=0 \ \text {or} \ z_i=1,\\ +\infty &{} \text {otherwise.} \end{array}\right. } \end{aligned}

Indeed, this equality shows that any feasible solution to one problem is a feasible solution to the other with equal cost. We prove this by considering the cases where $$z_i=0$$, $$z_i=1$$ separately.

• Suppose $$z_i=1$$. Then, $$g_{\Omega _i}(x_i, z_i)=z_i \Omega _i(x_i/z_i) =\Omega _i(x_i)$$ and $$x_i=z_i\cdot x_i$$, so the result holds.

• Suppose $$z_i=0$$. If $$x_i=0$$ we have $$g_{\Omega _i}(0,0)+\Omega _i(0)=\Omega _i(0)$$, and the right-hand side of the equality is also $$\Omega _i(0)$$. Alternatively, if $$x_i \ne 0$$ then both sides equal $$+\infty$$. $$\square$$

In Table 1, we present examples of penalties $$\Omega$$ for which Assumption 1 holds and the perspective reformulation technique is applicable. We remind the reader that the exponential cone is (cf. [22]):

\begin{aligned} {\mathcal {K}}_{\text {exp}}=\{\varvec{x} \in \mathbb {R}^3: x_1 \ge x_2 \exp (x_3/x_2), x_2 >0\} \cup \{(x_1, 0, x_3) \in \mathbb {R}^3: x_1 \ge 0, x_3 \le 0\}, \end{aligned}

while the power cone is defined for any $$\alpha \in (0, 1)$$ as (cf. [22]):

\begin{aligned} {\mathcal {K}}_{\text {pow}}^{\alpha }=\left\{ \varvec{x} \in \mathbb {R}^3: x_1^\alpha x_2^{1-\alpha } \ge \vert x_3\vert ,\ x_1, x_2 \ge 0\right\} . \end{aligned}
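Membership in both cones can be checked directly from these definitions. A small sketch (the function names and numerical tolerance are ours; we use the standard orientation $$x_1 \ge x_2 \exp (x_3/x_2)$$ of the exponential cone):

```python
import numpy as np

def in_exp_cone(x1, x2, x3, tol=1e-9):
    """Membership test for the exponential cone K_exp."""
    if x2 > 0:
        return x1 >= x2 * np.exp(x3 / x2) - tol
    return x2 == 0 and x1 >= -tol and x3 <= tol

def in_power_cone(x1, x2, x3, alpha, tol=1e-9):
    """Membership test for the power cone K_pow^alpha, alpha in (0, 1)."""
    return x1 >= 0 and x2 >= 0 and x1**alpha * x2**(1 - alpha) >= abs(x3) - tol

# (e, 1, 1) lies on the boundary of K_exp: e >= 1 * exp(1 / 1).
assert in_exp_cone(np.e, 1.0, 1.0)
assert not in_exp_cone(1.0, 1.0, 1.0)            # 1 < e
assert in_power_cone(4.0, 1.0, 2.0, alpha=0.5)   # sqrt(4 * 1) = 2 >= |2|
assert not in_power_cone(1.0, 1.0, 2.0, alpha=0.5)
```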

### 2.3 Perspective cuts

Another computationally useful application of the perspective reformulation technique has been to derive a class of cutting-planes for MIOs with logical constraints [37]. To motivate our generalization of these cuts to low-rank problems, we now briefly summarize their main result.

Consider the following problem:

\begin{aligned} \min _{\varvec{z} \in {\mathcal {Z}}} \min _{\varvec{x} \in \mathbb {R}^n} \quad&\varvec{c}^\top \varvec{z}+f(\varvec{x})+\sum _{i=1}^n \Omega _i(x_i)\\ \text {s.t.} \quad&\varvec{A}^i x_i \le b_i z_i \quad \forall i \in [n], \end{aligned}
(11)

where $$\{x_i: \varvec{A}^i x_i \le 0\}=\{0\}$$, which implies the set of feasible $$\varvec{x}$$ is bounded, $$\Omega _i(x_i)$$ is a closed convex function, we take $$\Omega _i(0)=0$$ as in [37] for simplicity, and $$f(\varvec{x})$$ is a convex function. Then, letting $$\rho _i$$ model the epigraph of $$\Omega _i(x_i)+c_i z_i$$ and $$s_i$$ be a subgradient of $$\Omega _i$$ at $$\bar{x}_i$$, i.e., $$s_i \in \partial \Omega _i(\bar{x}_i)$$, we have the following result [37, 42]:

### Proposition 1

The following cut

\begin{aligned} \rho _i \ge (c_i+\Omega _i(\bar{x}_i))z_i+s_i(x_i-\bar{x}_i z_i) \end{aligned}
(12)

is valid for the equivalent MINLO:

\begin{aligned} \min _{\varvec{z} \in {\mathcal {Z}}} \min _{\varvec{x}, \varvec{\rho } \in \mathbb {R}^n} \quad&f(\varvec{x})+\sum _{i=1}^n \rho _i\\ \text {s.t.} \quad&\varvec{A}^i x_i \le b_i z_i \quad \forall i \in [n],\\&\rho _i \ge \Omega _i(x_i)+c_i z_i \quad \forall i \in [n]. \end{aligned}

### Remark 1

In the special case where $$\Omega _i(x_i)=x_i^2$$, the cut reduces to:

\begin{aligned} \rho _i \ge 2 x_i \bar{x}_i-\bar{x}_i^2 z_i+c_i z_i \quad \forall \bar{x}_i. \end{aligned}
(13)

The class of cutting planes defined in Proposition 1 is commonly referred to as perspective cuts, because they define a linear lower approximation of $$g_{\Omega _i}(x_i, z_i)$$, the perspective function of $$\Omega _i(x_i)$$. Consequently, Proposition 1 implies that a perspective reformulation of (11) is equivalent to adding all (infinitely many) perspective cuts (12). This may be helpful when the original problem is nonlinear, as a sequence of linear MIOs can be easier to solve than one nonlinear MIO (see [38] for a comparison).
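For the special case $$\Omega _i(x_i)=x_i^2$$ of Remark 1 (dropping the index i), the claim that the cuts (13) linearly underestimate the perspective $$x^2/z + cz$$, with equality at $$\bar{x} = x/z$$, can be verified numerically. A sketch with hypothetical random data:

```python
import numpy as np

def cut_value(x, z, x_bar, c=0.0):
    """Right-hand side of the perspective cut (13): 2*x*x_bar - x_bar**2*z + c*z."""
    return 2 * x * x_bar - x_bar**2 * z + c * z

def perspective_value(x, z, c=0.0):
    """g(x, z) = x**2 / z + c*z for z > 0, the function the cuts approximate."""
    return x**2 / z + c * z

rng = np.random.default_rng(1)
for _ in range(100):
    x, z = rng.standard_normal(), rng.uniform(0.1, 1.0)
    g = perspective_value(x, z)
    # Every cut lower-bounds the perspective value ...
    for x_bar in rng.standard_normal(20):
        assert g >= cut_value(x, z, x_bar) - 1e-9
    # ... and the cut generated at x_bar = x / z is tight.
    assert np.isclose(g, cut_value(x, z, x / z))
```

The validity of each cut follows from $$x^2/z - 2x\bar{x} + \bar{x}^2 z = (x/\sqrt{z} - \bar{x}\sqrt{z})^2 \ge 0$$, which the loop above merely samples.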

## 3 The Matrix Perspective Function and its Applications

In this section, we generalize the perspective function from vectors to matrices, and invoke the matrix perspective function to propose a new technique for generating strong yet efficient relaxations of a diverse family of low-rank problems, which we call the Matrix Perspective Reformulation Technique (MPRT). Selected background on matrix analysis (see [13] for a general theory) and semidefinite optimization (see [77] for a general theory) which we use throughout this section can be found in Appendix A.

### 3.1 A matrix perspective function

To generalize the ideas from the previous section to low-rank constraints, we require a more expressive transform than the perspective transform, which introduces a single (scalar) additional degree of freedom and cannot control the eigenvalues of a matrix. Therefore, we invoke a generalization from quantum mechanics: the matrix perspective function defined in [27, 28], building upon the work of [29]; see also [24, 54–56] for a related generalization of perspective functions to perspective functionals.

### Definition 1

For a matrix-valued function $$f: {\mathcal {X}} \rightarrow {\mathcal {S}}^n_+$$ where $${\mathcal {X}} \subseteq {\mathcal {S}}^n$$ is a convex set, the matrix perspective function of f, $$g_f$$, is defined as

\begin{aligned} g_f(\varvec{X},\varvec{Y}) = {\left\{ \begin{array}{ll} \varvec{Y}^\frac{1}{2} f\left( \varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}\right) \varvec{Y}^{\frac{1}{2}} &{} \text{ if } \varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}} \in {\mathcal {X}}, \varvec{Y} \succ \varvec{0}, \\ +\infty &{} \text{ otherwise. } \end{array}\right. } \end{aligned}

### Remark 2

If $$\varvec{X}$$ and $$\varvec{Y}$$ commute and f is analytic, then Definition 1 simplifies into $$\varvec{Y} f\left( \varvec{Y}^{-1}\varvec{X}\right)$$, which is the analog of the usual definition of the perspective function originally stated in [29]. Definition 1, however, generalizes this definition to the case where $$\varvec{X}$$ and $$\varvec{Y}$$ do not commute by ensuring that $$\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}$$ is nonetheless symmetric, in a manner reminiscent of the development of interior point methods (see, e.g., [2]). In particular, if $$\varvec{Y}$$ is a projection matrix such that $$\varvec{X}=\varvec{Y}\varvec{X}$$—as occurs for the exact formulations of the low-rank problems we consider in this paper—then it is safe to assume that $$\varvec{X}, \varvec{Y}$$ commute. However, when $$\varvec{Y}$$ is not a projection matrix, this cannot be assumed in general.

The matrix perspective function generalizes the definition of the perspective transformation to matrix-valued functions and satisfies analogous properties:

### Proposition 2

Let f be a matrix-valued function and $$g_f$$ its matrix perspective function. Then:

1. (a)

f is matrix convex, i.e.,

\begin{aligned} t f(\varvec{X})+(1-t) f(\varvec{W}) \succeq f(t \varvec{X}+(1-t)\varvec{W}) \quad \forall \varvec{X}, \varvec{W} \in {\mathcal {S}}^n, \ t \in [0,1], \end{aligned}
(14)

if and only if $$g_f$$ is matrix convex in $$(\varvec{X}, \varvec{Y})$$.

2. (b)

$$g_f$$ is a positive homogeneous function, i.e., for any $$\mu {>} 0$$ we have

\begin{aligned} g_f(\mu \varvec{X}, \mu \varvec{Y})=\mu g_f(\varvec{X}, \varvec{Y}). \end{aligned}
(15)
3. (c)

Let $$\varvec{Y} \succ \varvec{0}$$ be a positive definite matrix. Then, letting the epigraph of f be denoted by

\begin{aligned} \textrm{epi}(f):=\{(\varvec{X}, \varvec{\theta }): \varvec{X} \in \textrm{dom}(f), f(\varvec{X}) \preceq \varvec{\theta }\},\end{aligned}
(16)

we have $$(\varvec{X}, \varvec{Y}, \varvec{\theta }) \in \textrm{epi}(g_f)$$ if and only if $$(\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}, \varvec{Y}^{-\frac{1}{2}}\varvec{\theta }\varvec{Y}^{-\frac{1}{2}}) \in \textrm{epi}(f)$$.

### Proof

We prove the claims successively:

1. (a)

This is precisely the main result of [27, Theorem 2.2].

2. (b)

For $$\mu > 0$$, $$g_f(\mu \varvec{X}, \mu \varvec{Y})=\mu \varvec{Y}^{\frac{1}{2}} f\left( (\mu \varvec{Y})^{-\frac{1}{2}}\mu \varvec{X}(\mu \varvec{Y})^{-\frac{1}{2}}\right) \varvec{Y}^{\frac{1}{2}}=\mu g_f(\varvec{X}, \varvec{Y})$$.

3. (c)

By generalizing the main result in [15, Chapter 3.2.6], for any $$\varvec{Y} \succ \varvec{0}$$ we have that

\begin{aligned} (\varvec{X}, \varvec{Y}, \varvec{\theta }) \in \textrm{epi}(g_f)&\iff \varvec{Y}^\frac{1}{2}f(\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}})\varvec{Y}^\frac{1}{2}\preceq \varvec{\theta },\\&\iff f(\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}) \preceq \varvec{Y}^{-\frac{1}{2}}\varvec{\theta }\varvec{Y}^{-\frac{1}{2}},\\&\iff (\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}, \varvec{Y}^{-\frac{1}{2}}\varvec{\theta }\varvec{Y}^{-\frac{1}{2}}) \in \textrm{epi}(f). \end{aligned}

$$\square$$

We now specialize our attention to matrix-valued functions defined by a scalar convex function, as suggested in the introduction.

### 3.2 Matrix perspectives of operator functions

From any function $$\omega : \mathbb {R} \rightarrow \mathbb {R}$$, we can define its extension to the set of symmetric matrices, $$f_\omega : {\mathcal {S}}^n \rightarrow {\mathcal {S}}^n$$ as

\begin{aligned} f_\omega (\varvec{X}) = \varvec{U} {\text {Diag}}(\omega (\lambda _1^x),\dots , \omega (\lambda _n^x)) \varvec{U}^\top , \end{aligned}
(17)

where $$\varvec{X} = \varvec{U} {\text {Diag}}(\lambda _1^x,\dots , \lambda _n^x) \varvec{U}^\top$$ is an eigendecomposition of $$\varvec{X}$$. Functions of this form are called operator functions (see [13] for a general theory). In particular, one can show that $$f_\omega (\varvec{X})$$ is well-defined, i.e., its value does not depend on the choice of eigenbasis $$\varvec{U}$$ of $$\varvec{X}$$. Among other examples, taking $$\omega (x) = \exp (x)$$ (resp. $$\log (x)$$) provides a matrix generalization of the exponential (resp. logarithm) function; see Appendix A.1.
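Definition (17) can be implemented directly via numpy's symmetric eigendecomposition. A minimal sketch with two standard sanity checks (our choice of test matrices is arbitrary):

```python
import numpy as np

def operator_function(omega, X):
    """f_omega(X) as in (17): apply the scalar function omega to the
    eigenvalues of the symmetric matrix X, in its eigenbasis."""
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(omega(lam)) @ U.T

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
X = A + A.T  # symmetric test matrix

# omega(x) = x**2 recovers the matrix square X @ X ...
assert np.allclose(operator_function(np.square, X), X @ X)

# ... and omega = exp, log are mutually inverse on positive definite matrices.
P = A @ A.T + np.eye(4)
assert np.allclose(operator_function(np.exp, operator_function(np.log, P)), P)
```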

Central to our analysis is that we can explicitly characterize the closure of the matrix perspective of $$f_\omega$$ under some assumptions on $$\omega$$, i.e., define by continuity $$g_{f_\omega }(\varvec{X},\varvec{Y})$$ for rank-deficient matrices $$\varvec{Y}$$:

### Proposition 3

Consider a function $$\omega : \mathbb {R} \rightarrow \mathbb {R}$$ satisfying Assumption 1. Then, the closure of the matrix perspective of $$f_\omega$$ is, for any $$\varvec{X} \in {\mathcal {S}}^n$$, $$\varvec{Y} \in {\mathcal {S}}_+^n$$,

\begin{aligned} g_{f_\omega }(\varvec{X},\varvec{Y}) = {\left\{ \begin{array}{ll} \varvec{Y}^\frac{1}{2} f_\omega (\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}})\varvec{Y}^{\frac{1}{2}} &{} \text{ if } {\text {Span}}(\varvec{X}) \subseteq {\text {Span}}(\varvec{Y}), \varvec{Y} \succeq \varvec{0}, \\ +\infty &{} \text{ otherwise, } \end{array}\right. } \end{aligned}

where $$\varvec{Y}^{-\frac{1}{2}}$$ denotes the pseudo-inverse of the square root of $$\varvec{Y}$$.

### Remark 3

Note that in the expression of $$g_{f_\omega }$$ above, the matrix $$\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}$$ is unambiguously defined if and only if $${\text {Span}}(\varvec{X}) \subseteq {\text {Span}}(\varvec{Y})$$ (otherwise, its value depends on how we define the pseudo-inverse of $$\varvec{Y}^{\frac{1}{2}}$$ outside of its range). Accordingly, in the remainder of the paper, we omit the condition $${\text {Span}}(\varvec{X}) \subseteq {\text {Span}}(\varvec{Y})$$ whenever the analytic expression for $$g_{f_\omega }$$ explicitly involves $$\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}$$.

The proof of Proposition 3 is deferred to Appendix B.1. In the appendix, we also present an immediate extension where additional constraints, $$\varvec{X} \in {\mathcal {X}}$$, are imposed on the argument of $$f_\omega$$. As in our prior work [11], we will reformulate the rank constraints in (1) by introducing a projection matrix $$\varvec{Y}$$ to encode the span of $$\varvec{X}$$. Naturally, $$\varvec{Y}$$ should be rank-deficient. Hence, Proposition 3 ensures that $${\text {tr}}(g_{f_\omega }(\varvec{X},\varvec{Y})) < \infty$$ is a sufficient condition for $$\varvec{Y}$$ to indeed control $${\text {Span}}(\varvec{X})$$, i.e., for $${\text {Span}}(\varvec{X}) \subseteq {\text {Span}}(\varvec{Y})$$ to hold.

To gain intuition on how the matrix perspective function transforms $$\varvec{X}$$ and $$\varvec{Y}$$, we now provide an interesting connection between the matrix perspective of $$f_\omega$$ and the perspective of $$\omega$$ in the case where $$\varvec{X}$$ and $$\varvec{Y}$$ commute.

### Proposition 4

Consider two matrices $$\varvec{X} \in {\mathcal {S}}^n, \varvec{Y} \in {\mathcal {S}}^n_+$$ that commute and such that $$\textrm{Span}(\varvec{X}) \subseteq \textrm{Span}(\varvec{Y})$$. Hence, there exists an orthogonal matrix $$\varvec{U}$$ which jointly diagonalizes $$\varvec{X}$$ and $$\varvec{Y}$$. Let $$\lambda _1^x, \dots , \lambda _n^x$$ and $$\lambda _1^y, \dots , \lambda _n^y$$ denote the eigenvalues of $$\varvec{X}$$ and $$\varvec{Y}$$ respectively, ordered according to this basis $$\varvec{U}$$. Consider an operator function $$f_\omega$$ with $$\omega$$ satisfying Assumption 1. Then, we have that:

\begin{aligned} g_{f_\omega } (\varvec{X}, \varvec{Y}) = \varvec{U} {\text {Diag}}\left( g_\omega (\lambda _1^x, \lambda _1^y), \dots , g_\omega (\lambda _n^x, \lambda _n^y) \right) \varvec{U}^\top \end{aligned}

### Proof

By simultaneously diagonalizing $$\varvec{X}$$ and $$\varvec{Y}$$, we get

\begin{aligned} \varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}&=\varvec{U}\textrm{Diag}\left( \lambda _1^x/\lambda _1^y, \dots , \lambda _n^x/\lambda _n^y \right) \varvec{U}^\top , \\ f_\omega \left( \varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}} \right)&=\varvec{U}\textrm{Diag}\left( \omega (\lambda _1^x/\lambda _1^y), \dots , \omega (\lambda _n^x/\lambda _n^y) \right) \varvec{U}^\top , \\ \varvec{Y}^{\frac{1}{2}} f_\omega \left( \varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}} \right) \varvec{Y}^{\frac{1}{2}}&=\varvec{U}\textrm{Diag}\left( \lambda _1^y \omega (\lambda _1^x/\lambda _1^y), \dots , \lambda _n^y \omega (\lambda _n^x/\lambda _n^y) \right) \varvec{U}^\top . \end{aligned}

$$\square$$

Note that if $$\varvec{Y}$$ is a projection matrix such that $$\textrm{Span}(\varvec{X}) \subseteq \textrm{Span}(\varvec{Y})$$ then we necessarily have that $$\varvec{X}=\varvec{Y}\varvec{X}=\varvec{X}\varvec{Y}$$ and the assumptions of Proposition 4 hold.
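As a numerical sanity check (not part of the formal development), Proposition 4 can be verified on a small commuting example. The following NumPy sketch, on a hypothetical $$4 \times 4$$ instance with $$\omega (x) = x^2$$, compares the matrix-level definition of $$g_{f_\omega }$$ against its eigenvalue-level expression:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# X and Y commute because they share the eigenbasis U (hypothetical example).
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam_x = np.array([1.5, 0.7, 0.0, 0.0])   # eigenvalues of X, with Span(X) inside Span(Y)
lam_y = np.array([1.0, 1.0, 0.0, 0.0])   # Y is a rank-2 projection matrix
X = U @ np.diag(lam_x) @ U.T
Y = U @ np.diag(lam_y) @ U.T

def apply_spectral(fun, M):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    lam, V = np.linalg.eigh(M)
    return V @ np.diag(fun(lam)) @ V.T

# Matrix-side evaluation of g_{f_omega}(X, Y) with omega(x) = x^2, using pseudo-inverses.
Y_half = apply_spectral(lambda t: np.sqrt(np.clip(t, 0.0, None)), Y)
Y_inv_half = apply_spectral(
    lambda t: np.where(t > 1e-9, np.clip(t, 1e-12, None) ** -0.5, 0.0), Y)
g_matrix = Y_half @ apply_spectral(np.square, Y_inv_half @ X @ Y_inv_half) @ Y_half

# Eigenvalue-side evaluation: g_omega(x, y) = y * omega(x / y), extended by 0 at y = 0.
g_scalar = np.array([ly * (lx / ly) ** 2 if ly > 0 else 0.0
                     for lx, ly in zip(lam_x, lam_y)])
g_eig = U @ np.diag(g_scalar) @ U.T

assert np.allclose(g_matrix, g_eig)
```

Both evaluations agree up to floating-point error, and the trace reduces to $$\sum _i \lambda _i^y\, \omega (\lambda _i^x/\lambda _i^y)$$, as in the proof above.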

In the general case where $$\varvec{X}$$ and $$\varvec{Y}$$ do not commute, we cannot simultaneously diagonalize them and connect $$g_{f_\omega }$$ with $$g_\omega$$. However, we can still project $$\varvec{Y}$$ onto the space of matrices that commute with $$\varvec{X}$$ and obtain the following result when $$g_{f_\omega }$$ is matrix convex (proof deferred to Appendix B.2):

### Lemma 2

Let $$\varvec{X} \in {\mathcal {S}}^n$$ and $$\varvec{Y} \in {\mathcal {S}}_+^n$$ be matrices, and denote by $${\mathcal {C}}_{\varvec{X}} := \{ \varvec{M} \ : \ \varvec{MX} = \varvec{XM} \}$$ the commutant of $$\varvec{X}$$, i.e., the set of matrices which commute with $$\varvec{X}$$. For any matrix $$\varvec{M}$$, denote by $$\varvec{M}_{|\varvec{X}}$$ the orthogonal projection of $$\varvec{M}$$ onto $${\mathcal {C}}_{\varvec{X}}$$. Then, we have that

\begin{aligned} \varvec{Y}_{|\varvec{X}} \in {\mathcal {S}}^n_+,\ \text{ and } \ {\text {tr}}\left( \varvec{Y}_{|\varvec{X}} \right) ={\text {tr}}\left( \varvec{Y} \right) . \end{aligned}

Moreover, if $$\varvec{Y} \mapsto g_{f_\omega }(\varvec{X},\varvec{Y})$$ is matrix convex, then we have

\begin{aligned} {\text {tr}}\left[ g_{f_\omega }(\varvec{X}, \varvec{Y}_{|\varvec{X}}) \right] \le {\text {tr}}\left[ g_{f_\omega }(\varvec{X}, \varvec{Y}) \right] . \end{aligned}

### 3.3 The matrix perspective reformulation technique

Definition 1 and Proposition 3 supply the necessary language to lay out our Matrix Perspective Reformulation Technique (MPRT). We now state the technique; details regarding its implementation will become clearer throughout the paper.

Let us revisit Problem (1), and assume that the term $$\Omega (\varvec{X})$$ satisfies the following properties:

### Assumption 3

$$\Omega (\varvec{X})= {\text {tr}}\left( f_\omega (\varvec{X}) \right)$$, where $$\omega$$ is a function satisfying Assumption 1 and whose associated operator function, $$f_\omega$$, is matrix convex.

Assumption 3 implies that the regularizer can be rewritten as operating on the eigenvalues of $$\varvec{X}$$, $$\lambda _i(\varvec{X})$$, directly: $$\Omega (\varvec{X})= \sum _{i \in [n]} \omega (\lambda _i(\varvec{X}))$$. As we discuss in the next section, a broad class of functions satisfy this property. For ease of notation, we refer to $$f_\omega$$ as f in the remainder of the paper (and accordingly denote by $$g_f$$ its matrix perspective function).
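As a small illustration of this identity (with the hypothetical choice $$\omega (x) = x^2$$, so that $$f_\omega (\varvec{X}) = \varvec{X}^2$$), the following NumPy sketch checks that $${\text {tr}}(f_\omega (\varvec{X})) = \sum _{i \in [n]} \omega (\lambda _i(\varvec{X}))$$ on a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
X = (A + A.T) / 2                       # a generic symmetric matrix

# With omega(x) = x^2 we have f_omega(X) = X^2, so tr(f_omega(X))
# should equal the sum of omega over the eigenvalues of X.
trace_f = np.trace(X @ X)
sum_omega = np.sum(np.linalg.eigvalsh(X) ** 2)
assert np.isclose(trace_f, sum_omega)
```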

After letting an orthogonal projection matrix $$\varvec{Y}$$ model the rank of $$\varvec{X}$$—as per [11]—Problem (1) admits the equivalent mixed-projection reformulation:

\begin{aligned} \min _{\varvec{Y} \in {\mathcal {Y}}^k_n}\min _{\varvec{X} \in {\mathcal {S}}^n} \quad&\langle \varvec{C}, \varvec{X} \rangle +\mu \cdot \textrm{tr}(\varvec{Y})+\textrm{tr}(f(\varvec{X}))\nonumber \\ \text {s.t.} \quad&\langle \varvec{A}_i, \varvec{X}\rangle =b_i \quad \forall i \in [m], \ \varvec{X}=\varvec{Y}\varvec{X}, \ \varvec{X} \in {\mathcal {K}}, \end{aligned}
(18)

where $$\varvec{Y} \in {\mathcal {Y}}^k_n$$ is the set of $$n \times n$$ orthogonal projection matrices with trace at most k:

\begin{aligned} {\mathcal {Y}}^k_n := \left\{ \varvec{Y} \in {\mathcal {S}}_+^n : \varvec{Y}^2 = \varvec{Y},\ \textrm{tr}(\varvec{Y}) \le k \right\} . \end{aligned}

Note that for $$k \in \mathbb {N}$$, the convex hull of $${\mathcal {Y}}^k_n$$ is given by $$\textrm{Conv}({\mathcal {Y}}^k_n)=\{\varvec{Y} \in {\mathcal {S}}^n_{+}: \varvec{Y} \preceq \mathbb {I}, \textrm{tr}(\varvec{Y}) \le k\}$$, which is a well-studied object in its own right [52, 60,61,62].
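The description of $$\textrm{Conv}({\mathcal {Y}}^k_n)$$ can be illustrated numerically. The sketch below (a minimal NumPy example, not a proof) builds a convex combination of two random rank-k projection matrices and checks that it satisfies $$\varvec{0} \preceq \varvec{Y} \preceq \mathbb {I}$$ and $$\textrm{tr}(\varvec{Y}) \le k$$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 2

def random_projection(n, k, rng):
    """A random member of Y^k_n: orthogonal projection onto k random directions."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return Q @ Q.T

# Any convex combination of rank-k projections lands in the stated convex hull.
Y = 0.3 * random_projection(n, k, rng) + 0.7 * random_projection(n, k, rng)
lam = np.linalg.eigvalsh(Y)
assert lam.min() >= -1e-9 and lam.max() <= 1 + 1e-9   # 0 <= Y <= I (in the PSD order)
assert np.trace(Y) <= k + 1e-9                         # tr(Y) <= k
```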

Since $$\varvec{Y}$$ is an orthogonal projection matrix, imposing the nonlinear constraint $$\varvec{X}=\varvec{Y}\varvec{X}$$ and introducing the term $$\Omega (\varvec{X})=\textrm{tr}(f(\varvec{X}))$$ in the objective is equivalent to introducing the following term in the objective:

\begin{aligned} \textrm{tr}(g_f(\varvec{X}, \varvec{Y}))+(n-\textrm{tr}(\varvec{Y}))\omega (0), \end{aligned}

where $$g_f$$ is the matrix perspective of f, and thus Problem (18) is equivalent to:

\begin{aligned} \min _{\varvec{Y} \in {\mathcal {Y}}^k_n}\min _{\varvec{X} \in {\mathcal {S}}^n} \quad&\langle \varvec{C}, \varvec{X} \rangle +\mu \cdot \textrm{tr}(\varvec{Y})+\textrm{tr}(g_{f}(\varvec{X}, \varvec{Y}))+(n-\textrm{tr}(\varvec{Y}))\omega (0) \nonumber \\ \text {s.t.} \quad&\langle \varvec{A}_i, \varvec{X}\rangle =b_i \quad \forall i \in [m], \ \varvec{X} \in {\mathcal {K}}. \end{aligned}
(19)

Let us formally state and verify the equivalence between Problems (18)–(19) via:

### Theorem 1

Problems (18)–(19) attain the same optimal objective value.

### Proof

It suffices to show that for any feasible solution to (18) we can construct a feasible solution to (19) with an equal or lower cost, and vice versa:

• Let $$(\varvec{X}, \varvec{Y})$$ be a feasible solution to (18). Since $$\varvec{X} = \varvec{Y} \varvec{X} \in {\mathcal {S}}^n$$, $$\varvec{X}$$ and $$\varvec{Y}$$ commute. Hence, by Proposition 4, we have (using the same notation as in Proposition 4):

\begin{aligned} {\text {tr}}\left( g_f(\varvec{X}, \varvec{Y}) \right) = \sum _{i \in [n]} g_\omega \left( \lambda ^x_i,\lambda ^y_i \right) = \sum _{i \in [n]} 1\left\{ \lambda _i^y>0\right\} \omega (\lambda ^x_i), \end{aligned}

where $$1\left\{ \lambda _i^y>0\right\}$$ is an indicator function which denotes whether the ith eigenvalue of $$\varvec{Y}$$ (which is either 0 or 1) is strictly positive. Moreover, since $$\varvec{X} = \varvec{Y}\varvec{X}$$, $$\lambda _i^y = 0 \implies \lambda _i^x = 0$$ and

\begin{aligned} {\text {tr}}\left( f(\varvec{X})\right)&= \sum _{i\in [n]} \omega (\lambda _i^x) = {\text {tr}}\left( g_f(\varvec{X}, \varvec{Y}) \right) + \sum _{i \in [n]} 1\left\{ \lambda _i^y=0\right\} \omega (0) \nonumber \\&= {\text {tr}}\left( g_f(\varvec{X}, \varvec{Y}) \right) + (n-\textrm{tr}(\varvec{Y}))\omega (0). \end{aligned}
(20)

This establishes that $$(\varvec{X}, \varvec{Y})$$ is feasible in (19) with the same cost.

• Let $$(\varvec{X}, \varvec{Y})$$ be a feasible solution to (19) with finite objective value. Then, it follows that $$\textrm{Span}(\varvec{X}) \subseteq \textrm{Span}(\varvec{Y})$$, which implies that $$\varvec{X}=\varvec{Y}\varvec{X}$$ since $$\varvec{Y}$$ is a projection matrix. Therefore, (20) holds, which establishes that $$(\varvec{X}, \varvec{Y})$$ is feasible in (18) with the same cost.

$$\square$$
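As an aside, identity (20), the computational core of the proof, can be checked numerically. The sketch below works in a shared eigenbasis (so that Proposition 4 applies) and uses the hypothetical choice $$\omega = \exp$$, for which $$\omega (0) = 1 \ne 0$$, so that the correction term $$(n-\textrm{tr}(\varvec{Y}))\omega (0)$$ is actually exercised:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 5, 2
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam_x = np.array([0.8, -0.3, 0.0, 0.0, 0.0])   # X = YX: zero wherever Y is zero
lam_y = np.array([1.0, 1.0, 0.0, 0.0, 0.0])    # Y in Y^k_n, tr(Y) = 2
X = U @ np.diag(lam_x) @ U.T
omega = np.exp                                  # omega(0) = 1, a nonzero value

tr_f = np.sum(omega(np.linalg.eigvalsh(X)))    # tr(f(X)) = sum of omega(lambda_i(X))
# tr(g_f(X, Y)) via Proposition 4: only eigenvalues on the support of Y contribute.
tr_g = sum(ly * omega(lx / ly) for lx, ly in zip(lam_x, lam_y) if ly > 0)
assert np.isclose(tr_f, tr_g + (n - lam_y.sum()) * omega(0.0))   # identity (20)
```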

Finally, relaxing $$\varvec{Y} \in {\mathcal {Y}}^k_n$$ to its convex hull in Problem (19) supplies relaxations as strong as, and sometimes significantly stronger than, those obtained by any other technique we are aware of, as we explore in Sect. 4.

### Remark 4

Note that, based on the proof of Theorem 1, we could replace $$g_f(\varvec{X}, \varvec{Y})$$ in (19) by any function $$\tilde{g}(\varvec{X}, \varvec{Y})$$ such that $$g_f(\varvec{X}, \varvec{Y}) = \tilde{g}(\varvec{X}, \varvec{Y})$$ for $$\varvec{X}, \varvec{Y}$$ that commute, with no impact on the objective value. However, it might impact tractability if $$\tilde{g}(\varvec{X}, \varvec{Y})$$ is not convex in $$(\varvec{X}, \varvec{Y})$$.

### Remark 5

Under Assumption 3, the regularization term $$\Omega (\varvec{X})$$ penalizes all eigenvalues of $$\varvec{X}$$ equally, through $$\omega$$. The MPRT can be extended to a wider class of regularization functions that penalize the largest eigenvalues more heavily, at the price of (a significant amount of) additional notation. For brevity, we lay out this extension in Appendix C.

Theorem 1 only uses the fact that f is an operator function with $$\omega$$ satisfying Assumption 1, not the fact that f is matrix convex. In other words, (19) is always an equivalent reformulation of (18). An interesting question is to identify necessary conditions for the objective of (19) to be convex in $$(\varvec{X},\varvec{Y})$$; f being matrix convex is clearly sufficient. The objective in (19) is convex only as long as $${\text {tr}}\left( g_f \right)$$ is. Interestingly, this is not equivalent to the convexity of $$\textrm{tr}(f)$$; see Appendix B.3 for a counter-example. It is, however, an open question whether a weaker notion than matrix convexity could ensure the joint convexity of $$\textrm{tr}(g_f)$$. It would also be interesting to investigate the benefits and the tractability of non-convex penalties (either by taking f not matrix convex or $$\omega$$ non-convex), given the successes of non-convex penalty functions in sparse regression problems [30, 79].

### 3.4 Convex hulls of low-rank sets and the MPRT

We now show that, for a general class of low-rank sets, applying the MPRT is equivalent to taking the convex hull of the set. This is significant, because we are not aware of any general-purpose techniques for taking convex hulls of low-rank sets. Formally, we have the following result:

### Theorem 2

Consider an operator function $$f=f_\omega$$ satisfying Assumption 3. Let

\begin{aligned} {\mathcal {T}}=\left\{ \varvec{X} \in {\mathcal {S}}^n:\ \textrm{tr}(f(\varvec{X}))+\mu \cdot \textrm{Rank}(\varvec{X}) \le t, \textrm{Rank}(\varvec{X}) \le k \right\} \end{aligned}
(21)

be a set where $$t \in \mathbb {R},k \in \mathbb {N}$$ are fixed. Then, an extended formulation of the convex hull of $${\mathcal {T}}$$ is given by:

\begin{aligned} {\mathcal {T}}^c=&\left\{ (\varvec{X}, \varvec{Y}) \in {\mathcal {S}}^n \times \textrm{Conv}({\mathcal {Y}}^k_n): \textrm{tr}(g_f(\varvec{X}, \varvec{Y}))+\mu \cdot \textrm{tr}(\varvec{Y})+(n-\textrm{tr}(\varvec{Y}))\omega (0) \le t \right\} , \end{aligned}
(22)

where $$\textrm{Conv}({\mathcal {Y}}^k_n)=\{\varvec{Y} \in {\mathcal {S}}^n_+: \varvec{Y} \preceq \mathbb {I}, \textrm{tr}(\varvec{Y}) \le k \}$$ is the convex hull of trace-k projection matrices, and $$g_f$$ is the matrix perspective function of f.

### Proof

We prove the two directions sequentially:

• $$\textrm{Conv}\left( {\mathcal {T}}\right) \subseteq {\mathcal {T}}^c$$: let $$\varvec{X} \in {\mathcal {T}}$$. Then, since the rank of $$\varvec{X}$$ is at most k, there exists some $$\varvec{Y} \in {\mathcal {Y}}^k_n$$ such that $$\varvec{X}=\varvec{Y}\varvec{X}$$ and $$\textrm{tr}(\varvec{Y})=\textrm{Rank}(\varvec{X})$$. Moreover, by the same argument as in the proof of Theorem 1, it follows that (20) holds and $$\textrm{tr}(g_f(\varvec{X}, \varvec{Y}))+\mu \cdot \textrm{tr}(\varvec{Y})+(n-\textrm{tr}(\varvec{Y}))\omega (0) \le t$$, which confirms that $$(\varvec{X}, \varvec{Y}) \in {\mathcal {T}}^c$$. Since $${\mathcal {T}}^c$$ is a convex set, we therefore have $$\textrm{Conv}\left( {\mathcal {T}}\right) \subseteq {\mathcal {T}}^c$$.

• $${\mathcal {T}}^c \subseteq \textrm{Conv}\left( {\mathcal {T}}\right)$$: let $$(\varvec{X}, \varvec{Y}) \in {\mathcal {T}}^c$$. Denote by $$\varvec{Y}_{|\varvec{X}}$$ the projection of $$\varvec{Y}$$ onto the set of matrices that commute with $$\varvec{X}$$, $$\{ \varvec{M} \ : \ \varvec{XM} = \varvec{MX} \}$$. By Lemma 2, we have that $$\varvec{Y}_{|\varvec{X}} \in {\text {Conv}}({\mathcal {Y}}_n^k)$$ and $${\text {tr}}\left( g_{f}(\varvec{X}, \varvec{Y}_{|\varvec{X}}) \right) \le {\text {tr}}\left( g_{f}(\varvec{X}, \varvec{Y}) \right) < \infty$$, so $$(\varvec{X}, \varvec{Y}_{|\varvec{X}}) \in {\mathcal {T}}^c$$ as well. Hence, without loss of generality, by renaming $$\varvec{Y} \leftarrow \varvec{Y}_{|\varvec{X}}$$, we can assume that $$\varvec{X}$$ and $$\varvec{Y}$$ commute. Then, it follows from Proposition 4 that the vectors of eigenvalues of $$\varvec{X}$$ and $$\varvec{Y}$$ (ordered according to a shared eigenbasis $$\varvec{U}$$), $$\left( \varvec{\lambda }(\varvec{X}), \varvec{\lambda }(\varvec{Y})\right)$$, belong to the set

\begin{aligned}&\left\{ (\varvec{x}, \varvec{y}) \in \mathbb {R}^n \times [0,1]^n: \sum _i y_i \le k, \sum _{i=1}^n y_i \omega \left( \tfrac{x_i}{y_i}\right) +\mu \sum _i y_i +\left( n-\sum _i y_i\right) \omega (0) \le t \right\} , \end{aligned}

which is the closure of the convex hull of

\begin{aligned} {\mathcal {U}}^c:=&\left\{ (\varvec{x}, \varvec{y}) \in \mathbb {R}^n \times \{0,1\}^n: \sum _i y_i \le k,\sum _{i=1}^n \omega ({x_i})+\mu \sum _i y_i \le t, x_i = 0 \ \text {if} \ y_i = 0 \ \forall i \in [n]\right\} , \end{aligned}

as proved by [76, Theorem 3] for the case $$\omega (0) = 0$$. We provide a self-contained proof of the generalization of this statement to $$\omega (0) \ne 0$$ in Appendix B.5 (Proposition 8). Let us decompose $$(\varvec{\lambda }(\varvec{X}), \varvec{\lambda }(\varvec{Y}))$$ into $$\varvec{\lambda }(\varvec{X}) = \sum _{j} \alpha _j \varvec{x}^{(j)}$$, $$\varvec{\lambda }(\varvec{Y}) = \sum _{j} \alpha _j \varvec{y}^{(j)}$$, with $$\alpha _j \ge 0$$, $$\sum _j \alpha _j = 1$$, and $$(\varvec{x}^{(j)}, \varvec{y}^{(j)}) \in {\mathcal {U}}^c$$. By definition,

\begin{aligned} \varvec{T}^{(j)} := \varvec{U} \textrm{Diag}(\varvec{x}^{(j)}) \varvec{U}^\top \in {\mathcal {T}}, \end{aligned}

and $$\varvec{X} = \sum _j \alpha _j \varvec{T}^{(j)}$$. Therefore, we have that $$\varvec{X} \in \textrm{Conv}({\mathcal {T}})$$, as required. $$\square$$

### Remark 6

Since linear optimization problems over convex sets admit extremal optima, Theorem 2 demonstrates that unconstrained low-rank problems with spectral objectives can be recast as linear semidefinite problems, where the rank constraint is dropped without loss of optimality. This suggests that work on hidden convexity in low-rank optimization, i.e., deriving conditions under which low-rank linear optimization problems admit exact relaxations where the rank constraint is omitted (see, e.g., [12, 62, 75]), could be extended to incorporate spectral functions.

### 3.5 Examples of the matrix perspective function

Theorem 2 demonstrates that, for spectral functions under low-rank constraints, taking the matrix perspective is equivalent to taking the convex hull. To highlight the utility of Theorems 1 and 2, we therefore supply the perspective functions of some spectral regularization functions which frequently arise in the low-rank matrix literature, and summarize them in Table 2. We also discuss how these functions and their perspectives can be efficiently optimized over. Note that all functions introduced in this section are either matrix convex or the trace of a matrix convex function, and thus supply valid convex relaxations when used as regularizers for the MPRT.

Spectral constraint: Let $$\omega (x) = 0$$ if $$|x| \le M$$, $$+\infty$$ otherwise. Then,

\begin{aligned} f(\varvec{X})={\left\{ \begin{array}{ll} {\varvec{0}} &{} \text {if} \quad \Vert \varvec{X}\Vert _\sigma \le M, \\ +\infty &{} \text {otherwise}, \end{array}\right. } \end{aligned}

for $$\varvec{X} \in {\mathcal {S}}^n$$, where $$\Vert \cdot \Vert _\sigma$$ denotes the spectral norm, i.e., the largest eigenvalue of $$\varvec{X}$$ in absolute magnitude. Observe that the condition $$\Vert \varvec{X}\Vert _\sigma \le M$$ can be expressed via the semidefinite constraints $$-M\mathbb {I} \preceq \varvec{X}\preceq M\mathbb {I}$$. The perspective function $$g_f$$ can then be expressed as

\begin{aligned} g_f(\varvec{X}, \varvec{Y}) ={\left\{ \begin{array}{ll} {\varvec{0}} &{} \text {if} \quad -M \varvec{Y} \preceq \varvec{X} \preceq M \varvec{Y}, \\ +\infty &{} \text {otherwise}. \end{array}\right. } \end{aligned}

If $$\varvec{X}$$ and $$\varvec{Y}$$ commute, finiteness of $$g_f(\varvec{X},\varvec{Y})$$ requires that $$\vert \lambda _j(\varvec{X}) \vert \le M \lambda _j(\varvec{Y}) \ \forall j \in [n]$$: the spectral analog of a big-M constraint. This constraint can be modeled using two semidefinite cones, and thus handled by semidefinite solvers.
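A minimal NumPy sketch of this equivalence (on a hypothetical $$4 \times 4$$ commuting example): the eigenvalue conditions $$\vert \lambda _j(\varvec{X}) \vert \le M \lambda _j(\varvec{Y})$$ translate into the two semidefinite constraints $$M\varvec{Y} - \varvec{X} \succeq \varvec{0}$$ and $$M\varvec{Y} + \varvec{X} \succeq \varvec{0}$$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, M = 4, 2.0
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam_y = np.array([1.0, 1.0, 0.0, 0.0])          # Y: rank-2 projection matrix
lam_x = np.array([1.4, -1.9, 0.0, 0.0])         # |lambda_j(X)| <= M * lambda_j(Y)
X = U @ np.diag(lam_x) @ U.T
Y = U @ np.diag(lam_y) @ U.T

# -M*Y <= X <= M*Y (in the PSD order) holds iff both M*Y - X and M*Y + X are PSD.
assert np.linalg.eigvalsh(M * Y - X).min() >= -1e-9
assert np.linalg.eigvalsh(M * Y + X).min() >= -1e-9
```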

Convex quadratic: For $$\omega (x) = x^2$$, $$f(\varvec{X})=\varvec{X}^\top \varvec{X}$$. Then, the perspective function $$g_f$$ is

\begin{aligned} g_f(\varvec{X},\varvec{Y}) = {\left\{ \begin{array}{ll} \varvec{X}^\top \varvec{Y}^ \dag \varvec{X} &{} \text{ if } \varvec{Y} \succeq \varvec{0}, \\ +\infty &{} \text{ otherwise. } \end{array}\right. } \end{aligned}

Observe that this function’s epigraph is semidefinite-representable. Indeed, by the Schur complement lemma [16, Equation 2.41], minimizing the trace of $$g_f(\varvec{X}, \varvec{Y})$$ is equivalent to solving

\begin{aligned} \min _{\varvec{\theta } \in {\mathcal {S}}^n, \varvec{Y} \in {\mathcal {S}}^n, \varvec{X} \in {\mathcal {S}}^n}\quad \textrm{tr}(\varvec{\theta }) \quad \text {s.t.} \quad \begin{pmatrix} \varvec{\theta } &{}\quad \varvec{X}\\ \varvec{X}^\top &{}\quad \varvec{Y}\end{pmatrix} \succeq \varvec{0}. \end{aligned}

Interestingly, this perspective function allows us to rewrite the rank-k SVD problem

\begin{aligned} \min _{\varvec{X} \in \mathbb {R}^{n \times m}} \quad \Vert \varvec{X}-\varvec{A}\Vert _F^2: \ \textrm{Rank}(\varvec{X})\le k \end{aligned}

as a linear optimization problem over the set of orthogonal projection matrices, which implies that the orthogonal projection constraint can be relaxed to its convex hull without loss of optimality (since some extremal solution will be optimal for the relaxation). This is significant, because while rank-k SVD is commonly thought of as a non-convex problem which “surprisingly” admits a closed-form solution, the MPRT shows that it actually admits an exact convex reformulation:

\begin{aligned} \min _{\varvec{X}, \varvec{Y}, \varvec{\theta }} \quad \frac{1}{2}\textrm{tr}(\varvec{\theta })-\langle \varvec{A}, \varvec{X}\rangle +\frac{1}{2}\Vert \varvec{A}\Vert _F^2 \ \text {s.t.} \ \varvec{Y} \preceq \mathbb {I}, \ \textrm{tr}(\varvec{Y}) \le k, \begin{pmatrix} \varvec{\theta } &{}\quad \varvec{X}\\ \varvec{X}^\top &{}\quad \varvec{Y}\end{pmatrix} \succeq \varvec{0}. \end{aligned}

Note that, in the above formulation, we extended our results for symmetric matrices to rectangular matrices $$\varvec{X} \in \mathbb {R}^{n \times m}$$ without justification. We rigorously derive this extension for $$f(\varvec{X}) = \varvec{X}^\top \varvec{X}$$ in Appendix D and defer the study of the general case to future research.
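For symmetric positive semidefinite data (where the results above apply directly), the exactness of this reformulation at the classical eigendecomposition solution can be checked numerically. In the hypothetical $$5 \times 5$$ sketch below, we evaluate the convex objective at the top-k eigenpair solution and its projection matrix, and recover the scaled approximation error $$\tfrac{1}{2}\Vert \varvec{A}-\varvec{X}\Vert _F^2$$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 5, 2
B = rng.standard_normal((n, n))
A = B @ B.T                                     # a PSD matrix to approximate

lam, V = np.linalg.eigh(A)                      # eigenvalues in ascending order
idx = np.argsort(lam)[::-1][:k]                 # indices of the top-k eigenpairs
Xk = V[:, idx] @ np.diag(lam[idx]) @ V[:, idx].T   # classical rank-k solution
Y = V[:, idx] @ V[:, idx].T                     # its projection matrix, Y in Y^k_n
theta = Xk @ np.linalg.pinv(Y) @ Xk             # theta = X Y^dagger X from the Schur block

# The convex objective of the reformulation, evaluated at (Xk, Y, theta),
# matches (1/2) * ||A - Xk||_F^2, the scaled rank-k approximation error.
persp_obj = 0.5 * np.trace(theta) - np.sum(A * Xk) + 0.5 * np.linalg.norm(A, "fro") ** 2
assert np.isclose(persp_obj, 0.5 * np.linalg.norm(A - Xk, "fro") ** 2)
```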

Convex quadratic with a spectral constraint: Combining the two previous examples, let

\begin{aligned} f(\varvec{X})={\left\{ \begin{array}{ll} \varvec{X}^\top \varvec{X} &{} \text {if} \quad \Vert \varvec{X}\Vert _\sigma \le M,\\ +\infty &{} \text {otherwise}, \end{array}\right. } \end{aligned}

for $$\varvec{X} \in {\mathcal {S}}^n$$. Then, the perspective function $$g_f$$ is

\begin{aligned} g_f(\varvec{X}, \varvec{Y})={\left\{ \begin{array}{ll} \varvec{X}^\top \varvec{Y}^\dag \varvec{X} &{} \text {if} \quad -M \varvec{Y} \preceq \varvec{X} \preceq M \varvec{Y}, \\ +\infty &{} \text {otherwise}. \end{array}\right. } \end{aligned}

This can be interpreted as the spectral analog of combining a big-M and a ridge penalty.

Convex quadratic over completely positive cone: Consider the following optimization problem

\begin{aligned} \min _{\varvec{X} \in {\mathcal {S}}^n} \, \textrm{tr}(\varvec{X}^\top \varvec{X}) \text{ s.t. } \varvec{X} \in {\mathcal {C}}^n_+, \end{aligned}

where $${\mathcal {C}}^n_+=\{\varvec{X}: \varvec{X}=\varvec{U}\varvec{U}^\top , \varvec{U} \in \mathbb {R}^{n \times n}_+\} \subseteq {\mathcal {S}}^n_+$$ denotes the completely positive cone. Then, by denoting $$f(\varvec{X}) = \varvec{X}^\top \varvec{X}$$ and $$g_f$$ its perspective function, we obtain a valid relaxation by minimizing $$\textrm{tr}(g_f)$$, which, by the Schur complement lemma (see [16, Equation 2.41]), can be reformulated as

\begin{aligned} \min _{\varvec{\theta } \in {\mathcal {S}}^n, \varvec{Y} \in {\mathcal {S}}^n, \varvec{X} \in {\mathcal {S}}^n}\quad \textrm{tr}(\varvec{\theta }) \quad \text {s.t.} \quad \begin{pmatrix} \varvec{\theta } &{} \varvec{X}\\ \varvec{X}^\top &{} \varvec{Y}\end{pmatrix} \in {\mathcal {S}}^{2n}_+, \varvec{X} \in {\mathcal {C}}^n_+. \end{aligned}

Unfortunately, this formulation cannot be tractably optimized over, since separating over the completely positive cone is NP-hard. However, by relaxing the completely positive cone to the doubly non-negative cone—$${\mathcal {S}}^n_+ \cap \mathbb {R}^{n \times n}_{+}$$—we obtain a tractable and near-exact relaxation. Indeed, as we shall see in our numerical experiments, combining this relaxation with a state-of-the-art heuristic supplies certifiably near-optimal solutions in both theory and practice.

Note that we could have obtained an alternative relaxation by instead considering the perspective of

\begin{aligned} f(\varvec{X})={\left\{ \begin{array}{ll} \varvec{X}^\top \varvec{X} &{} \text {if} \quad \varvec{X} \in {\mathcal {C}}^n_+,\\ +\infty &{} \text {otherwise}. \end{array}\right. } \end{aligned}

### Remark 7

One can obtain a nearly identical formulation over the copositive cone (c.f. [17]).

Power: Let $$f(\varvec{X})=\varvec{X}^\alpha$$ for $$\alpha \in [0, 1]$$ and $$\varvec{X} \in {\mathcal {S}}^n_+$$. The matrix perspective function is

\begin{aligned} g_f(\varvec{X},\varvec{Y}) = {\left\{ \begin{array}{ll} { \varvec{Y}^{1/2} \left( \varvec{Y}^{-1/2} \varvec{X} \varvec{Y}^{-1/2} \right) ^{\alpha } \varvec{Y}^{1/2}} &{} \text{ if } \varvec{Y}^{-\frac{1}{2}} \varvec{X}\varvec{Y}^{-\frac{1}{2}} \in {\mathcal {S}}^n_+, \varvec{Y} \succeq \varvec{0}, \\ +\infty &{} \text{ otherwise. } \end{array}\right. } \end{aligned}

The expression above simplifies to $$\varvec{X}^\alpha \varvec{Y}^{1-\alpha }$$ when $$\varvec{X}$$ and $$\varvec{Y}$$ commute and, per Remark 4, this simpler expression can be used equivalently for optimization purposes, even when $$\varvec{X}$$ and $$\varvec{Y}$$ do not commute.
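A quick NumPy check of this commuting-case simplification, using a hypothetical spectral power helper that sets $$0^p := 0$$ in the spirit of the pseudo-inverse convention:

```python
import numpy as np

rng = np.random.default_rng(7)
n, alpha = 4, 0.5
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = U @ np.diag([2.0, 0.5, 0.0, 0.0]) @ U.T
Y = U @ np.diag([1.0, 1.0, 0.0, 0.0]) @ U.T     # commuting, Span(X) inside Span(Y)

def spow(M, p):
    """Matrix power through the eigenvalues, with 0^p := 0 (pseudo-inverse style)."""
    lam, V = np.linalg.eigh(M)
    lam = np.clip(lam, 0.0, None)
    out = np.zeros_like(lam)
    mask = lam > 1e-9
    out[mask] = lam[mask] ** p
    return V @ np.diag(out) @ V.T

# Matrix perspective of X^alpha versus the commuting-case simplification.
lhs = spow(Y, 0.5) @ spow(spow(Y, -0.5) @ X @ spow(Y, -0.5), alpha) @ spow(Y, 0.5)
rhs = spow(X, alpha) @ spow(Y, 1 - alpha)
assert np.allclose(lhs, rhs)
```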

### Remark 8

(Matrix Power Cone) This function’s hypograph, the matrix power cone, i.e.,

\begin{aligned} {\mathcal {K}}^{\text {pow}, \alpha }_{\text {mat}}=\left\{ (\varvec{X}, \varvec{Y}, \varvec{\theta }) \in {\mathcal {S}}^n_+\times {\mathcal {S}}^n_{+} \times {\mathcal {S}}^n: \varvec{Y}^{\frac{1-\alpha }{2}}\varvec{X}^\alpha \varvec{Y}^{\frac{1-\alpha }{2}} \succeq \varvec{\theta }\right\} \end{aligned}

is a closed convex cone which is semidefinite representable for any rational $$\alpha \in [0, 1]$$ [32, Equation 3]. Consequently, it is a tractable object which successfully models the matrix power function (and its perspective), and we shall make repeated use of it when we apply the MPRT to several important low-rank problems in Sect. 4.

Logarithm: Let $$f(\varvec{X})=-\log (\varvec{X})$$ be the matrix logarithm function. We have that

\begin{aligned} g_f(\varvec{X}, \varvec{Y})={\left\{ \begin{array}{ll} - \varvec{Y}^\frac{1}{2}\log \left( \varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}\right) \varvec{Y}^\frac{1}{2} &{} \text{ if } \varvec{X}, \varvec{Y} \succ \varvec{0}, \\ +\infty &{} \text{ otherwise. } \end{array}\right. } \end{aligned}

Observe that when $$\varvec{X}$$ and $$\varvec{Y}$$ commute, $$g_f(\varvec{X},\varvec{Y})$$ can be rewritten as $$\varvec{Y}(\log (\varvec{Y}) - \log (\varvec{X}))$$, whose trace is the (Umegaki) quantum relative entropy function (see [33] for a general theory). We remark that the domain of $$\log (\varvec{X})$$ requires that $$\varvec{X}$$ is full-rank, which at first glance makes the use of this function problematic for low-rank optimization. Accordingly, we consider the $$\epsilon$$-logarithm function, i.e., $$\log _{\epsilon }(\varvec{X})=\log (\varvec{X} + \epsilon \mathbb {I})$$ for $$\epsilon >0$$, as advocated by [35] in a different context. Background on matrix exponential and logarithm functions can be found in Appendix A.1.

Observe that $$\textrm{tr}(\log (\varvec{X}))=\log \det (\varvec{X})$$, while $$\textrm{tr}(g_f)=\textrm{tr}(\varvec{Y}(\log (\varvec{Y})-\log (\varvec{X})))$$. Thus, the matrix logarithm and its trace recover the concavity of the log-determinant function (which has numerous applications in low-rank problems [35] and interior point methods [67], among others), while the perspective of the matrix logarithm provides an elementary proof of the convexity of the quantum relative entropy: a task for which perspective-free proofs are technically demanding [29].
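The commuting-case trace can again be cross-checked against Proposition 4 (with $$\omega (x) = -\log (x)$$); a NumPy sketch on hypothetical full-rank commuting matrices:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam_x = np.array([1.5, 0.8, 0.4, 0.2])          # X positive definite
lam_y = np.array([0.6, 0.3, 0.9, 1.2])          # Y positive definite, commuting with X
X = U @ np.diag(lam_x) @ U.T
Y = U @ np.diag(lam_y) @ U.T

def slog(M):
    """Matrix logarithm of a positive definite matrix via its eigendecomposition."""
    lam, V = np.linalg.eigh(M)
    return V @ np.diag(np.log(lam)) @ V.T

# Commuting-case expression Y(log Y - log X) versus Proposition 4 with omega = -log:
tr_g = np.trace(Y @ (slog(Y) - slog(X)))
assert np.isclose(tr_g, np.sum(-lam_y * np.log(lam_x / lam_y)))
```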

Von Neumann entropy: Let $$f(\varvec{X})=\varvec{X}\log (\varvec{X})$$ denote the von Neumann quantum entropy of a density matrix $$\varvec{X}$$. Then, its perspective function is $$g_f(\varvec{X}, \varvec{Y})= \varvec{X}\varvec{Y}^{-\frac{1}{2}}\log (\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}})\varvec{Y}^\frac{1}{2}$$. When $$\varvec{X}$$ and $$\varvec{Y}$$ commute, this perspective can be equivalently written as

\begin{aligned} g_f(\varvec{X}, \varvec{Y})={\left\{ \begin{array}{ll} \varvec{X}^\frac{1}{2}\log (\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}})\varvec{X}^\frac{1}{2} &{} \text{ if } \varvec{X}, \varvec{Y} \succ \varvec{0}, \\ +\infty &{} \text{ otherwise. } \end{array}\right. } \end{aligned}

Note that various generalizations of the relative entropy for matrices have been proposed in the quantum physics literature [45]. However, these different definitions agree on the set of commuting matrices, hence can be used interchangeably for optimization purposes (see Remark 4).

### Remark 9

(Quantum relative entropy cone) Note the epigraph of $$g_f$$, namely,

\begin{aligned} {\mathcal {K}}^{\text {op, rel}}_{\text {mat}}=\left\{ (\varvec{X}_1, \varvec{X}_2, \varvec{X}_3) \in {\mathcal {S}}^n \times {\mathcal {S}}^n_{++} \times {\mathcal {S}}^n_{++}: \varvec{X}_1 \succeq -\varvec{X}_2^{\frac{1}{2}}\log (\varvec{X}_2^{-\frac{1}{2}}\varvec{X}_3\varvec{X}_2^{-\frac{1}{2}})\varvec{X}_2^{\frac{1}{2}}\right\} , \end{aligned}

is a convex cone which can be approximated using semidefinite cones and optimized over using either the Matlab package CVXQuad (see [33]) or, directly, an interior point method for asymmetric cones [49]. Consequently, this is a tractable object which models the matrix logarithm and von Neumann entropy (and their perspectives).

Finally, Table 2 relates the matrix perspectives discussed above with their scalar analogs.

### 3.6 Matrix perspective cuts

We now generalize the perspective cuts of [37, 42] from vectors to matrices and cardinality to rank constraints. Let us reconsider the previously defined mixed-projection optimization problem:

\begin{aligned} \min _{\varvec{Y} \in {\mathcal {Y}}^k_n}&\min _{\varvec{X} \in {\mathcal {S}}^n_+} \quad \langle \varvec{C}, \varvec{X}\rangle + \mu \cdot \textrm{tr}(\varvec{Y})+\textrm{tr}(f(\varvec{X}))\nonumber \\&\,\,\, \text {s.t.} \qquad \langle \varvec{A}_i, \varvec{X}\rangle =b_i \, \forall i \in [m], \ \varvec{X}=\varvec{Y}\varvec{X},\ \varvec{X}\in {\mathcal {K}}, \end{aligned}

where similarly to [37] we assume that $$f(\varvec{0})=\varvec{0}$$ to simplify the cut derivation procedure. Letting $$\varvec{\theta }$$ model the epigraph of f via $$\varvec{\theta } \succeq f(\varvec{X})$$ and $$\varvec{S}$$ be a subgradient of f at $$\bar{\varvec{X}}$$, we have:

\begin{aligned} \varvec{\theta } \succeq f(\bar{\varvec{X}})\varvec{Y}+\varvec{S}^{\top }(\varvec{X}-\bar{\varvec{X}}\varvec{Y}), \end{aligned}
(23)

which if $$f(\varvec{X})=\varvec{X}^2$$ —as discussed previously—reduces to

\begin{aligned} \varvec{\theta } \succeq \bar{\varvec{X}}(2\varvec{X}-\bar{\varvec{X}}\varvec{Y}), \end{aligned}

which is precisely the analog of perspective cuts in the vector case. Note, however, that imposing these cuts requires semidefinite constraints, which suggests they may not be as practically useful. For instance, the outer-approximation scheme for low-rank problems in our prior work [11] has a non-convex QCQOP master problem, which can currently only be solved using Gurobi, and Gurobi does not presently support semidefinite constraints.

We remark however that the inner product of Equation (23) with an arbitrary PSD matrix supplies a valid linear inequality. Two interesting cases of this observation arise when we take the inner product of the cut with either a rank-one matrix or the identity matrix.

Taking an inner product with the identity matrix supplies the inequality:

\begin{aligned} \textrm{tr}(\varvec{\theta }) \ge \langle f(\bar{\varvec{X}}), \varvec{Y}\rangle +\langle \varvec{S}, \varvec{X}-\bar{\varvec{X}}\varvec{Y}\rangle \quad \forall \varvec{Y} \in {\mathcal {Y}}^k_n. \end{aligned}
(24)

Moreover, by analogy to [10, Section 3.4], if we “project out” the $$\varvec{X}$$ variables by decomposing the problem into a master problem in $$\varvec{Y}$$ and subproblems in $$\varvec{X}$$ then this cut becomes the Generalized Benders Decomposition cuts derived in our prior work [11, Equation 17].
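The validity of the trace cut (24) for $$f(\varvec{X})=\varvec{X}^2$$ (with subgradient $$\varvec{S}=2\bar{\varvec{X}}$$) can be checked numerically at the feasible choice $$\varvec{\theta }=\varvec{X}^2$$; a NumPy sketch with a hypothetical incumbent $$\bar{\varvec{X}}$$:

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 4, 2
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
Y = U[:, :k] @ U[:, :k].T                        # Y in Y^k_n
Z = rng.standard_normal((n, n))
Xbar = (Z + Z.T) / 2                             # incumbent point Xbar
W = rng.standard_normal((n, n))
X = Y @ ((W + W.T) / 2) @ Y                      # feasible X with X = YX

# Cut (24) at theta = f(X) = X^2, with subgradient S = 2*Xbar:
lhs = np.trace(X @ X)
rhs = np.sum((Xbar @ Xbar) * Y) + np.sum((2 * Xbar) * (X - Xbar @ Y))
assert lhs >= rhs - 1e-9
```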

Alternatively, taking the inner product of the cut with a rank-one matrix $$\varvec{b}\varvec{b}^\top$$ gives:

\begin{aligned} {\varvec{b}^\top \varvec{\theta }\varvec{b}\ge \varvec{b}^\top \left( f(\bar{\varvec{X}})\varvec{Y}+\varvec{S}^{\top }(\varvec{X}-\bar{\varvec{X}}\varvec{Y})\right) \varvec{b}.} \end{aligned}

A further improvement is actually possible: rather than requiring that the semidefinite inequality in (23) holds with respect to a single rank-one matrix, we can require that it holds simultaneously on the subspace spanned by two directions $$\varvec{v}^1$$ and $$\varvec{v}^2$$. This supplies the second-order cone cut [64, Equation 8]:

\begin{aligned} \begin{pmatrix} \varvec{v}^1\\ \varvec{v}^2 \end{pmatrix}^\top \left( \varvec{\theta }-f(\bar{\varvec{X}})\varvec{Y}-\varvec{S}^{ \top }(\varvec{X}-\bar{\varvec{X}}\varvec{Y})\right) \begin{pmatrix} \varvec{v}^1 \\ \varvec{v}^2 \end{pmatrix}\succeq \begin{pmatrix} 0 &{}\quad 0\\ 0 &{}\quad 0\end{pmatrix}. \end{aligned}

The analysis in this section suggests that applying a perspective cut decomposition scheme out-of-the-box may be impractical, but leaves the door open to adaptations of the scheme which account for the projection matrix structure.

## 4 Examples and Perspective Relaxations

In this section, we apply the MPRT to several important low-rank problems, in addition to the previously discussed reduced-rank regression problem (Sect. 1.1). We also recall Theorem 2 to demonstrate that applying the MPRT to the spectral functions which feature in these problems actually gives the convex hull of the relevant substructures.

### 4.1 Matrix completion

Given a sample $$(A_{i,j} : (i, j) \in {\mathcal {I}} \subseteq [n] \times [n])$$ of a matrix $$\varvec{A} \in {\mathcal {S}}^n_+$$, the matrix completion problem is to reconstruct the entire matrix, by assuming $$\varvec{A}$$ is approximately low-rank [19]. Letting $$\mu , \gamma >0$$ be penalty multipliers, this problem admits the formulation:

\begin{aligned} \min _{\varvec{X} \in {\mathcal {S}}^n_+} \quad&\sum _{(i,j) \in {\mathcal {I}}}(X_{i,j}-A_{i,j})^2+\frac{1}{2\gamma }\Vert \varvec{X}\Vert _F^2+\mu \cdot \textrm{Rank}(\varvec{X}). \end{aligned}
(25)

Applying the MPRT to the $$\Vert \varvec{X}\Vert _F^2=\textrm{tr}(\varvec{X}^\top \varvec{X})$$ term demonstrates that this problem is equivalent to the mixed-projection problem:

\begin{aligned} \min _{\varvec{X}, \varvec{\theta } \in {\mathcal {S}}^n_+, \varvec{Y} \in {\mathcal {Y}}^n_n} \quad&\sum _{(i,j) \in {\mathcal {I}}}(X_{i,j}-A_{i,j})^2+\frac{1}{2\gamma }\textrm{tr}(\varvec{\theta })+\mu \cdot \textrm{tr}(\varvec{Y})\quad \text {s.t.} \quad \begin{pmatrix} \varvec{Y} &{}\quad \varvec{X}\\ \varvec{X} &{}\quad \varvec{\theta } \end{pmatrix} \succeq \varvec{0}, \end{aligned}

and relaxing $$\varvec{Y} \in {\mathcal {Y}}^n_n$$ to $$\varvec{Y} \in \textrm{Conv}({\mathcal {Y}}^n_n) =\{\varvec{Y} \in {\mathcal {S}}^n: \varvec{0} \preceq \varvec{Y} \preceq \mathbb {I}\}$$ supplies a valid relaxation. We now argue that this relaxation is often high-quality, by demonstrating that the MPRT supplies the convex envelope of $$t\ge \frac{1}{2\gamma }\Vert \varvec{X}\Vert _F^2+\mu \cdot \textrm{Rank}(\varvec{X})$$, via the following corollary to Theorem 2:

### Corollary 1

Let

\begin{aligned} {\mathcal {S}}=\left\{ (\varvec{Y}, \varvec{X}, \varvec{\theta }) \in {\mathcal {Y}}^k_n \times {\mathcal {S}}^n_+ \times {\mathcal {S}}^n: \varvec{\theta } \succeq \varvec{X}^\top \varvec{X}, u \varvec{Y} \succeq \varvec{X}\succeq \ell \varvec{Y} \right\} \end{aligned}

be a set where $$\ell , u \in \mathbb {R}_+$$. Then, this set’s convex hull is given by:

\begin{aligned} {\mathcal {S}}^c&=\left\{ (\varvec{Y}, \varvec{X}, \varvec{\theta }) \in {\mathcal {S}}^n_+ \times {\mathcal {S}}^n_+ \times {\mathcal {S}}^n:\right. \\&\qquad \quad \left. \varvec{Y} \preceq \mathbb {I}, \textrm{tr}(\varvec{Y}) \le k, u \varvec{Y} \succeq \varvec{X}\succeq \ell \varvec{Y}, \begin{pmatrix} \varvec{Y} &{}\quad \varvec{X} \\ \varvec{X}^\top &{}\quad \varvec{\theta }\end{pmatrix}\succeq \varvec{0} \right\} . \end{aligned}
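As a quick numerical illustration (ours, not from the paper) of the mixed-projection reformulation and of membership in $${\mathcal {S}}^c$$: for a rank-k PSD matrix $$\varvec{X}$$, taking $$\varvec{Y}$$ to be the orthogonal projection onto its range and $$\varvec{\theta }=\varvec{X}^2$$ makes the Schur-complement block matrix PSD, with $$\textrm{tr}(\varvec{\theta })=\Vert \varvec{X}\Vert _F^2$$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
U = np.linalg.qr(rng.standard_normal((n, k)))[0]   # orthonormal basis, n x k
X = U @ np.diag([3.0, 1.5]) @ U.T                  # PSD with rank k
Y = U @ U.T                                        # projection onto range(X)
theta = X @ X                                      # epigraph matrix modeling X^T X

# Checks corresponding to the constraints of S^c (with l = 0 and u large):
block = np.block([[Y, X], [X, theta]])
block_psd = np.linalg.eigvalsh(block).min() >= -1e-9
Y_contraction = np.linalg.eigvalsh(Y).max() <= 1 + 1e-9
trace_matches = abs(np.trace(theta) - np.linalg.norm(X, 'fro')**2) < 1e-9
```

Here $$\textrm{tr}(\varvec{Y})=k$$, so the trace constraint holds with equality.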

### 4.2 Tensor completion

A central problem in machine learning is to reconstruct a d-tensor $$\varvec{\mathscr {X}}$$ given a subsample of its entries $$(A_{i_1, \ldots i_d}: (i_1, \ldots i_d) \in {\mathcal {I}} \subseteq [n_1] \times [n_2] \times \ldots \times [n_d])$$, by assuming that the tensor is low-rank. Since even evaluating the rank of a tensor is NP-hard [50], a popular approach for solving this problem is to minimize the reconstruction error while constraining the ranks of different unfoldings of the tensor (see, e.g., [40]). After imposing Frobenius norm regularization and letting $$\Vert \cdot \Vert _{HS}=\sqrt{\sum _{i_{1}=1}^{n_{1}} \ldots \sum _{{i_d}=1}^{n_d} X_{i_1, \ldots , i_d}^2}$$ denote the (second-order cone representable) Hilbert-Schmidt norm of a tensor, this leads to optimization problems of the form:

\begin{aligned}&\min _{\mathscr {X} \in \mathbb {R}^{n_1 \times \ldots \times n_d}} {\sum _{({i_1, \ldots i_d}) \in {\mathcal {I}}}\left( \mathscr {A}_{i_1, \ldots i_d}-\mathscr {X}_{i_1, \ldots i_d}\right) ^2}+\sum _{i=1}^d \Vert \mathscr {X}_{(i)}\Vert _F^2 \quad \nonumber \\&\qquad \ \text {s.t.}\qquad \quad \quad \ \textrm{Rank}(\mathscr {X}_{(i)}) \le k \quad \forall i \in [d]. \end{aligned}
(26)
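For readers less familiar with unfoldings, the following numpy sketch (ours; the helper `unfold` is our own) computes the mode-i unfoldings of a 3-tensor and checks that the Hilbert-Schmidt norm of the tensor coincides with the Frobenius norm of every unfolding:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4, 5))   # a 3-tensor

def unfold(T, mode):
    """Mode-`mode` unfolding: bring that mode to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

unfoldings = [unfold(A, i) for i in range(A.ndim)]
hs_norm = np.sqrt((A**2).sum())      # Hilbert-Schmidt norm of the tensor
fro_norms = [np.linalg.norm(M, 'fro') for M in unfoldings]
```

The unfoldings have shapes 3x20, 4x15 and 5x12 here, and each has Frobenius norm equal to the tensor's Hilbert-Schmidt norm, since unfolding merely rearranges entries.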

Similarly to low-rank matrix completion, it is tempting to apply the MPRT to model the $$\varvec{X}_{(i)}^\top \varvec{X}_{(i)}$$ term for each unfolding. We now demonstrate that this supplies a tight approximation of the convex hull of the sum of the regularizers, via the following lemma (proof omitted; it follows in the spirit of [42, Lemma 4]):

### Lemma 3

\begin{aligned} \textrm{Let} \quad {\mathcal {Q}}=&\left\{ (\rho , \varvec{Y}_1, \ldots , \varvec{Y}_m, \varvec{X}_1, \ldots , \varvec{X}_m, \varvec{\theta }_1, \ldots , \varvec{\theta }_m): \right. \\ {}&\left. \rho \ge \sum _{i=1}^m q_i \textrm{tr}(\varvec{\theta }_i), (\varvec{X}_i, \varvec{Y}_i, \varvec{\theta }_i) \in {\mathcal {S}}_i \ \forall i \in [m] \right\} \end{aligned}

be a set where $$\ell _i, u_i, q_i \in \mathbb {R}_+ \ \forall i \in [m]$$, and $${\mathcal {S}}_i$$ is a set of the same form as $${\mathcal {S}}$$, but with $$\ell , u$$ replaced by $$\ell _i, u_i$$. Then, an extended formulation of this set’s convex hull is given by:

\begin{aligned} {\mathcal {Q}}^c=&\left\{ (\rho , \varvec{Y}_1, \ldots , \varvec{Y}_m, \varvec{X}_1, \ldots , \varvec{X}_m, \varvec{\theta }_1, \ldots , \varvec{\theta }_m):\right. \\ {}&\left. \rho \ge \sum _{i=1}^m q_i \textrm{tr}(\varvec{\theta }_i), (\varvec{X}_i, \varvec{Y}_i, \varvec{\theta }_i) \in {\mathcal {S}}^c_i \ \forall i \in [m] \right\} . \end{aligned}

Lemma 3 suggests that the MPRT may improve algorithms which aim to recover tensors of low slice rank. For instance, in low-rank tensor problems where (26) admits multiple local solutions, solving the convex relaxation coming from $${\mathcal {Q}}^c$$ and greedily rounding may give a high-quality initial point for an alternating minimization method such as the method of [31], and indeed allow such a strategy to return better solutions than if it were initialized at a random point.

Note however that Lemma 3 does not necessarily give the convex hull of the sum of the regularizers, since the regularization terms involve different slices of the same tensor and thus interact; see also [69] for a related proof that the tensor trace norm does not give the convex envelope of the sum of ranks of slices.

### 4.3 Low-rank factor analysis

A fundamental problem in statistics, psychometrics and economics is to decompose a covariance matrix $$\varvec{\Sigma } \in {\mathcal {S}}^n_+$$ into a low-rank matrix $$\varvec{X} \in {\mathcal {S}}^n_+$$ plus a diagonal matrix $$\varvec{\Phi } \in {\mathcal {S}}^n_+$$, as explored by [8] and references therein. This corresponds to solving:

\begin{aligned} \min _{\varvec{X}, \varvec{\Phi } \in {\mathcal {S}}^n_+} \Vert \varvec{\Sigma }-\varvec{\Phi }-\varvec{X}\Vert _q^q \ \text {s.t.}&\ \textrm{Rank}(\varvec{X}) \le k, \nonumber \\&\ \Phi _{i,j} =0, \forall i, j \in [n]: i \ne j, \ \Vert \varvec{X}\Vert _\sigma \le M \end{aligned}
(27)

where $$q\ge 1$$, $$\Vert \varvec{X} \Vert _q= \left( \sum _{i=1}^n \lambda _i(\varvec{X})^q\right) ^\frac{1}{q}$$ denotes the matrix q-norm, and we constrain the spectral norm of $$\varvec{X}$$ via a big-M constraint for the sake of tractability.
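As a concrete check (our sketch, not from the paper), the matrix q-norm of a symmetric matrix can be computed from its eigenvalues; it recovers the Frobenius norm at $$q=2$$ and approaches the spectral norm as q grows:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
X = B @ B.T                          # PSD, so its eigenvalues are non-negative

def matrix_q_norm(X, q):
    """Schatten-type q-norm of a symmetric matrix via its eigenvalues."""
    lam = np.linalg.eigvalsh(X)
    return (np.abs(lam)**q).sum()**(1.0 / q)

fro = np.linalg.norm(X, 'fro')       # q = 2 recovers the Frobenius norm
spec = np.linalg.norm(X, 2)          # q -> infinity approaches the spectral norm
```

For an n x n matrix, the q-norm exceeds the spectral norm by a factor of at most $$n^{1/q}$$, so the two are already close at moderate q.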

This problem’s objective involves minimizing $$\textrm{tr}\left( (\varvec{\Sigma }-\varvec{\Phi }-\varvec{X})^q\right)$$, and it is not immediately obvious how to either apply the technique in the presence of the $$\varvec{\Phi }$$ variables or, alternatively, separate out the $$\varvec{\Phi }$$ term and apply the MPRT to an appropriate ($$\varvec{\Phi }$$-free) substructure. To proceed, let us therefore first consider its scalar analog, obtaining the convex closure of the following set:

\begin{aligned} {\mathcal {T}}=&\{(x, y, z, t) \in \mathbb {R} \times \mathbb {R} \times \{0, 1\} \times \mathbb {R}^+: t \ge \vert x+y-d \vert ^q, \vert x\vert \le M,x=0 \ \text{ if } \ z=0\}, \end{aligned}

where $$d \in \mathbb {R}$$ and $$q \ge 1$$ are fixed constants, and we require that $$\vert x\vert \le M$$ for the sake of tractability. We obtain the convex closure via the following proposition (proof deferred to Appendix B.4):

### Proposition 5

The convex closure of the set $${\mathcal {T}}$$, $${\mathcal {T}}^c$$, is given by:

\begin{aligned} {\mathcal {T}}^c=&\bigg \{(x, y, z, t) \in \mathbb {R} \times \mathbb {R} \times [0, 1] \times \mathbb {R}^+: \\&\ \exists \beta \ge 0:t \ge \frac{\vert y-\beta -d(1-z)\vert ^q}{(1-z)^{q-1}}+\frac{\vert x+\beta -d z \vert ^q}{z^{q-1}}, \ \vert x \vert \le Mz\bigg \}. \end{aligned}

### Remark 10

To check that this set is indeed a valid convex relaxation, observe that if $$z=0$$ then $$x=0$$ and, for the second fraction to remain finite, $$\beta =-x=0$$, so that $$t \ge \vert y-d\vert ^q$$; while if $$z=1$$ then finiteness of the first fraction forces $$y=\beta$$, so that $$t \ge \vert x+y-d\vert ^q$$.

Observe that $${\mathcal {T}}^c$$ can be modeled using two power cones and one inequality constraint.
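To see Remark 10's two cases numerically, the following sketch (ours) implements the closure's lower bound under the usual perspective convention $$0/0=0$$, with illustrative values of q and d:

```python
import numpy as np

q, d = 2.0, 1.0                      # illustrative fixed constants

def term(a, s):
    """|a|^q / s^(q-1) with the perspective convention 0/0 = 0."""
    if s == 0.0:
        return 0.0 if a == 0.0 else np.inf
    return abs(a)**q / s**(q - 1)

def closure_bound(x, y, z, beta):
    return term(y - beta - d*(1 - z), 1 - z) + term(x + beta - d*z, z)

# z = 0 forces x = 0 and beta = 0; the bound reduces to |y - d|^q:
lb0 = closure_bound(0.0, 2.5, 0.0, 0.0)
# z = 1 with beta = y; the bound reduces to |x + y - d|^q:
lb1 = closure_bound(0.7, 2.5, 1.0, 2.5)
```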

Proposition 5 suggests that we can obtain high-quality convex relaxations for low-rank factor analysis problems via a judicious use of the matrix power cone. Namely, introduce an epigraph matrix $$\varvec{\theta }$$ to model the eigenvalues of $$(\varvec{\Sigma }-\varvec{\Phi }-\varvec{X})^q$$ and an orthogonal projection matrix $$\varvec{Y}_2$$ to model the span of $$\varvec{X}$$. This then leads to the following matrix power cone representable relaxation:

\begin{aligned} \min _{\varvec{X}, \varvec{\Phi }, \varvec{\theta }, \varvec{Y}_1, \varvec{Y}_2 \in {\mathcal {S}}^n_+, \varvec{\beta } \in {\mathcal {S}}^n} \quad&\textrm{tr}(\varvec{\theta })\\ \text {s.t.} \quad&\varvec{\theta } \succeq \varvec{Y}_1^{\frac{1-q}{2}}(\varvec{Y}_1^\frac{1}{2}\varvec{\Sigma }\varvec{Y}_1^\frac{1}{2}-\varvec{\beta }-\varvec{\Phi })^q\varvec{Y}_1^{\frac{1-q}{2}}\\&\qquad +\varvec{Y}_2^{\frac{1-q}{2}}(\varvec{Y}_2^\frac{1}{2}\varvec{\Sigma }\varvec{Y}_2^\frac{1}{2}+\varvec{\beta }-\varvec{X})^q\varvec{Y}_2^{\frac{1-q}{2}},\nonumber \\&\varvec{Y}_1+\varvec{Y}_2=\mathbb {I}, \textrm{tr}(\varvec{Y}_2) \le k, \Phi _{i,j}=0, \forall i, j \in [n]: i \ne j, \nonumber \\&\varvec{\Phi } \preceq \varvec{X}, \varvec{X}\preceq M\varvec{Y}_2, -\varvec{X}\preceq M\varvec{Y}_2. \nonumber \end{aligned}

### 4.4 Optimal experimental design

Letting $$\varvec{A} \in \mathbb {R}^{n \times m}$$, where $$m \ge n$$, be a matrix of linear measurements of the form $$y_i=\varvec{a}_i^\top \varvec{\beta }+\epsilon _i$$ from an experimental setting, the D-optimal experimental design problem (a.k.a. the sensor selection problem) is to pick $$k \le m$$ of these experiments in order to estimate $$\varvec{\beta }$$ as accurately as possible, by solving (see [48, 71] for a modern approach):

\begin{aligned} \max _{\varvec{z} \in \{0, 1\}^n: \varvec{e}^\top \varvec{z} \le k} \quad&\log \det _{\epsilon }\left( \sum _{i \in [n]}z_i \varvec{a}_i \varvec{a}_i^\top \right) , \end{aligned}
(28)

where we define $$\log \det _{\epsilon }(\varvec{X})= \log \det (\varvec{X}+\epsilon \mathbb {I})$$ for $$\epsilon >0$$ to be the pseudo log-determinant of a rank-deficient PSD matrix, which can be thought of as imposing an uninformative prior of importance $$\epsilon$$ on the experimental design process. Since $$\log \det (\varvec{X})=\textrm{tr}(\log (\varvec{X}))$$, a valid convex relaxation is given by:

\begin{aligned} \max _{\varvec{z} \in [0, 1]^n, \varvec{\theta } \in {\mathcal {S}}^n_+}\quad \textrm{tr}(\varvec{\theta }) \quad \text {s.t.} \quad \log \left( \varvec{A}\textrm{Diag}(\varvec{z})\varvec{A}^\top +\epsilon \mathbb {I}\right) \succeq \varvec{\theta }, \end{aligned}

which can be modeled using the quantum relative entropy cone, via $$(-\varvec{\theta }, \mathbb {I}, \varvec{A}\textrm{Diag}(\varvec{z})\varvec{A}^\top +\epsilon \mathbb {I}) \in {\mathcal {K}}^{\text {rel, op}}_{\text {mat}}$$. This is equivalent to perhaps the most common relaxation of D-optimal design, as proposed by [15, Equation 7.2.6]. When the relaxation is written in terms of the quantum relative entropy cone, however, the identity matrix in its second argument suggests that it leaves something “on the table”.
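To make the pseudo log-determinant concrete, it can be evaluated either via a standard log-determinant of $$\varvec{X}+\epsilon \mathbb {I}$$ or by summing $$\log (\lambda _i+\epsilon )$$ over the eigenvalues of $$\varvec{X}$$, reflecting $$\log \det (\varvec{X})=\textrm{tr}(\log (\varvec{X}))$$. The following numpy sketch (ours) checks that the two agree on a rank-deficient PSD matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, eps = 5, 2, 1e-6
U = rng.standard_normal((n, k))
X = U @ U.T                                      # rank-deficient PSD matrix

sign, logdet = np.linalg.slogdet(X + eps * np.eye(n))   # log det_eps(X)
lam = np.linalg.eigvalsh(X)
via_eigs = np.log(lam.clip(min=0.0) + eps).sum()        # tr(log(X + eps*I))
```

The zero eigenvalues each contribute $$\log (\epsilon )$$, which is exactly the $$(n-\textrm{tr}(\varvec{Y}))\log (\epsilon )$$ correction appearing in (29).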

In this direction, let us apply the MPRT. Observe that $$\varvec{X}:=\sum _{i \in [n]} z_i \varvec{a}_i \varvec{a}_i^\top$$ is a matrix of rank at most k, and thus at an optimal solution to the original problem there is some orthogonal projection matrix $$\varvec{Y}$$ such that $$\varvec{X}=\varvec{Y}\varvec{X}$$. Therefore, we can take the perspective function of $$f(\varvec{X})=\log (\varvec{X}{+\epsilon \mathbb {I}})$$, and thereby obtain the following valid—and potentially much tighter when $$k<n$$—convex relaxation:

\begin{aligned} \max _{\varvec{z} \in [0, 1]^n, \varvec{\theta }, \varvec{Y} \in {\mathcal {S}}^n_+}\quad&\textrm{tr}(\varvec{\theta })+(n-\textrm{tr}(\varvec{Y}))\log (\epsilon ) \nonumber \\ \text {s.t.} \quad&\varvec{Y}^\frac{1}{2}\log \left( {\varvec{Y}^{-\frac{1}{2}}\varvec{A}\textrm{Diag}(\varvec{z})\varvec{A}^\top \varvec{Y}^{-\frac{1}{2}} +\epsilon \mathbb {I}} \right) \varvec{Y}^\frac{1}{2}\succeq \varvec{\theta }, \varvec{Y} \preceq \mathbb {I}, \textrm{tr}(\varvec{Y}) \le k,\nonumber \\ \end{aligned}
(29)

which can be modeled via the quantum relative entropy cone: $$(-\varvec{\theta }, \varvec{Y}, \varvec{A}\textrm{Diag}(\varvec{z})\varvec{A}^\top +\epsilon \varvec{Y}) \in {\mathcal {K}}^{\text {rel, op}}_{\text {mat}}$$. We now argue that this relaxation is high-quality, by demonstrating that the MPRT supplies the convex envelope of $$t \ge -\log \det _\epsilon (\varvec{X})$$ under a low-rank constraint, via the following corollary to Theorem 2:

### Corollary 2

\begin{aligned} \textrm{Let} \quad {\mathcal {S}}=\left\{ \varvec{X} \in {\mathcal {S}}^n_+ : t \ge -\log \det _\epsilon (\varvec{X}), \textrm{Rank}(\varvec{X}) \le k \right\} \end{aligned}

be a set where $$\epsilon , k,t$$ are fixed. Then, this set’s convex hull is:

\begin{aligned} {\mathcal {S}}^c=&\bigg \{(\varvec{Y}, \varvec{X}) \in {\mathcal {S}}^n_+ \times {\mathcal {S}}^n_+: \varvec{0} \preceq \varvec{Y} \preceq \mathbb {I}, \textrm{tr}(\varvec{Y}) \le k, \\&t \ge - \textrm{tr}(\varvec{Y}^\frac{1}{2}\log _{\epsilon }(\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}) \varvec{Y}^\frac{1}{2})-(n-\textrm{tr}(\varvec{Y}))\log (\epsilon )\bigg \}.\nonumber \end{aligned}

### Remark 11

Observe that (29)’s relaxation is not useful in the over-determined regime where $$k \ge n$$, since setting $$\varvec{Y}=\mathbb {I}$$ recovers (28)’s Boolean relaxation, which is considerably cheaper to optimize over. Accordingly, we only consider the under-determined regime in our experiments.

### 4.5 Non-negative matrix optimization

Many important problems in combinatorial optimization, statistics and computer vision (see, e.g., [17]) reduce to optimizing over the space of low-rank matrices with non-negative factors. An important special case is when we would like to find the low-rank completely positive matrix $$\varvec{X}$$ which best approximates (in a least-squares sense) a given matrix $$\varvec{A} \in {\mathcal {S}}^n_+$$, i.e., perform non-negative principal component analysis. Formally, we have the problem:

\begin{aligned} \min _{\varvec{X} \in {\mathcal {C}}^n_+: \textrm{Rank}(\varvec{X}) \le k} \quad&\Vert \varvec{X}-\varvec{A}\Vert _F^2, \end{aligned}
(30)

where $${\mathcal {C}}^n_+:=\{\varvec{U}\varvec{U}^\top : \varvec{U} \in \mathbb {R}^{n \times n}_{+}\}$$ denotes the cone of $$n \times n$$ completely positive matrices.
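As a small illustration (ours, not from the paper): every completely positive matrix is doubly non-negative, i.e., PSD and entrywise non-negative, which is exactly the structure the doubly non-negative cone exploits as an outer approximation:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 2
U = rng.random((n, k))               # entrywise non-negative factors
X = U @ U.T                          # completely positive, with rank k here

is_psd = np.linalg.eigvalsh(X).min() >= -1e-9
is_entrywise_nonneg = (X >= 0).all()
```

The converse fails for $$n \ge 5$$: doubly non-negative matrices need not be completely positive, which is why rounding a doubly non-negative solution back to a completely positive one is non-trivial.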

Applying the MPRT to the strongly convex $$\frac{1}{2}\Vert \varvec{X}\Vert _F^2$$ term in the objective therefore yields the following completely positive program:

\begin{aligned} \min _{\varvec{X} \in {\mathcal {C}}^n_+, \varvec{Y}, \varvec{\theta } \in {\mathcal {S}}^n} \quad&\frac{1}{2}\textrm{tr}(\varvec{\theta })-\langle \varvec{X}, \varvec{A}\rangle +\frac{1}{2}\Vert \varvec{A}\Vert _F^2\ \text {s.t.}\ \varvec{Y} \preceq \mathbb {I}, \ \textrm{tr}(\varvec{Y})\le k,\ \begin{pmatrix} \varvec{Y} &{} \varvec{X}\\ \varvec{X}^\top &{} \varvec{\theta }\end{pmatrix} \in {\mathcal {S}}^{2n}_+. \end{aligned}
(31)

Interestingly, since (31)’s reformulation has a linear objective, some extreme point of its relaxation is optimal. This means we can relax the requirement that $$\varvec{Y}$$ is a projection matrix without loss of optimality, so that the computational complexity of the problem is concentrated entirely in the completely positive cone. Unfortunately, completely positive optimization is itself intractable. Nonetheless, it can be approximated by replacing the completely positive cone with the doubly non-negative cone, $${\mathcal {S}}^n_+ \cap \mathbb {R}^{n \times n}_{+}$$. Namely, we instead solve

\begin{aligned}&\min _{\varvec{X} \in {\mathcal {S}}^n_+ \cap \mathbb {R}^{n \times n}_{+}, \varvec{Y}, \varvec{\theta } \in {\mathcal {S}}^n} \ \frac{1}{2}\textrm{tr}(\varvec{\theta })-\langle \varvec{X}, \varvec{A}\rangle +\frac{1}{2}\Vert \varvec{A}\Vert _F^2\ \nonumber \\&\qquad \qquad \ \text {s.t.} \qquad \qquad \begin{pmatrix} \varvec{Y} &{} \varvec{X}\\ \varvec{X}^\top &{} \varvec{\theta }\end{pmatrix} \in {\mathcal {S}}^{2n}_+,\ \varvec{Y} \preceq \mathbb {I}, \ \textrm{tr}(\varvec{Y}) \le k. \end{aligned}
(32)

Unfortunately, rounding a solution to (32) to obtain a completely positive $$\varvec{X}$$ is non-trivial. Indeed, according to [41], there is currently no effective mechanism for rounding doubly non-negative programs. Nonetheless, as we shall see in our numerical results, there are already highly effective heuristic methods for completely positive matrix factorization, and combining our relaxation with such a procedure offers certificates of near optimality in a tractable fashion.

### Remark 12

If $$\varvec{X}=\varvec{D}\varvec{\Pi }$$ is a monomial matrix, i.e., decomposable as the product of a diagonal matrix $$\varvec{D}$$ and a permutation matrix $$\varvec{\Pi }$$, as occurs in binary optimization problems such as k-means clustering problems among others (c.f. [63]), then it follows that $$(\varvec{X}^\top \varvec{X})^\dag \ge \varvec{0}$$ (see [66]) and thus $$\varvec{Y}:=\varvec{X}(\varvec{X}^\top \varvec{X})^\dag \varvec{X}^\top$$ is elementwise non-negative. In this case, the doubly non-negative relaxation (32) should be strengthened by requiring that $$\varvec{Y} \ge \varvec{0}$$.

## 5 Numerical Results

In this section, we evaluate the algorithmic strategies derived in the previous section, implemented in Julia 1.5 using JuMP.jl 0.21.6, with Mosek 9.1 used to solve the conic problems considered here. Except where indicated otherwise, all experiments were performed on an Intel Xeon E5–2690 v4 2.6GHz CPU core with 32 GB RAM. To bridge the gap between theory and practice, we have made our code freely available on GitHub at github.com/ryancorywright/MatrixPerspectiveSoftware.

### 5.1 Reduced rank regression

In this section, we compare our convex relaxations for reduced rank regression developed in the introduction and laid out in (6)–(7)—which we refer to as “Persp” and “DCL” respectively—against the nuclear norm estimator proposed by [57] (“NN”), who solve

\begin{aligned} \min _{\varvec{\beta } \in \mathbb {R}^{p \times n}} \quad \frac{1}{2m}\Vert \varvec{Y}-\varvec{X}\varvec{\beta }\Vert _F^2+\frac{1}{2\gamma }\Vert \varvec{\beta }\Vert _F^2+\mu \Vert \varvec{\beta }\Vert _*. \end{aligned}
(33)

Similarly to [57], we attempt to recover rank-$$k_{true}$$ estimators $$\varvec{\beta }_{\text {true}}=\varvec{U}\varvec{V}^\top$$, where each entry of $$\varvec{U} \in \mathbb {R}^{p \times k_{true}}, \varvec{V} \in \mathbb {R}^{n \times k_{true}}$$ is i.i.d. standard Gaussian $${\mathcal {N}}(0,1)$$, the matrix $$\varvec{X} \in \mathbb {R}^{m \times p}$$ contains i.i.d. standard Gaussian $${\mathcal {N}}(0,1)$$ entries, $$\varvec{Y}=\varvec{X}\varvec{\beta }+\varvec{E}$$, and $$E_{i,j} \sim {\mathcal {N}}(0, \sigma )$$ injects a small amount of i.i.d. noise. We set $$n=p=50, k=10$$, $$\gamma =10^6$$, $$\sigma =0.05$$ and vary m. To ensure a fair comparison, we cross-validate $$\mu$$ for both of our relaxations and [57]’s approach so as to minimize the MSE on a validation set. For each m, we evaluate 20 different values of $$\mu$$, distributed uniformly in logspace between $$10^{-4}$$ and $$10^{4}$$, across 50 random instances for our convex relaxations, and report on 100 different random instances with the “best” $$\mu$$ for each method and each m.
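The synthetic data generation just described can be sketched as follows (ours; m = 100 is an illustrative value, since the experiment varies m):

```python
import numpy as np

rng = np.random.default_rng(6)
n = p = 50
m, k_true, sigma = 100, 10, 0.05     # m = 100 is illustrative only

U = rng.standard_normal((p, k_true))
V = rng.standard_normal((n, k_true))
beta_true = U @ V.T                  # rank-k_true coefficient matrix
X = rng.standard_normal((m, p))      # i.i.d. Gaussian design matrix
Y = X @ beta_true + sigma * rng.standard_normal((m, n))   # noisy responses
```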

Rank recovery and statistical accuracy: Figures 1a–c report the relative accuracy ($$\Vert \varvec{\beta }_{\text {est}}-\varvec{\beta }_{\text {true}}\Vert _F/\Vert \varvec{\beta }_{\text {true}}\Vert _F$$), the rank (i.e., the number of singular values of $$\varvec{\beta }_{\text {est}}$$ which exceed $$10^{-4}$$), and the out-of-sample MSE $$\Vert \varvec{X}_{\text {new}}\varvec{\beta }_{\text {est}}-\varvec{Y}_{\text {new}}\Vert _F^2$$ (normalized by the out-of-sample MSE of the ground truth, $$\Vert \varvec{X}_{\text {new}}\varvec{\beta }_{\text {true}}-\varvec{Y}_{\text {new}}\Vert _F^2$$). Results are averaged over 100 random instances per value of m. We observe that—even though we did not supply the true rank of the optimal solution in our formulation—Problem (7)’s relaxation returns solutions of the correct rank ($$k_{true}=10$$) and better MSE/accuracy, while our more “naive” perspective relaxation (6) and the nuclear norm approach (33) return solutions of higher rank and lower accuracy. This suggests that (7)’s formulation should be considered as a more accurate estimator for reduced rank problems, and empirically confirms that the MPRT can lead to significant improvements in statistical accuracy.

Scalability w.r.t. m: Figure 1d reports the average time for Mosek to converge to an optimal solution (over 100 random instances per m). Surprisingly, although (7) is a stronger relaxation than (6), it is one to two orders of magnitude faster to solve than (6) and (33). The relative scalability of (7)’s formulation as m—the number of observations—increases can be explained by the fact that (7) involves a linear inner product of the Gram matrix $$\varvec{X}^\top \varvec{X}$$ with a semidefinite matrix $$\varvec{B}$$ (the size of which does not vary with m), while Problems (6) and (33) involve a quadratic inner product $$\langle \varvec{\beta }\varvec{\beta }^\top , \varvec{X}^\top \varvec{X}\rangle$$ which must be modeled using a rotated second-order cone constraint (the size of which depends on m), since modern conic solvers such as Mosek do not allow quadratic objective terms and semidefinite constraints to be simultaneously present (if they did, we believe all three formulations would scale similarly).

Scalability w.r.t. p: Next, we evaluate the scalability of all three approaches in terms of their solve times and peak memory usage (measured using the slurm command MaxRSS) as $$n=p$$ increases. Figure 2 depicts the average time to converge to an optimal solution (a) and the peak memory consumption (b) of each method as we vary $$n=p$$ with $$m=n$$, $$k=10$$, $$\gamma =10^6$$, each $$\mu$$ fixed to the average cross-validated value found in the previous experiment, a peak memory budget of 120GB, a runtime budget of 12 hours, and otherwise the same experimental setup as previously (averaged over 20 random instances per n). We observe that (7)’s relaxation is dramatically more scalable than the other two approaches: it can solve problems of nearly twice the size (4 times as many variables), and it solves problems of a similar size in substantially less time and with substantially less peak memory (40s vs. 1000s when $$n=100$$). All in all, the proposed relaxation (7) appears to be the best of the three methods considered.

### 5.2 Non-negative matrix factorization

In this section, we benchmark the quality of our dual bound for non-negative matrix factorization laid out in Sect. 4.5 by using the non-linear reformulation strategy proposed by [18] (alternating least squares or ALS) to obtain upper bounds. Namely, we obtain upper bounds by solving for local minima of the problem

\begin{aligned} \min _{\varvec{U} \in \mathbb {R}^{n \times k}_{+}} \quad&\Vert \varvec{U}\varvec{U}^\top -\varvec{A}\Vert _F^2. \end{aligned}
(34)

In our implementation of ALS, we obtain a local minimum by introducing a dummy variable $$\varvec{V}$$ which equals $$\varvec{U}$$ at optimality and alternating between solving the following two problems

\begin{aligned} \varvec{U}_{t+1}=\arg \min _{\varvec{U} \in \mathbb {R}^{n \times k}_{+}} \quad&\Vert \varvec{U}\varvec{V}_t^\top -\varvec{A}\Vert _F^2+\rho _t \Vert \varvec{U}-\varvec{V}_t\Vert _F^2, \end{aligned}
(35)
\begin{aligned} \varvec{V}_{t+1}=\arg \min _{\varvec{V} \in \mathbb {R}^{n \times k}_{+}} \quad&\Vert \varvec{U}_t\varvec{V}^\top -\varvec{A}\Vert _F^2+\rho _t \Vert \varvec{U}_t-\varvec{V}\Vert _F^2, \end{aligned}
(36)

where we set $$\rho _t=\min (10^{-4}\times 2^{t-1}, 10^5)$$ at the tth iteration in order that the final matrix is positive semidefinite, as advocated in [5, Section 5.2.3] (we cap $$\rho _t$$ to avoid numerical instability). We iterate over solving these two problems from a random initialization point $$\varvec{V}_{0}$$—where each $$V_{0,i,j}$$ is i.i.d. standard uniform—until either the objective value between iterations does not change by $$10^{-4}$$ or we exceed the maximum number of allowable iterations, which we set to 100.
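A crude numpy sketch of this ALS scheme (ours; the exact non-negative least-squares subproblems are replaced here by unconstrained closed-form solves followed by clipping at zero, and we index $$\rho _t$$ from $$t=0$$):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 20, 3
U_true = rng.random((n, k))
A = U_true @ U_true.T                # exactly factorizable synthetic target

V = rng.random((n, k))               # random initialization V_0
prev_obj = np.inf
for t in range(100):
    rho = min(1e-4 * 2.0**t, 1e5)    # growing coupling penalty, capped
    # Unconstrained minimizer of ||U V^T - A||_F^2 + rho*||U - V||_F^2, clipped to >= 0
    U = np.maximum((A @ V + rho * V) @ np.linalg.inv(V.T @ V + rho * np.eye(k)), 0.0)
    V = np.maximum((A.T @ U + rho * U) @ np.linalg.inv(U.T @ U + rho * np.eye(k)), 0.0)
    obj = np.linalg.norm(U @ V.T - A, 'fro')**2
    if abs(prev_obj - obj) < 1e-4:   # stop once the objective stabilizes
        break
    prev_obj = obj

residual = np.linalg.norm(U @ U.T - A, 'fro') / np.linalg.norm(A, 'fro')
```

The closed-form updates follow from setting the gradient of each regularized subproblem to zero; the clipping step is a heuristic stand-in for an exact non-negative least-squares solver.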

To generate problem instances, we let $$\varvec{A}=\varvec{U}\varvec{U}^\top +\varvec{E}$$ where $$\varvec{U} \in \mathbb {R}^{n \times k_{true}}$$, each $$U_{i,j}$$ is uniform on [0, 1], $$E_{i,j} \sim {\mathcal {N}}(0, 0.0125 k_{true})$$, and set $$A_{i,j}=0$$ if $$A_{i,j}<0$$. We set $$n=50, k_{true}=10$$. We use the ALS heuristic to compute a feasible solution $$\varvec{X}$$ and an upper-bound on the problem’s objective value. By comparing it with the lower bound derived from our MPRT, we can assess the sub-optimality of the heuristic solution, which previously lacked optimality guarantees.

Figure 3 depicts the average relative in-sample error of the heuristic ($$\Vert \varvec{X}-\varvec{A}\Vert _F/\Vert \varvec{A}\Vert _F$$) and the relative bound gap—(UB-LB)/UB—as we vary the target rank, averaged over 100 random synthetic instances. We observe that the method is most accurate and has the lowest error when k is set to $$k_{true}=10$$, which confirms that the method can recover solutions of the correct rank. In addition, by combining the solution from ALS with our lower bound, we can compute a duality gap and certify that the heuristic solution is within $$0\%{-}3\%$$ of optimality, with the gap peaking at $$k = k_{true}$$ and stabilizing as $$k \rightarrow n$$. This echoes similar findings in k-means clustering and alternating current optimal power flow problems, where the SDO relaxation need not be near-tight in theory but is nonetheless nearly exact in practice [51, 63]. Further, this suggests our convex relaxation may be a powerful tool for providing optimality gaps for non-negative matrix factorization heuristics, and particularly for detecting when they are performing well or can be further improved.

Figure 4 reports the time needed to compute the upper and lower bounds as we vary the target rank.

### 5.3 Optimal experimental design

In this section, we benchmark our dual bound for D-optimal experimental design (29) against the Boolean relaxation of (28) and a greedy submodular maximization approach, in terms of both bound quality and the ability of all three approaches to generate high-quality feasible solutions. We round both relaxations to generate feasible solutions greedily, by setting the k largest $$z_i$$’s in a continuous relaxation to 1. For the submodular maximization approach, we initialize $${\mathcal {S}}$$ as the empty set and iteratively take

\begin{aligned} {\mathcal {S}} \leftarrow {\mathcal {S}} \cup \{j\}: j \in \arg \max _{i \in [n]\backslash {\mathcal {S}}}\left\{ \log \det _\epsilon \left( \sum _{l \in {\mathcal {S}}} \varvec{a}_l\varvec{a}_l^\top +\varvec{a}_i\varvec{a}_i^\top \right) \right\} . \end{aligned}

Interestingly, both approaches enjoy rigorous approximation guarantees: see [48, 71] for the greedy rounding approach and [58] for the submodular maximization approach.
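A compact sketch (ours, with illustrative dimensions; rows of A play the role of the candidate experiments $$\varvec{a}_i$$) of the greedy submodular selection rule above:

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, k, eps = 12, 6, 4, 1e-6
A = rng.standard_normal((n, m)) / np.sqrt(n)   # row i is candidate experiment a_i

def logdet_eps(M, eps=eps):
    """Pseudo log-determinant log det(M + eps*I)."""
    return np.linalg.slogdet(M + eps * np.eye(M.shape[0]))[1]

selected = []
for _ in range(k):
    # Add the experiment with the largest marginal gain in log det_eps
    gains = {j: logdet_eps(sum(np.outer(A[l], A[l]) for l in selected + [j]))
             for j in range(n) if j not in selected}
    selected.append(max(gains, key=gains.get))
```

Each iteration scans the remaining candidates and keeps the one maximizing the pseudo log-determinant of the accumulated information matrix.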

We benchmark all methods in terms of their performance on synthetic D-optimal experimental design problems, where we let $$\varvec{A} \in \mathbb {R}^{n \times m}$$ be a matrix with i.i.d. $${\mathcal {N}}(0, \frac{1}{\sqrt{n}})$$ entries. We set $$n=20, m=10, \epsilon =10^{-6}$$ and vary $$k <m$$ over 20 random instances. Table 3 depicts the average relative bound gap, objective values, and runtimes for all 3 methods (we use the lower bound from (28)’s relaxation to compute the submodular bound gap). Note that all results for this experiment were generated on a standard MacBook Pro laptop with a 2.9GHz 6-core Intel i9 CPU and 16GB DDR4 RAM, using CVX version 1.22, Matlab R2021a, and Mosek 9.1. Moreover, we optimize over (29)’s relaxation using the CVXQuad package developed by [33].

Relaxation quality: We observe that (29)’s relaxation is dramatically stronger than (28)’s, offering bound gaps on the order of $$0\%{-}3\%$$ when $$k \le 7$$, rather than gaps of $$90\%$$ or more. This confirms the efficacy of the MPRT, and demonstrates the value of accounting for low-rank constraints when designing convex relaxations, even when they are not obviously present.

Scalability: We observe that (29)’s relaxation is around two orders of magnitude slower than the other proposed approaches, largely because semidefinite approximations of quantum relative entropy are expensive, but is still tractable for moderate sizes. We believe, however, that the relaxation would scale significantly better if it were optimized over using an interior point method for non-symmetric cones (see, e.g., [49, 72]), or an alternating minimization approach (see [34]). As such, (29)’s relaxation is potentially useful at moderate problem sizes with off-the-shelf software, or at larger problem sizes with problem-specific techniques such as alternating minimization.

## 6 Conclusion

In this paper, we introduced the Matrix Perspective Reformulation Technique (MPRT), a new technique for deriving tractable and often high-quality relaxations of a wide variety of low-rank problems. We also invoked the technique to derive the convex hulls of some frequently-studied low-rank sets, and provided examples where the technique proves useful in practice. This is significant and potentially useful to the community, because substantial progress on producing tractable upper bounds for low-rank problems has been made over the past decade, but until now almost no progress on tractable lower bounds has followed.

Future work could take three directions: (1) automatically detecting structures where the MPRT could be applied, as is already done for perspective reformulations in the MIO case by CPLEX and Gurobi, (2) developing scalable semidefinite-free techniques for solving the semidefinite relaxations proposed in this paper, and (3) combining the ideas in this paper and in our prior work [11] with custom branching strategies to solve low-rank problems to optimality at scale.