The effect of smooth parametrizations on nonconvex optimization landscapes

We develop new tools to study landscapes in nonconvex optimization. Given one optimization problem, we pair it with another by smoothly parametrizing the domain. This is either for practical purposes (e.g., to use smooth optimization algorithms with good guarantees) or for theoretical purposes (e.g., to reveal that the landscape satisfies a strict saddle property). In both cases, the central question is: how do the landscapes of the two problems relate? More precisely: how do desirable points such as local minima and critical points in one problem relate to those in the other problem? A key finding in this paper is that these relations are often determined by the parametrization itself, and are almost entirely independent of the cost function. Accordingly, we introduce a general framework to study parametrizations by their effect on landscapes. The framework enables us to obtain new guarantees for an array of problems, some of which were previously treated on a case-by-case basis in the literature. Applications include: optimizing low-rank matrices and tensors through factorizations; solving semidefinite programs via the Burer-Monteiro approach; training neural networks by optimizing their weights and biases; and quotienting out symmetries.


Introduction
We consider pairs of optimization problems (P) and (Q) as defined below, where E is a linear space, M is a smooth manifold, and ϕ : M → E is a smooth (over)parametrization of the search space X = ϕ(M) of (P):
min_{x ∈ X} f(x),   (P)
min_{y ∈ M} g(y), where g = f ∘ ϕ.   (Q)
We usually assume that f : E → R is smooth, hence so is g = f ∘ ϕ : M → R by composition.
Such pairs of problems (Q) and (P) arise in two scenarios (concrete examples follow):
(a) Our task is to minimize f on X as in (P), but we lack good algorithms to do so, e.g., because X lacks regularity. In this case, we choose a smooth parametrization ϕ of X, and use it to optimize over the smooth problem (Q) instead.
(b) Our task is to minimize g on M as in (Q), but its landscape is complex (e.g., due to symmetries). In this case, we factor g through a smooth map ϕ in the hope of revealing a problem (P) whose landscape is simpler and can be leveraged to analyze that of (Q).
In both cases, we run an optimization algorithm on the smooth problem (Q). This algorithm may be able to find desirable points y on M for (Q) (global or local minima, stationary points). However, in general such points need not map to desirable points ϕ(y) on X for (P). Indeed, nonlinear parametrizations may severely distort landscapes. The presence of such spurious points is significant even if the algorithm does not converge to them, as practical algorithms are only able to find approximately stationary points in finitely many iterations. Therefore, there typically is a region around spurious points in which algorithms terminate and return a point on X that is not close to a desirable point.
In this paper, we characterize the properties that the parametrization ϕ needs to satisfy for desirable points of (Q) to map to desirable points of (P), that is, we develop a general framework to relate the landscapes of pairs of problems of the above form. Our framework enables us to unify and strengthen the analysis of a wealth of parametrizations arising in applications, hitherto studied case-by-case.
For an example of scenario (a), consider minimizing a cost f over the set X = R^{m×n}_{≤r} of all m × n matrices of rank at most r. Unfortunately, standard algorithms running on (P) may converge to a non-stationary point because of the nonsmooth geometry of X [33, 37]. Instead of trying to solve (P) directly, it is common to parametrize X by the linear space M = R^{m×r} × R^{n×r} using the rank factorization ϕ(L, R) = LR^⊤, and to solve (Q) instead. The resulting problem (Q) requires minimizing a smooth cost function over a linear space; there are several algorithms that converge to a second-order stationary point for such problems. Furthermore, any second-order stationary point for (Q) maps under ϕ to a stationary point for (P) by [22, Thm. 1]. Thus, parametrization of X by ϕ gives us an algorithm converging to a stationary point for (P) by running standard algorithms on (Q), even though similarly reasonable algorithms may fail to produce a stationary point when applied directly to (P).
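The restriction to second-order stationary points in this example matters. The following minimal sketch, which assumes a hypothetical cost f(X) = ½‖X − A‖² with a random matrix A (the cost, the dimensions, and the helper name `grad_g` are illustrative choices, not taken from the paper), computes the gradient of g(L, R) = f(LR^⊤) by the chain rule. It suggests that (L, R) = (0, 0) is first-order stationary for (Q) whatever f is, although ∇f(0) is generally nonzero.

```python
import numpy as np

def grad_g(grad_f, L, R):
    """Euclidean gradient of g(L, R) = f(L @ R.T), obtained by the chain rule."""
    G = grad_f(L @ R.T)        # gradient of f at X = L R^T, shape (m, n)
    return G @ R, G.T @ L      # partial gradients with respect to L and R

m, n, r = 4, 3, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))

# Hypothetical cost (an assumption for illustration): f(X) = 0.5 * ||X - A||_F^2.
grad_f = lambda X: X - A

# Evaluate at the origin of the parameter space M = R^{m x r} x R^{n x r}.
L0, R0 = np.zeros((m, r)), np.zeros((n, r))
gL, gR = grad_g(grad_f, L0, R0)

grad_norm_at_origin = np.sqrt(np.sum(gL**2) + np.sum(gR**2))  # gradient of g at (0, 0)
grad_f_norm_at_zero = np.linalg.norm(grad_f(L0 @ R0.T))       # gradient of f at X = 0
```

Here `grad_norm_at_origin` vanishes because both partial gradients are products with a zero factor, while `grad_f_norm_at_zero` equals ‖A‖. First-order information on (Q) alone therefore cannot certify stationarity for (P) at X = 0, which is why the guarantee quoted above is stated for second-order stationary points.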
For an illustration of scenario (b), consider finding the smallest eigenvalue of a d × d symmetric matrix A, which is of the form (Q) with M the unit sphere in R^d and g(y) = y^⊤Ay. This problem is not convex, hence it could have bad local minima. Here is one way to reason that it does not (as is well known). If λ ∈ R^d denotes the vector of eigenvalues of A and U ∈ O(d) is an orthogonal matrix of eigenvectors satisfying A = U diag(λ) U^⊤ (both of which are unknown), define ϕ(y) = diag(U^⊤yy^⊤U) ∈ R^d and f(x) = λ^⊤x. It is easy to check that g = f ∘ ϕ and that X = ϕ(M) is the standard simplex in R^d. The resulting problem (P) is convex in this case, hence each of its stationary points is a global minimum. A corollary of the theory we develop in this paper is that any second-order stationary point for (Q), for any cost function f, maps to a stationary point for (P); see Example 4.14. Thus, we recover the well-known fact that any second-order stationary point for the eigenvalue problem (Q) is globally optimal. Even though the problem (P) cannot be solved directly in this case because f and ϕ are unknown, their mere existence can be used to show that the nonconvexity of (Q) is "benign". From this perspective, problem (P) reveals hidden convexity in problem (Q).
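The identity g = f ∘ ϕ and the claim that ϕ maps the sphere into the simplex can be checked numerically. The sketch below is only an illustration under assumptions: the matrix A is random, and the eigendecomposition is computed with `numpy.linalg.eigh`, whereas in the argument above λ and U are treated as unknown.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
B = rng.standard_normal((d, d))
A = (B + B.T) / 2                  # random symmetric matrix (illustrative choice)
lam, U = np.linalg.eigh(A)         # A = U diag(lam) U^T

def phi(y):
    # diag(U^T y y^T U) is the entrywise square of U^T y
    return (U.T @ y) ** 2

def f(x):
    return lam @ x                 # linear cost on the simplex

def g(y):
    return y @ A @ y               # Rayleigh quotient restricted to the unit sphere

y = rng.standard_normal(d)
y /= np.linalg.norm(y)             # a point on the unit sphere M
x = phi(y)                         # its image in X

gap = abs(g(y) - f(x))             # should vanish up to rounding
simplex_residual = abs(x.sum() - 1.0)
```

Both `gap` and `simplex_residual` are at rounding level, and the entries of `x` are nonnegative, consistent with X = ϕ(M) being the standard simplex.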

Contributions
We call the parametrization ϕ : M → X a (smooth) lift of X. As the above examples illustrate, understanding when lifts map certain desirable points for (Q) to desirable points for (P) yields guarantees for algorithms running on (Q). The relation between these two sets of desirable points has been studied for a variety of specific lifts; see "Prior Work" below. In this paper, we study this relation in general. Specifically, we ask and answer the following questions:
• Do global minima of (Q) map under ϕ to global minima of (P) for all cost functions f? The answer is yes because ϕ(M) = X, see Theorem 2.6.
• Do local minima of (Q) map to local minima of (P) for all continuous f ? If so, we say ϕ satisfies "local ⇒ local". The answer is yes if and only if ϕ is open, see Theorem 2.8. Openness is clearly sufficient, but proving its necessity requires substantial work. Some lifts of interest are not open, see Table 1.
• Do stationary points of (Q) map to stationary points of (P) for all differentiable f ? If so, we say ϕ satisfies "1 ⇒ 1", where "1" stands for "first-order". This property is rarely satisfied: unless all tangent cones of X are linear subspaces, for every smooth lift ϕ, there exists a (linear) f such that some stationary point for (Q) does not map to a stationary point for (P). See Theorem 2.10 and Corollary 2.22. Thus, for all but the simplest sets, all lifts are liable to introduce spurious stationary points.
• Do second-order stationary points of (Q) map to stationary points of (P) for all twice differentiable f? If so, we say ϕ satisfies "2 ⇒ 1". This property holds under some conditions on ϕ which we characterize completely, see Theorem 2.12. These conditions hold for many interesting examples, see Table 1. The "2 ⇒ 1" property can enable crisp conclusions as illustrated by the two example scenarios above.
• Understanding stationarity on X requires knowledge of its tangent cones. These can be hard to characterize. We show that in some cases it is possible to obtain an explicit expression for the tangent cones simultaneously with proving "1 ⇒ 1" and "2 ⇒ 1" for some lift of X , see Remark 3.5.
• Given a set X , how can we construct a smooth lift ϕ : M → X satisfying our desirable properties? When X is given as a preimage under a smooth map, as is the case in typical constrained optimization problems, we give a systematic construction of a lift ϕ : M → X . When the set M constructed in this way is a smooth manifold, we give conditions under which ϕ satisfies each of the above properties.
Our main definitions and results are given in Section 2.1. We use our framework to study several examples, including various lifts of low-rank matrix and tensor varieties and the Burer-Monteiro lift for smooth semidefinite programs (SDPs). See Table 1 for a summary of some of our results. In particular, our theory enables us to do the following.
• We prove in Corollary 4.13 that the Burer-Monteiro lift [10, 11, 9, 27] for smooth low-rank SDPs satisfies "2 ⇒ 1", that is, it maps second-order stationary points for (Q) to stationary points for (P), for any twice-differentiable cost function. Previously, this was only shown for rank-deficient second-order stationary points [27, Thm. 7]. We also derive an explicit expression for the tangent cones to smooth low-rank SDP domains (22).
• In the robotics and computer vision literature, the authors of [17] compose the Burer-Monteiro lift with a submersion to enable the use of robotics software. We study compositions of lifts in general in Section 2.4, and use this general theory to strengthen the guarantees of [17] in Example 2.44.
• In [22] and [32], the authors study several lifts for optimization over low-rank matrices, including the rank factorization lift mentioned above. For all these lifts, they show that first-order stationary points for (Q) may not map to stationary points for (P) but second-order stationary points for (Q) do. It is also shown in [32] that, for all of these lifts, local minima for (Q) may not map to local minima for (P). We extend the results of [22,32] using our unified framework, by characterizing precisely where each of our three desirable properties hold.
• In [36], the authors numerically observe that optimization over the factors of the singular value decomposition (SVD) performs poorly, but can be improved by allowing the middle factor to be non-diagonal. These observations may be explained by our framework. We characterize in Proposition 3.8 where each of our desirable properties holds for the SVD lift and its modification from [36]. In particular, we show that the SVD lift may introduce spurious local minima in the setting of [36], while its modification does not.
• A neural network architecture encodes a lift from a parameter space to a function space. The authors of [38] show that this lift may fail to be open, and that such failure may cause optimization algorithms to stagnate. Our theory gives another implication: since the lift is not open, it introduces spurious local minima for some cost functions.
• We show in Section 3.4 that none of the popular tensor factorization lifts satisfy "2 ⇒ 1", so second-order stationary points of (Q) need not map to stationary points of (P). This suggests that to find a stationary point over the set of bounded-rank tensors (for many notions of such rank), it is not enough to find a second-order stationary point over the factors in a rank-revealing factorization.
• The eigenvalue example above (scenario (b)) essentially lifts the simplex to the sphere by entrywise squaring. This lift is the topic of [35]. We extend their results beyond the case of convex f , as a particular case of our general construction of lifts in Section 4, see Corollary 4.13.

Prior Work
Lifts are ubiquitous. We already mentioned above the Burer-Monteiro approach to semidefinite programming [9,14], low-rank optimization [22,32,36], computer vision [17], and neural networks [38]. In [47,29], the authors compare the stationary points for (Q) to those of (P) for the case of linear neural networks and prove a special case of our Theorem 2.10 characterizing "1 ⇒ 1". The training of general neural networks, and risk minimization more generally, is naturally given in the form (Q), see [47,Appx. A.3] for example. This perspective on risk minimization has been exploited in works such as [5,6]. Other instances of lifts occur in robotics, specifically in inverse kinematics and trajectory planning [45,Chaps. 1,4]. There, variables corresponding to the joints of the robot parametrize the configurations that the robot can assume. Yet another class of lifts comes from Kostant's convexity theorem, extending the hidden convexity in the eigenvalue example from the beginning of this section to optimization of certain linear functions over certain Lie group orbits, see [30]. When X is a non-smooth algebraic variety (i.e., the zero set of polynomials), a smooth variety M and a polynomial map ϕ : M → X satisfying certain properties (specifically, properness and birationality) are called a resolution of singularities for X [23,Chap. 17], which always exists in our setting [23,Thm. 17.23]. However, this resolution of singularities may not satisfy any of the properties we consider, as the example of the nodal cubic shows (Example 2.14).
Finally, the composite optimization literature [12,34,19,7,49,4] studies algorithms for problems of the form (Q) where f is convex (and not necessarily differentiable) and ϕ is smooth. Our setting and aims are distinct however. We focus on algorithm-independent questions pertaining to the relation between desirable points for (Q) and (P).

Notation and basic definitions
Here we collect notation and standard definitions. Table 2 collects the notations and definitions for several sets used throughout the paper.
[Table 2: notation for sets used throughout the paper, including the bounded-rank matrices R^{m×n}_{≤r}.]
• E is a linear space endowed with an inner product ⟨·, ·⟩ and induced norm ‖·‖.
• M is a smooth Riemannian manifold endowed with a Riemannian metric, also denoted ⟨·, ·⟩, with its induced norm ‖·‖. The tangent space to M at y ∈ M is denoted T_y M.
• If S ⊆ E is a subspace of an inner product space E, we denote by Proj S (x) the orthogonal projection of x ∈ E onto S.
• X is endowed with its subspace topology from E, and M is endowed with its manifold topology.
• A neighborhood of a point x is a set that contains x in its interior.
• A smooth curve on M passing through y ∈ M with velocity v ∈ T_y M is a smooth map c : I → M satisfying c(0) = y and c′(0) = v, where I denotes an arbitrary open interval in R containing 0 (not always the same one).
• If c : I → M is a smooth curve on a Riemannian manifold M such that c(0) = y, then c ′′ (0) ∈ T y M denotes its intrinsic (Riemannian) acceleration. Accordingly, if γ = ϕ • c then γ ′′ (0) denotes its (standard) acceleration in the Euclidean space E.
• A cone is a set K ⊆ E such that v ∈ K implies αv ∈ K for all α > 0.
• The dual cone K* of a cone K is K* = {u ∈ E : ⟨u, v⟩ ≥ 0 for all v ∈ K}. We use the following properties throughout (see [18, Prop. 4.5] for proofs):
- The dual cone is always a closed convex cone.
- The bidual cone K** of K is equal to the closure of its convex hull, K** = cl(conv(K)).
- If K is a linear space, then its dual K* is equal to its orthogonal complement K^⊥.
• If ψ : N → M is a smooth map between smooth manifolds, its differential at z ∈ N is denoted Dψ(z) : T z N → T ψ(z) M.
• If g : M → R is a smooth function, its Riemannian gradient at y ∈ M is denoted ∇g(y) ∈ T_y M. It is the unique element of T_y M satisfying ⟨∇g(y), v⟩ = Dg(y)[v] for all v ∈ T_y M.
The Riemannian Hessian of g at y is a self-adjoint linear map denoted ∇ 2 g(y) : T y M → T y M, see [8, §8.11] for the definition.
If f : E → R is a smooth function on a linear space, the usual definitions of the directional derivative Df (x)[v], the gradient ∇f (x), and the Hessian ∇ 2 f (x) coincide with the above definitions specialized to M = E.
• We write S ⪰ 0 to denote a positive-semidefinite (PSD) matrix or a PSD self-adjoint linear map S, and we write S ≻ 0 to indicate S is positive-definite.

Smooth lifts and their properties
We aim to relate the landscapes of (P) and (Q). To this end, we define several types of points of interest. We begin with a preliminary definition.
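Definition 2.1. The tangent cone to X at x ∈ X is the set
T_x X = {v ∈ E : v = lim_{i→∞} (x_i − x)/τ_i for some sequences x_i ∈ X with x_i → x and τ_i > 0 with τ_i → 0}.
This is a closed (not necessarily convex) cone [42, Lem. 3.12].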
The cone T x X is also called the contingent or Bouligand tangent cone [15, §2.7]. If X is a smooth embedded submanifold of E near x, then T x X coincides with the usual tangent space to X at x, see [41,Ex. 6.8].
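A point x ∈ X is stationary for (P) if Df(x)[v] ≥ 0 for all v ∈ T_x X, or equivalently, if ∇f(x) ∈ (T_x X)*. In words, x is stationary if the cost function is non-decreasing to first order along all tangent directions at x. Local minima of (P) are stationary [42, Thm. 3.24].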
Next, we define a smooth lift and consider desirable points for (Q).
(c) A point y ∈ M is first-order stationary (or "1-critical") for (Q) if for each smooth curve c : I → M with c(0) = y, we have (g ∘ c)′(0) = 0.
We proceed to answer the questions raised in Section 1.

Main results
The connection between global minima of (Q) and (P) is straightforward.
Theorem 2.6. A point y ∈ M is a global minimum of (Q) if and only if x = ϕ(y) is a global minimum of (P).
Proof. Because ϕ(M) = X , we have inf y∈M g(y) = inf y∈M f (ϕ(y)) = inf x∈X f (x) =: p * . Therefore, y is a global minimum for (Q) iff g(y) = f (x) = p * which happens iff x is a global minimum for (P).
Computing global minima is often hard, so in practice we can only hope to find local minima, or even just stationary points. If x = ϕ(y) is a local minimum or stationary point for (P), then it is easy to show that y is, respectively, a local minimum or a stationary point for (Q): see Propositions 2.13 and 2.21. However, the converse is false in general. As the example of the nodal cubic (Example 2.14 below) shows, x need not be stationary even if y is a local minimum. Therefore, the following two properties are not automatically satisfied.
Definition 2.7. (a) The lift ϕ satisfies the "local ⇒ local" property at y ∈ M if, for all continuous f : X → R, if y is a local minimum for (Q) then x = ϕ(y) is a local minimum for (P). We say ϕ satisfies the "local ⇒ local" property if it does so at all y ∈ M.
(b) The lift ϕ satisfies the "k ⇒ 1" property at y for k = 1, 2 if for all k times differentiable f : X → R, if y is k-critical for (Q) then x = ϕ(y) is stationary for (P). We say ϕ satisfies the "k ⇒ 1" property if it does so at all y ∈ M.
We now characterize the above properties. The characterization of "local ⇒ local" below is easy to state. Proving that openness is sufficient for "local ⇒ local" is easy. Proof of its necessity requires substantial work, deferred to Appendix A. Our proof in the appendix provides the result in a more general, topological setting without using smoothness. Moreover, it provides (possibly new) conditions which are equivalent to openness and may be easier to check in some situations.
To characterize the "1 ⇒ 1" and "2 ⇒ 1" properties, we need to consider velocities and accelerations of curves on X obtained by pushing curves on M through the lift ϕ. To that end, we introduce the following definitions.
We write L ϕ y and Q ϕ y when we wish to emphasize the lift.
Of course, L y is simply the differential Dϕ(y), and is therefore linear and independent of the choice of curves c v . The value of Q y (v) does depend on the choice of c v but, importantly, this is inconsequential for our purposes: see Remark 2.39. We explain how to compute L y and Q y without explicitly choosing curves c v in Section 2.5.
Our characterization of "1 ⇒ 1" can now be concisely stated.
We further show that if ϕ satisfies "1 ⇒ 1" at y and x = ϕ(y) is a smooth point of X, that is, X is a smooth embedded submanifold of E near x, then ϕ satisfies "local ⇒ local" at y as well: see Proposition 2.24 below. Theorem 2.10 yields the following necessary, and quite restrictive, condition for "1 ⇒ 1".
Corollary 2.11. Unless all tangent cones of X are linear spaces, no smooth lift for X satisfies "1 ⇒ 1".
Corollary 2.11 shows that "1 ⇒ 1" usually does not hold at preimages of nonsmooth points of X (see Definition 2.23), whose tangent cones are rarely linear, so all lifts may introduce spurious critical points on M. Table 1 gives several examples of popular lifts of nonsmooth sets violating "1 ⇒ 1". Corollary 2.11 reveals that no smooth lift of these sets can satisfy "1 ⇒ 1". This is our main motivation for studying "2 ⇒ 1".
Our characterization of "2 ⇒ 1" is more involved. We therefore also give sufficient conditions that are easier to check in many examples, and a necessary condition that is easier to refute in others.
Theorem 2.12. Let ϕ : M → X be a smooth lift and fix y ∈ M. We have the following chain of implications for "2 ⇒ 1", where the cones W_y and B_y in E are defined in Definitions 2.27 and 2.32, respectively, and are expressed in terms of L_y and Q_y in Proposition 2.40.
We show that the first sufficient condition above is not necessary in Example 2.36, and that the necessary condition is not sufficient in Example 2.47. We conjecture that the second sufficient condition is not necessary either, see Conjecture 2.37 below.
In the remainder of this section, we prove the above theorems and explain how to compute the quantities appearing in them. We then apply them to a variety of examples in Section 3.

Local minima
In this section, we investigate the relationship between the local minima of (P) and those of (Q). Preimages of local minima on X are always local minima on M merely because ϕ is continuous.
Proposition 2.13. Let x be a local minimum for (P). Any y ∈ ϕ^{−1}(x) is a local minimum for (Q).
Proof. There exists a neighborhood U of x in X such that f(x) ≤ f(x′) for all x′ ∈ U. Since ϕ : M → X is continuous, the set V = ϕ^{−1}(U) is a neighborhood of y in M. Pick an arbitrary y′ ∈ V: it satisfies ϕ(y′) = x′ for some x′ ∈ U. Hence, g(y) = f(x) ≤ f(x′) = g(y′), i.e., y is a local minimum of (Q).
Unfortunately, lifting can introduce spurious local minima, that is, local minima for (Q) that exist only because of the lift and not because they were present in (P) to begin with.
Example 2.14 (Nodal cubic). Consider the nodal cubic (1) and its lift. Then the point y = (0, 0, 1) is a local minimum for g = f ∘ ϕ but ϕ(y) = (0, 0) is not even stationary for f. Indeed, we have (−1, 1)^⊤ ∈ T_{(0,0)} X and Df(0, 0)[(−1, 1)] = −2 < 0. The curves X and M together with the points x, y are illustrated in Figure 1.
To ensure that a lift does not introduce spurious local minima, we need to verify that it satisfies the "local ⇒ local" property (Definition 2.7(a)). We proceed to prove the easy direction of Theorem 2.8 stating that openness implies "local ⇒ local". The converse is more involved and is deferred to Appendix A. The proof of Theorem 2.8 shows that ϕ maps local minima of (Q) to local minima of (P) for all smooth f if and only if it does so for all f.
Proof of Theorem 2.8. Assume ϕ is open at y, and that y is a local minimum for (Q). Then, there exists a neighborhood V of y on M such that g(y) ≤ g(y′) for all y′ ∈ V. The set U = ϕ(V) is a neighborhood of x = ϕ(y) in X by openness of ϕ at y. Moreover, each x′ ∈ U is of the form x′ = ϕ(y′) for some y′ ∈ V. Therefore, f(x) = g(y) ≤ g(y′) = f(x′) for all x′ ∈ U, that is, x is a local minimum of (P). For the converse, see Theorem A.2.
We note in passing that all continuous, surjective, open maps are quotient maps, hence if ϕ is a smooth lift of X satisfying "local ⇒ local" then it is a quotient map from M to X.
Remark 2.15. We introduce an equivalent condition for openness of ϕ at y in Appendix A that we call the Subsequence Lifting Property (SLP), and which is sometimes easier to check. Specifically, SLP is satisfied at y if for any sequence (x_i)_{i∈N} ⊆ X converging to x = ϕ(y), there is a subsequence indexed by (i_j)_{j∈N} ⊆ N and points y_{i_j} ∈ M converging to y such that ϕ(y_{i_j}) = x_{i_j} for all j ∈ N.
Here is an example illustrating the usefulness of SLP. We give more examples in Section 3.   [38,Cor. 4.3]. Our Theorem 2.8 implies that training such a neural network by running an optimization algorithm on its weights and biases might get stuck in a local minimum that does not parametrize a local minimum in function space. In that case, a different parametrization of the same function might not be a local minimum for (Q).

Stationary points
In this section, we investigate the relationship between the first- and second-order stationary points for (Q) and (first-order) stationary points for (P). To that end, we begin by relating the (Riemannian) gradient and Hessian of g = f ∘ ϕ to the (Euclidean) counterparts of f.
Definition 2.18. For any w ∈ E, define ϕ_w : M → R by ϕ_w(y) = ⟨w, ϕ(y)⟩.
Lemma 2.19. For any twice differentiable cost f : E → R, any y ∈ M, and x = ϕ(y), we have
∇g(y) = L_y^*(∇f(x)),   ∇²g(y) = L_y^* ∘ ∇²f(x) ∘ L_y + ∇²ϕ_{∇f(x)}(y).
Proof. For any v ∈ T_y M, let c_v : I → M be a smooth curve satisfying c_v(0) = y, c_v′(0) = v, and c_v″(0) = 0 (e.g., let c_v be a geodesic), and let γ_v = ϕ ∘ c_v. Then
⟨∇g(y), v⟩ = (g ∘ c_v)′(0) = (f ∘ γ_v)′(0) = ⟨∇f(x), γ_v′(0)⟩ = ⟨∇f(x), L_y(v)⟩ = ⟨L_y^*(∇f(x)), v⟩.
Since this holds for all v ∈ T_y M, we conclude that ∇g(y) = L_y^*(∇f(x)). Next,
⟨∇²g(y)[v], v⟩ = (g ∘ c_v)″(0) = (f ∘ γ_v)″(0) = ⟨∇²f(x)[L_y(v)], L_y(v)⟩ + ⟨∇f(x), γ_v″(0)⟩,
where the first equality uses c_v″(0) = 0, see [8, §5.9]. On the other hand, for any w ∈ E,
⟨∇²ϕ_w(y)[v], v⟩ = (ϕ_w ∘ c_v)″(0) = ⟨w, γ_v″(0)⟩,
so taking w = ∇f(x) gives ⟨∇²g(y)[v], v⟩ = ⟨(L_y^* ∘ ∇²f(x) ∘ L_y + ∇²ϕ_{∇f(x)}(y))[v], v⟩. Since this holds for all v ∈ T_y M and both ∇²g(y) and L_y^* ∘ ∇²f(x) ∘ L_y + ∇²ϕ_{∇f(x)}(y) are self-adjoint linear maps on T_y M, we conclude that they are equal.
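As a sanity check on Lemma 2.19, the two formulas can be compared against finite differences in a simple Euclidean case. The sketch below assumes the rank factorization lift ϕ(L, R) = LR^⊤ on M = R^{m×r} × R^{n×r} and a hypothetical cost f(X) = ½‖X − A‖², for which ∇²f(x) is the identity; the cost, the dimensions, the step size `t`, and all variable names are illustrative assumptions. In this setting straight lines are geodesics, L_y(L̇, Ṙ) = L̇R^⊤ + LṘ^⊤, and ⟨∇²ϕ_w(y)[v], v⟩ = 2⟨w, L̇Ṙ^⊤⟩.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 4, 3, 2
A = rng.standard_normal((m, n))
L, R = rng.standard_normal((m, r)), rng.standard_normal((n, r))    # the point y
Ld, Rd = rng.standard_normal((m, r)), rng.standard_normal((n, r))  # the direction v

f = lambda X: 0.5 * np.sum((X - A) ** 2)   # hypothetical cost, Hessian = identity
g = lambda L, R: f(L @ R.T)                # g = f o phi

X = L @ R.T
grad_f = X - A                     # Euclidean gradient of f at x = phi(y)
Lv = Ld @ R.T + L @ Rd.T           # L_y(v) = D phi(y)[v]

# Formulas of the lemma, tested along the direction v.
grad_g_dot_v = np.sum(grad_f * Lv)                               # <L_y^*(grad f), v>
hess_g_vv = np.sum(Lv * Lv) + 2 * np.sum(grad_f * (Ld @ Rd.T))   # <Hess g [v], v>

# Central finite differences along the straight line t -> y + t v.
t = 1e-4
gp, g0, gm = g(L + t * Ld, R + t * Rd), g(L, R), g(L - t * Ld, R - t * Rd)
fd_first = (gp - gm) / (2 * t)
fd_second = (gp - 2 * g0 + gm) / t**2
```

The pairs (`fd_first`, `grad_g_dot_v`) and (`fd_second`, `hess_g_vv`) agree up to finite-difference error. The second term of `hess_g_vv` is the contribution of ∇²ϕ_{∇f(x)}(y), which is the only place where the curvature of the lift enters.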
We turn to proving our characterizations of "k ⇒ 1" for k = 1, 2 stated in Section 2.1.
Proof. The first claim follows from the fact that L_y(v) = (ϕ ∘ c_v)′(0) for a curve c_v as in Definition 2.9, and Definition 2.1 of the tangent cone T_x X. For the second claim, y is 1-critical for (Q) iff ∇g(y) = L_y^*(∇f(x)) = 0, or equivalently, ∇f(x) ∈ ker(L_y^*) = (im L_y)^⊥.
Preimages of stationary points on X are always 1-critical on M.
Proof. If x ∈ X is stationary for (P), then ∇f (x) ∈ (T x X ) * . Since T x X ⊇ im L y , taking duals on both sides we get that ∇f (x) ∈ (T x X ) * ⊆ (im L y ) ⊥ , hence y is 1-critical for (Q) by Lemma 2.20.
The converse to Proposition 2.21 is false in general. In fact, Example 2.14 shows that a lift need not even map local minima to stationary points on X .
Then ∇f(x) = w ∈ (im L_y)^⊥ so y is 1-critical for (Q) by Lemma 2.20, but ∇f(x) ∉ (T_x X)*, so x is not stationary for (P). Hence "1 ⇒ 1" is not satisfied at y.
In fact, the proof of Theorem 2.10 also characterizes the cost functions f for which y is 1-critical for (Q) but x is not stationary for (P).
Corollary 2.22. Suppose ϕ does not satisfy "1 ⇒ 1" at y, and denote x = ϕ(y). Then y is 1-critical for (Q) but x is not stationary for (P) if and only if ∇f(x) ∈ (im L_y)^⊥ \ (T_x X)*. In particular, if "1 ⇒ 1" is not satisfied at y, there is a linear cost function f for which y is 1-critical for (Q) but x is not stationary for (P).
As discussed in Section 2.1, Corollary 2.11 implies that "1 ⇒ 1" rarely holds on all of M. Nevertheless, "1 ⇒ 1" does usually hold at preimages of smooth points, that is, points near which X is a smooth embedded submanifold of E. Moreover, if "1 ⇒ 1" holds at such points then "local ⇒ local" holds there as well, which we proceed to show. Definition 2.23. A point x ∈ X is smooth if there is an open neighborhood U ⊆ E containing x such that U ∩ X is a smooth embedded submanifold of E. It is called nonsmooth or singular otherwise. The smooth locus of X , denoted X smth , is the set of smooth points of X .
Note that X smth is open in X in the subspace topology. For all equidimensional algebraic varieties, and all constraint sets in practical optimization problems we are aware of, X smth is itself a smooth embedded submanifold of E, and it is dense in X . For example, if X is the nodal cubic (1) shown in Figure 1, then X smth = X \ {(0, 0)}. If X = R m×n ≤r , then X smth = R m×n =r .
Proof. Let U ⊆ E be an open neighborhood of ϕ(y) in E such that U ∩ X is a smooth embedded submanifold of E. Since ϕ(M) = X , we have ϕ −1 (U ∩ X ) = ϕ −1 (U) =: V , which is open in M by continuity of ϕ. Therefore, V is also a smooth manifold, since it is an open subset of M, and ϕ| V : V → U ∩ X is a smooth map between smooth manifolds. Since U ∩ X is an embedded submanifold of E, the differential of ϕ| V viewed as a map V → E coincides with its differential viewed as a map V → U ∩ X . By Theorem 2.10, ϕ satisfies "1 ⇒ 1" at y iff L y = Dϕ(y) = D(ϕ| V )(y) is surjective, meaning ϕ| V is a submersion near y [31,Prop. 4.1]. By [31,Prop. 4.28], this implies ϕ is open at y, hence it satisfies "local ⇒ local" at y by Theorem 2.8.
The converse of Proposition 2.24 fails. For example, ϕ(y) = y^3 viewed as a map R → R satisfies "local ⇒ local" at y = 0 but not "1 ⇒ 1".
Remark 2.25. In the setting of Proposition 2.24, if ϕ satisfies "1 ⇒ 1" at y then it also satisfies "k ⇒ k" for all k ≥ 1. Here k-criticality for ϕ(y) on X_smth (or equivalently, on X) is defined in terms of curves as in Definition 2.4. This follows because ϕ is a smooth submersion between smooth manifolds near y, hence any curve passing through ϕ(y) is the image under ϕ of a curve passing through y, see [31, Thm. 4.26].
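The cubic counterexample stated before Remark 2.25 can be made concrete with a linear cost. The sketch below assumes f(x) = x (an illustrative choice, not specified in the text) and uses the chain rule by hand. Since X = R here, T_0 X = R and stationarity of x = 0 for (P) requires f′(0) = 0.

```python
def phi(y):
    return y ** 3            # the lift R -> R; a homeomorphism, hence open

def dphi(y):
    return 3 * y ** 2        # its derivative; vanishes at y = 0, so im L_0 = {0}

def f(x):
    return x                 # hypothetical linear cost with f'(x) = 1 everywhere

def g(y):
    return f(phi(y))         # lifted cost g = f o phi

df_at_x = 1.0                        # f'(0)
dg_at_y = df_at_x * dphi(0.0)        # g'(0) by the chain rule

is_one_critical = (dg_at_y == 0.0)       # y = 0 is 1-critical for (Q)
is_stationary_for_P = (df_at_x == 0.0)   # x = 0 is not stationary for (P)
```

With this choice, y = 0 is 1-critical for (Q) although x = 0 is not stationary for (P), so "1 ⇒ 1" fails at y = 0. At the same time y = 0 is not a local minimum of g, since g(−0.1) < g(0) < g(0.1), which is consistent with "local ⇒ local" holding for this open map.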
The only examples of lifts we know of that satisfy "1 ⇒ 1" everywhere are smooth maps between smooth manifolds that are submersions, which then also satisfy "local ⇒ local". Example 2.26 (Submersions). Examples of submersions in optimization, that is, lifts of the form ϕ : M → X where X is an embedded submanifold of E and im Dϕ(y) = T ϕ(y) X for all y ∈ M, include: • Quotient maps by smooth, free, and proper Lie group actions [8, §9], [2, §3.4], used in particular to optimize over Grassmannians by lifting to Stiefel manifolds [20].
• The map SO(p) → St(p, d) taking the first d columns of a matrix, which is used in the rotation averaging algorithm of the robotics paper [17], see Example 2.44 below.
Failure of a lift to satisfy "1 ⇒ 1" means that it may introduce spurious critical points. In the next section, we characterize the "2 ⇒ 1" property, which enables the use of second-order information to avoid these spurious points.
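We begin by introducing the set W_y appearing in Theorem 2.12 to characterize "2 ⇒ 1".
Definition 2.27. For y ∈ M and x = ϕ(y), define W_y = {w ∈ E : there exists a twice differentiable f : E → R such that ∇f(x) = w and y is 2-critical for (Q)}.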
We write W ϕ y when we wish to emphasize the lift.
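Proof. If ϕ satisfies "2 ⇒ 1" at y and w ∈ W_y, then y is 2-critical for (Q) for some f satisfying ∇f(x) = w, so x is stationary for (P), that is, w = ∇f(x) ∈ (T_x X)*. Conversely, suppose W_y ⊆ (T_x X)* and y is 2-critical for (Q) for some f. Then ∇f(x) ∈ W_y ⊆ (T_x X)*, so x is stationary for f. This shows "2 ⇒ 1".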
Since the 2-criticality of y for (Q) only depends on the first two derivatives of f , we can restrict the functions f in Definition 2.27 to be of class C ∞ or even quadratic polynomials whose Hessians are a multiple of the identity, as the following proposition shows.
Proposition 2.29. For y ∈ M and x = ϕ(y), we have In particular, W y is a convex cone.
Note that ∇f α (x) = w, and by Lemma 2.19, we have ∇g α (y) = L * y (w) = ∇g(y) = 0 and Thus, y is 2-critical for g α . This shows the reverse inclusion. W y is a convex cone since if w 1 , w 2 are in W y as witnessed by functions f 1 , f 2 , then any Proposition 2.29 shows that if "2 ⇒ 1" is not satisfied at y, then there exists a simple convex quadratic cost f for which y is 2-critical for (Q) but x = ϕ(y) is not stationary for (P).
We conjecture that the reverse inclusion in Theorem 2.28 always holds (it does for all the lifts in Table 1). If this is indeed true, then ϕ satisfies the "2 ⇒ 1" property at y if and only if (T_x X)* = W_y, neatly echoing the condition for "1 ⇒ 1", namely, (T_x X)* = (im L_y)^⊥.
The description of W y can be complicated. It is therefore worthwhile to derive sufficient conditions for "2 ⇒ 1" that are easier to check. A point x ∈ X is stationary for (P) if ∇f (x) ∈ (T x X ) * . To derive sufficient conditions for "2 ⇒ 1", we identify two sets whose duals contain ∇f (x) if x = ϕ(y) and y is 2-critical for (Q). The sufficient conditions then require the duals of these two subsets to be contained in (T x X ) * . We will see in Section 3 that these conditions are indeed often convenient to verify.
We write A ϕ y , B ϕ y when we wish to emphasize the lift.
We show in Proposition 2.40(a) below that A y = Q y (ker L y ) + im L y is precisely the set appearing in the first sufficient condition in Theorem 2.12. The following are the basic properties these two sets satisfy.
(a) A y and B y are cones, and B y is closed.
Proof. The proofs of part (a) and the second half of part (b) are given in Appendix B.
It is clear that A y ⊆ B y from Definition 2.32.
(c) Suppose y is 2-critical for g = f • ϕ and w ∈ B y . Let c i : Taking i → ∞, we conclude that ⟨∇f (x), w⟩ ≥ 0 and hence ∇f (x) ∈ B * y as claimed.
(d) If w ∈ W y then there exists f such that ∇f (x) = w and y is 2-critical for (Q), hence w ∈ B * y by part (c). The other inclusions follow by taking duals in part (b). We remark that neither the inclusion B y ⊆ T x X nor T x X ⊆ B y hold in general, see the end of Appendix B. By combining part (d) above with Theorem 2.28, we get the following sufficient conditions for "2 ⇒ 1".
holds at y. The next two examples illustrate the utility and the limitations of the above sufficient conditions. Example 2.35 (Cuspidal cubic). Consider the cuspidal cubic (see Figure 2 in Section 3.1) and its lift 4 The origin Since A y is a cone containing 0, we conclude that T x X ⊆ A y , and hence the lift satisfies "2 ⇒ 1" at y by Corollary 2.34.
Example 2.36 (The rank factorization lift). Consider the lift ϕ(L, R) = LR ⊤ of the bounded-rank matrix variety X = R m×n ≤r , where M = R m×r × R n×r is a Euclidean space. Let y = (L, 0) for some L with rank(L) = r and let x = ϕ(y) = 0. Because R m×n ≤r is a closed cone whose convex hull is all of R m×n , we have Since T x X is not a linear space, there is no smooth lift of R m×n ≤r satisfying "1 ⇒ 1" at y. If Thus, the sufficient conditions in Corollary 2.34(a) do not hold at y.
Nevertheless, it was shown in [22] that "2 ⇒ 1" holds everywhere on M. In Proposition 3.6 below, we use a similar argument to show B y contains all rank-1 m × n matrices, hence B * y = {0} = (T x X ) * and the sufficient condition Corollary 2.34(b) holds in this case. This example also shows that B y may be strictly larger than the closure of A y .
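As a numerical sanity check on Example 2.36, the following sketch evaluates first- and second-order maps of the rank factorization lift at y = (L, 0). It assumes that, because M is a linear space, L y and Q y may be taken to be the ordinary first and second derivatives of ϕ along straight lines, namely L y (L̇, Ṙ) = L̇R ⊤ + LṘ ⊤ and Q y (L̇, Ṙ) = 2L̇Ṙ ⊤ . The function names, dimensions, and random data are illustrative choices and do not come from the original text.

```python
import numpy as np

def phi(L, R):
    """Rank factorization lift phi(L, R) = L R^T."""
    return L @ R.T

def L_map(L, R, dL, dR):
    """First derivative of phi at (L, R) along (dL, dR) (assumed form of L_y)."""
    return dL @ R.T + L @ dR.T

def Q_map(dL, dR):
    """Second derivative of phi along t -> (L + t dL, R + t dR) (assumed form of Q_y)."""
    return 2.0 * dL @ dR.T

rng = np.random.default_rng(0)
m, n, r = 5, 4, 2
L = rng.standard_normal((m, r))   # full column rank with probability one
R = np.zeros((n, r))              # y = (L, 0), so x = phi(y) = 0
dL = rng.standard_normal((m, r))
dR = rng.standard_normal((n, r))

# Central differences along the straight line through y with velocity (dL, dR).
t = 1e-3
curve = lambda s: phi(L + s * dL, R + s * dR)
fd_first = (curve(t) - curve(-t)) / (2 * t)
fd_second = (curve(t) - 2 * curve(0.0) + curve(-t)) / t**2

err_first = np.linalg.norm(fd_first - L_map(L, R, dL, dR))
err_second = np.linalg.norm(fd_second - Q_map(dL, dR))

# Velocities in ker L_y have dR = 0 (L has full column rank), so Q_y vanishes there.
Q_on_kernel = Q_map(dL, np.zeros((n, r)))

# Every element of im L_y = {L dR^T} has its column space inside col(L).
P_L = L @ np.linalg.pinv(L)       # orthogonal projector onto col(L)
residual = np.linalg.norm((np.eye(m) - P_L) @ L_map(L, R, dL, dR))

print(err_first, err_second, np.linalg.norm(Q_on_kernel), residual)
```

Under these assumptions, A y = Q y (ker L y ) + im L y reduces to im L y , a proper linear subspace of R m×n , which is consistent with the failure of the sufficient condition in Corollary 2.34(a) at y.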
The sufficient condition in Corollary 2.34(b) holds in all the examples satisfying "2 ⇒ 1" that we consider, and is convenient to check for the various lifts of low-rank matrices we study in Section 3.3. Still, we conjecture the following.

Conjecture 2.37. The sufficient condition in Corollary 2.34(b) is not necessary in general.
We can now prove Theorem 2.12, stating the chain of implications we find the most useful for verifying or refuting "2 ⇒ 1" in the examples of Section 3.
Proof of Theorem 2.12. As we show in Proposition 2.40(a) below, the first condition is just T x X ⊆ A y . The second condition is the equivalence of this inclusion to T x X = A y which follows by Proposition 2.33(b). The second condition implies the third by Proposition 2.33(b). The third condition implies the fourth by Proposition 2.33(d), and it is equivalent to "2 ⇒ 1" at y by Theorem 2.28.
The last implication gives a necessary condition for "2 ⇒ 1" to hold. Suppose there exists where the second equality follows from the chain rule, the third equality follows from Lemma 2.38(a) below, and the inequality follows from w ∈ (Q(T y M)) * . Thus, y is 2-critical for (Q). However, ∇f (x) = w ∉ (T x X ) * so x is not stationary for (P), hence "2 ⇒ 1" does not hold at y.
Our goal now is to derive more explicit expressions for the sets A y , B y , W y that appear in Theorem 2.12 in terms of the maps L y and Q y from Definition 2.9. We first characterize the ambiguity in Q y (v) arising from different choices of curves c v in that definition.
Proof. (a) For any w ∈ E, recall the function ϕ w (y) = ⟨w, ϕ(y)⟩ from Definition 2.18. Let c : I → M be a curve satisfying c(0) = y and c ′ (0) = v. Then, on the one hand, On the other hand, using the Riemannian structure on M, By Lemma 2.19, we have ∇ϕ w (y) = L * y (w), so ⟨∇ϕ w (y), c ′′ (0)⟩ = ⟨w, L y (c ′′ (0))⟩. We conclude that Therefore, for any w ∈ E we have which proves the claim.
Lemma 2.38 shows that the possible values of Q y (v) (depending on the choice of curve c v in Definition 2.9) differ by an element of im L y , and conversely, every element of Q y (v) + im L y can be obtained by an appropriate choice of c v . Consequently, if w ∈ (im L y ) ⊥ , then the inner product ⟨w, Q y (v)⟩ is independent of the choice of c v in Definition 2.9. In fact, (5) shows that it is a quadratic form in v ∈ T y M given by: We stress that this identity requires w ∈ (im L y ) ⊥ in general. It allows us to view ⟨w, Q y (v)⟩ interchangeably as either a quadratic form in v on T y M or a linear form in w on (im L y ) ⊥ .
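To make this ambiguity concrete, consider the special case (an assumption made only for this illustration) in which M is a linear space with its Euclidean structure, so that L y = Dϕ(y) and Q y (v) is the second derivative of ϕ • c v at 0 as in Definition 2.9. For any smooth curve c with c(0) = y and c ′ (0) = v, the chain rule gives:

```latex
\[
  (\phi \circ c)''(0)
  \;=\; \mathrm{D}^2\phi(y)[v, v] \;+\; \mathrm{D}\phi(y)\big[c''(0)\big]
  \;=\; \mathrm{D}^2\phi(y)[v, v] \;+\; L_y\big(c''(0)\big).
\]
```

Two curves c and c̃ through y with the same velocity v therefore produce values of Q y (v) that differ by L y (c ′′ (0) − c̃ ′′ (0)) ∈ im L y , and pairing with any w ∈ (im L y ) ⊥ removes the curve-dependent term.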
Remark 2.39 (Disambiguation of Q y ). Given the above ambiguity in Q y , it would be natural to define the codomain of Q y to be the quotient vector space E/ im L y . This would make it independent of the choice of c v . Subsets of the quotient E/ im L y are in bijection with subsets of E that are closed under addition with im L y (i.e. subsets S ⊆ E such that S + im L y = S), which includes A y and B y . Subsets of the dual vector space to E/ im L y are in bijection with subsets of (im L y ) ⊥ , which includes W y . Hence we could equivalently phrase our conditions for "2 ⇒ 1" in terms of subsets of E/ im L y and its dual. However, we have several techniques to obtain expressions for Q y without explicitly choosing curves, see Section 2.5 below. Thus, Definition 2.9 mirrors the computations we do more closely. In practice, we view Q y as taking values in E, and consider two maps differing by elements of im L y as equivalent for the purpose of verifying "2 ⇒ 1" since they yield the same sets A y , B y , W y .
We now express the sets A y , B y , W y appearing in our conditions for "2 ⇒ 1" in terms of the maps L y and Q y . We explain how to compute L y and Q y in various settings in Section 2.5 below.
Remark 2.41. We always have but inclusion may be strict depending on the curves c v in Definition 2.9. In practice, the expressions for Q y we obtain using our techniques from Section 2.5 below are smooth, and the subset of B y in (7) is large enough to prove "2 ⇒ 1" in every example where it holds.
) + im L y ) = w + im L y exists as well, and w is contained in this limit. This shows the inclusion ⊆ in the claim.
so w ∈ B y and hence the reverse inclusion in the claim also holds.
(c) Let x = ϕ(y). By Proposition 2.29, a vector w ∈ E is contained in W y iff there exists α > 0 such that y is 2-critical for (Q) with cost In other words, w ∈ W y iff w ∈ (im L y ) ⊥ and ∇ 2 ϕ w (y) + αL * y • L y ⪰ 0 for some α > 0. To understand when the second condition holds, we decompose T y M = ker L y ⊕ (ker L y ) ⊥ and express the relevant self-adjoint operators on T y M in block matrix form with respect to a basis compatible with this decomposition. More explicitly, choose a basis as described above and denote n = dim ker L y and m = dim(ker L y ) ⊥ . Assume first that m > 0. We can represent ∇ 2 ϕ w (y) and αL * y • L y as Thus, By the generalized Schur complement theorem [50,Thm. 1.20], the positive semidefiniteness of the above block matrix is equivalent to . We deduce the following expression for W y : with Φ 1 and Φ 2 as defined above. We now work out basis-free characterizations of the properties (6) shows that this is also equivalent to ⟨w, Q y (v)⟩ ≥ 0 for all v ∈ ker L y , which is in turn equivalent to ⟨w, Q y (v) + L y (u)⟩ ≥ 0 for all v ∈ ker L y and u ∈ T y M. This last condition is just w ∈ A * y by part (a). Second, we must understand for which vectors w it holds that im In basis-free terms, we have shown that if Putting everything together, where the last equality holds because A * y ⊆ (im L y ) ⊥ by Proposition 2.33(b). This is the claimed expression for W y .
If m = 0, or equivalently, if L y = 0, then w ∈ W y iff w ∈ (im L y ) ⊥ = E and ∇ 2 ϕ w (y) ⪰ 0. This in turn is equivalent to w ∈ A * y = (Q y (ker L y )) * ∩ (im L y ) ⊥ by (6), so W y = A * y in this case. Conversely, if w ∈ A * y and L y = 0 then ∇ 2 ϕ w (y) ⪰ 0 so the condition in the claimed expression for W y is satisfied automatically. This gives the claimed expression for W y for m = 0 as well.

Composition of lifts
In this section, we ask when the lift properties we have been studying are preserved under composition. We use the following proposition both to compute L y and Q y in various settings, and to study some of the lifts appearing in the literature in Sections 3-4.
Proposition 2.42. Let ϕ : M → X be a smooth lift, and let ψ : N → M be a smooth map between smooth manifolds such that ϕ • ψ : N → X is surjective. Both ϕ and ϕ • ψ are smooth lifts for X . For z ∈ N and y = ψ(z) ∈ M, the following hold.
(a) If ϕ • ψ satisfies "local ⇒ local" at z, then ϕ satisfies "local ⇒ local" at y. If ψ is open (in particular, if ψ is a submersion) at z, and if ϕ satisfies "local ⇒ local" at y, then ϕ • ψ satisfies "local ⇒ local" at z.
(c) If ψ is a submersion at z, then The proof is straightforward but long, and is deferred to Appendix C.1. Here we denote v ≡ w mod im L y to mean v − w ∈ im L y . By Remark 2.39, equality of Q ϕ•ψ z and Q ϕ y modulo im L ϕ y = im L ϕ•ψ z means that either one can be used to verify "2 ⇒ 1". Proposition 2.42 shows that, given a smooth lift ϕ : M → X , there is no benefit to further lifting M to another smooth manifold through ψ : N → M in terms of our properties. Indeed, if ϕ does not satisfy one of our properties, then neither does ϕ • ψ for any smooth ψ (we cannot 'fix' a bad lift by lifting it further). On the other hand, this proposition also tells us that our properties, as well as the sets involved in their characterization, are preserved under submersions. This allows us to check these properties on a chart for M. For lifts to a manifold M which is a quotient of another manifold M̄ (these arise naturally when quotienting by group actions, see [8, §9]), the proposition allows us to verify our properties on M̄, which is often easier.
Remark 2.43. If ψ : N → M is a submersion, for each z ∈ N let V z = ker Dψ(z) and H z = (ker Dψ(z)) ⊥ be the so-called vertical and horizontal spaces at z, which satisfy T z N = where D ϕ(z) and D 2 ϕ(z) are the ordinary first-and second-order derivative maps of ϕ viewed as a map between linear spaces E ′ → E. In particular, if M is itself a linear space, we may take U = E ′ = M and ψ = id and use (8) with ϕ = ϕ.
M embedded in a linear space: Suppose now that M is an embedded Riemannian submanifold of another linear space E ′ . By [8,Prop. 3.31], the lift ϕ can be extended to a smooth map on a neighborhood V of M in E ′ , denoted by ϕ̄ : V → E. This means ϕ̄ is a smooth map defined on an open subset V ⊆ E ′ containing M and satisfies ϕ̄| M = ϕ. If Denote by u v = c̈ v (0) the ordinary (extrinsic) acceleration of c v viewed as a curve in E ′ . Then Definition 2.9 together with the chain rule give Suppose further that M is given locally near y as the zero set of a smooth map h : By the chain rule, the latter two equations can be written as Conversely, for any v, u v ∈ E ′ satisfying (9) and for any extension ϕ̄ of ϕ. Note that T 2 y,v M is an affine subspace of E ′ which is a translate of the subspace T y M, as can be seen from (10). Therefore, different choices of u v lead to different expressions for Q y which are equal modulo im L y .
M as a quotient manifold: Suppose next that M is a quotient manifold of M̄ with quotient map π : M̄ → M. Then ϕ̄ = ϕ • π gives a smooth lift of X to M̄. Since π is a submersion, Proposition 2.42 and Remark 2.43 imply that to check "2 ⇒ 1", we need only compute L z and Q z for ϕ̄ restricted to the horizontal spaces (ker Dπ(z)) ⊥ using the preceding two methods.
Computing ∇ 2 ϕ w : To check the equivalent condition in Theorem 2.12 for any presentation of M, we need to compute W y . If we use Proposition 2.40 to do so, we need an expression for the Riemannian Hessian ∇ 2 ϕ w (y) where ϕ w (y) = ⟨w, ϕ(y)⟩ for w ∈ (im L y ) ⊥ . Given Q y , we can obtain ∇ 2 ϕ w (y) as the unique self-adjoint operator on T y M that defines the quadratic form (6). Conversely, if we compute ∇ 2 ϕ w (y) for all w ∈ (im L y ) ⊥ , e.g., using the techniques from [8, §5.5], we can set Q y (v) to be the unique element of (im L y ) ⊥ satisfying (6), providing another way to compute Q y .
Remark 2.45. A natural choice of curve c v in Definition 2.9 would be one that has zero initial acceleration, i.e. c ′′ v (0) = 0. This choice gives Q y (v) ∈ (im L y ) ⊥ , and can also be obtained by choosing u v in (11) to be the minimum norm solution to the system of equations defining T 2 y,v M in (10), or by deriving Q y from ∇ 2 ϕ w (y) using (6). Incidentally, this minimum norm solution is called the second fundamental form of M in E ′ . Even though this choice of curve c v is natural theoretically, the resulting expressions for Q y may be unnecessarily complicated, as Example 2.46 below illustrates.
We now illustrate the above techniques for computing L y and Q y on two examples. We first show an example using charts.
Example 2.46 (Desingularization of R m×n ≤r ). Consider the lift, proposed in [28], of X = R m×n ≤r to the so-called desingularization of R m×n ≤r : Here Gr(n, n − r) is the Grassmannian of (n − r)-dimensional subspaces of R n [8, §9]. We compute L and Q using charts. For Since rank(Y 0 ) = n − r, we can find n − r linearly independent rows in Y 0 . Let J ∈ R (n−r)×(n−r) be the invertible submatrix of Y 0 obtained by extracting these rows, and let Π ∈ R n×n be a permutation matrix sending those n − r rows to the first rows. Then where the second identity is implied by 0 = X 0 Y 0 J −1 = X 0 Π ⊤ (ΠY 0 J −1 ). Accordingly, a chart ψ : R m×r × R r×(n−r) → M containing (X 0 , S 0 ) is given by Composing with ϕ, we obtain the lift ϕ(Z, W ) = ϕ(ψ(Z, W )) = −ZW, Z Π, and by (8), is the ordinary Euclidean Hessian of ϕ V (Z, W ) = V, ϕ(Z, W ) , given by We use the above expressions to show that this lift satisfies "2 ⇒ 1" everywhere on M in Proposition 3.7 below. Alternatively, we can compute L and Q by viewing M as a quotient manifold of an embedded submanifold, and compute ∇ 2 ϕ V using (6). We carry out this alternative approach and compare it to the charts-based one above in Appendix D.1. It is much harder to use the resulting expressions to check the equivalent condition for "2 ⇒ 1" in Theorem 2.12.

Examples
In this section, we illustrate how our theory can be used to verify properties of various concrete lifts. For each of the lifts below, we ask whether "local ⇒ local", "1 ⇒ 1", and "2 ⇒ 1" hold.

Curves
We begin with two classic examples of nonsmooth plane curves: the cuspidal and nodal cubics. Both have singularities at the origin, though the two singularities have different types, called a cusp and node respectively [23, §20]. They are shown in Figure 2, together with their tangent cone at the origin. We have already encountered them in Examples 2.14 and 2.35, where we proved some of their properties using ad hoc arguments. They can be easily treated systematically using our framework, computing L and Q using the expressions (11) for embedded submanifolds. Specifically, the following two propositions follow from short computations.  Next, we consider lifts of matrices and tensors of bounded rank.

PSD matrices
We consider the set of bounded-rank PSD matrices with r < n together with its lift to M = R n×r via the factorization map ϕ(R) = RR ⊤ . This is the parametrization of X (possibly intersected with an affine subspace) proposed by Burer and Monteiro to solve certain SDPs [10]. Viewing this parametrization as a smooth lift, we ask whether it satisfies our desirable properties.
Moreover, the sufficient condition A R = T RR ⊤ X for "2 ⇒ 1" holds everywhere, and where in the last line U ∈ O(n) is an eigenmatrix for X satisfying X = UΣU ⊤ with Σ ∈ R n×n diagonal with the eigenvalues of X on the diagonal in descending order.
Proof. For "local ⇒ local", we sketch the proof from [11] of (in our terminology) SLP, the condition from Remark 2.15 that is equivalent to "local ⇒ local" by Theorem A.2. Fix R ∈ M and X = RR ⊤ . For any sequence (X i ) i≥1 ⊆ X converging to X, let X i = U i Σ i U ⊤ i be a size-r eigendecomposition for each X i (so Σ i ∈ R r×r is a diagonal matrix of eigenvalues of X i , possibly including zeros). Let R i = U i Σ 1/2 i and note that R i = X i 1/2 which is bounded, hence after passing to a subsequence we may assume that lim i R i = R exists. By continuity of ϕ, we have For V ∈ R n×n and X = RR ⊤ ∈ X , define where we used the fact that col(R) = col(X) = ker(X) ⊥ . The tangent cone at X to S n 0 is given by [24,Eq. (9)] T X S n 0 = {V ∈ S n : V u, u ≥ 0, for all u ∈ ker(X)} = {V ∈ S n : V ⊥ 0} and the tangent cone at X to R n×n ≤r is given by [43,Thm. 3.2] T X R n×n Hence the intersection T X R n×n ≤r ∩ T X S n 0 is given by the claimed expression. Furthermore, the tangent cone to an intersection is always included in the intersection of the tangent cones, which follows easily from Definition 2.1. Hence T X X ⊆ T X R n×n ≤r ∩ T X S n 0 and it suffices to show the reverse inclusion. We do so simultaneously with proving "2 ⇒ 1".

Remark 3.5.
Finding an explicit expression for tangent cones can be difficult in general. In Proposition 3.4, the set X was an intersection of two sets whose tangent cones are known, namely R n×n ≤r and S n 0 , which gave us an inclusion T X X ⊆ T X R n×n ≤r ∩ T X S n 0 . However, the tangent cone to an intersection can be strictly contained in the intersection of the tangent cones. 6 The proof of "2 ⇒ 1" in Proposition 3.4 proceeds by showing A R = T X R n×n ≤r ∩T X S n 0 , which gives T X X = T X R n×n ≤r ∩ T X S n 0 = A R because A R ⊆ T X X by Proposition 2.33(b). This simultaneously gives us "2 ⇒ 1" and an expression for the tangent cone.
The proof of Proposition 3.4 illustrates a more general, and as far as we know novel, technique for obtaining expressions for tangent cones using lifts. Generalizing the above discussion, if we have an inclusion T x X ⊆ S for some set S and we are able to prove A y ⊇ S for some y ∈ ϕ −1 (x), then we must have T x X = S by Proposition 2.33(b). In this case, we also conclude that "2 ⇒ 1" holds at y by Theorem 2.12. In Section 4, we shall see another setting in which we naturally have a superset for T x X (see Lemma 4.8), and which allows us to derive expressions for T x X from lifts satisfying "1 ⇒ 1" and "2 ⇒ 1". If X is defined by polynomial equalities and inequalities, a particular superset of the tangent cone, called the algebraic tangent cone, can be computed using Gröbner bases [16, §9.7].
Incidentally, a general condition implying that the tangent cone to an intersection is the intersection of the tangent cones is given in [41,Thm. 6.42]. That condition does not apply to X = R n×n ≤r ∩ S n 0 because R n×n ≤r is not Clarke-regular in the sense of [41, Def. 6.4]. Our approach circumvents Clarke regularity, exploiting the existence of an appropriate lift instead.
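The construction used in the proof of Proposition 3.4 for "local ⇒ local", namely R i = U i Σ i 1/2 built from a size-r eigendecomposition of X i , can be sketched numerically as follows. The helper name psd_factor, the test sequence, and the use of the spectral norm to check boundedness are assumptions made for this illustration.

```python
import numpy as np

def psd_factor(X, r):
    """Return R (n x r) with R R^T = X, assuming X is symmetric PSD of rank <= r."""
    eigvals, eigvecs = np.linalg.eigh(X)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:r]           # indices of the r largest eigenvalues
    sigma = np.clip(eigvals[top], 0.0, None)      # guard against tiny negative round-off
    return eigvecs[:, top] * np.sqrt(sigma)       # U Sigma^{1/2}

rng = np.random.default_rng(1)
n, r = 6, 3
B = rng.standard_normal((n, 2))
X = B @ B.T                                       # limit point with rank 2 < r

reconstruction_errors = []
norm_gaps = []
for i in range(1, 6):
    # A PSD matrix of rank at most r that approaches X as i grows.
    C = np.hstack([B + rng.standard_normal((n, 2)) / 10**i,
                   rng.standard_normal((n, 1)) / 10**i])
    X_i = C @ C.T
    R_i = psd_factor(X_i, r)
    reconstruction_errors.append(np.linalg.norm(R_i @ R_i.T - X_i))
    # In the spectral norm, the factor has norm equal to the square root of that of X_i.
    norm_gaps.append(abs(np.linalg.norm(R_i, 2) - np.sqrt(np.linalg.norm(X_i, 2))))

print(max(reconstruction_errors), max(norm_gaps))
```

Because the factors R i stay bounded whenever the X i do, a convergent subsequence can be extracted, which is the step the proof relies on.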

General matrices
Next, we consider the analogous lift for R m×n ≤r given by ϕ(L, R) = LR ⊤ , defined on the linear space M = R m×r × R n×r . Throughout this section, we assume r < min{m, n}. We have already encountered this lift, which we call the rank factorization lift, in Example 2.36. The proof of the following proposition is given in Appendix D.2.  • "local ⇒ local" at (X, S) if and only if rank(X) = r; the same is true for "1 ⇒ 1".
• "2 ⇒ 1" everywhere on M.

6 For example, consider intersecting the circle in the plane with one of its tangent lines.
Yet another natural lift of R m×n ≤r is given by the SVD: To make M a smooth manifold, we do not restrict σ to be nonnegative. In [36, §6.3], the authors observed that Riemannian gradient descent running on the SVD lift (17) gets stuck in a suboptimal point for a certain matrix completion problem. In contrast, they observed that if they allow the middle factor in the SVD lift to be symmetric but possibly nondiagonal, the same algorithm converges to the global optimum from the same initialization.
To help elucidate this empirical behavior, we also study the corresponding modified lift  If we further allow the middle factor to be non-symmetric, then "local ⇒ local" and "1 ⇒ 1" hold everywhere on the preimage of R m×n =r .

Tensors
We are not aware of a lift of R m×n ≤r induced by matrix factorizations with three or more factors which satisfies "2 ⇒ 1". In fact, we can prove some negative results for lifts that are multilinear in more than two factors. Such lifts also arise naturally when parametrizing various sets of tensors and linear neural networks. The proof of the following proposition is given in Appendix D.5. Proposition 3.9. Suppose ϕ : M → X ⊆ E is a smooth lift where M ⊆ E ′ = E 1 × · · · × E d is a smooth embedded submanifold of a product of Euclidean spaces E i , and ϕ is defined on all of E ′ and is multilinear in its d arguments. If M contains a point (y 1 , . . . , y d ) such that y i = 0 for three indices i, and 0 = ϕ(y 1 , . . . , y d ) is not an isolated point of X , then ϕ does not satisfy "2 ⇒ 1" at (y 1 , . . . , y d ).
• CP-tensors: where (X i ) :,j denotes the jth column of X i .
If d ≥ 3 then none of the above lifts satisfy "2 ⇒ 1" at points in M with at least three zero factors.
Proposition 3.9 might suggest that failure of "2 ⇒ 1" can be avoided by normalizing the arguments of the lift to have unit norm. Specifically, by multilinearity of ϕ we have Using this observation, one could replace a lift ϕ : R n 1 × · · · × R n d → X to a product of Euclidean spaces by a lift ψ : R × S n 1 −1 × · · · × S n d −1 to a product of R and several spheres, satisfying ψ(λ, x 1 , . . . , x d ) = λϕ(x 1 , . . . , x d ). Only one factor can be zero in this new lift, so Proposition 3.9 does not apply and we might hope that "2 ⇒ 1" holds. Unfortunately, this may not resolve the problem as there is another obstruction to "2 ⇒ 1" that does not rely on several zero factors, at least for the following specific form of a lift. The proofs of the next proposition and its corollary are also given in Appendix D.5.
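The rescaling behind this normalization can be checked numerically. The sketch below uses the rank-one order-3 tensor lift as an illustrative multilinear ϕ (an assumption, chosen only for concreteness) and verifies that ϕ(x 1 , x 2 , x 3 ) equals ψ(λ, u 1 , u 2 , u 3 ) = λϕ(u 1 , u 2 , u 3 ) with u i = x i /‖x i ‖ and λ the product of the norms.

```python
import numpy as np

def phi(x1, x2, x3):
    """Rank-one order-3 tensor x1 (x) x2 (x) x3, multilinear in its three arguments."""
    return np.einsum("i,j,k->ijk", x1, x2, x3)

def psi(lam, u1, u2, u3):
    """Normalized lift psi(lam, u1, u2, u3) = lam * phi(u1, u2, u3) on R x spheres."""
    return lam * phi(u1, u2, u3)

rng = np.random.default_rng(2)
xs = [rng.standard_normal(k) for k in (3, 4, 5)]   # nonzero factors with probability one
norms = [np.linalg.norm(x) for x in xs]
units = [x / nx for x, nx in zip(xs, norms)]        # points on the unit spheres
lam = float(np.prod(norms))

# Multilinearity: pulling the norms out of each factor leaves the tensor unchanged.
rescaling_gap = np.linalg.norm(phi(*xs) - psi(lam, *units))

# With unit factors, the tensor norm equals |lam|, so psi vanishes only when lam = 0.
norm_gap = abs(np.linalg.norm(psi(lam, *units)) - abs(lam))

print(rescaling_gap, norm_gap)
```

In the normalized lift the only way to reach the zero tensor is λ = 0, which is why at most one factor can vanish.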
(c) Consider the normalized CP decomposition lift: The case r = 1 with λ > 0 has been studied in [46]. In the general case, this lift does not satisfy "2 ⇒ 1" whenever CP-rank(ϕ(λ 1 , Y 1 , . . . , Y d )) < r, col(Y j ) ⊥ = {0} for all j, and either d ≥ 3 or d = 2 and λ = 0. 7 The symmetric rank of a symmetric tensor X, denoted sym-rk(X), is the smallest r ∈ N such that there exists a decomposition of the form X = r i=1 λ i v ⊗r i for some λ i ∈ R and v i ∈ R n .

Lift construction via fiber products, more examples
In this section, we give a systematic construction of lifts for a large class of sets X . If the resulting lifted space is a smooth manifold, we also give conditions under which the lift satisfies our desirable properties. Moreover, under these conditions we can obtain expressions for the tangent cones to X . We shall see that several natural lifts are special cases of this construction.
Suppose the set X is presented in the form where Z ⊆ E ′ is some subset of a linear space and F : E → E ′ is smooth. This form is general-any set X can be written in this form by letting F be the identity and Z = X . However, we shall see that our framework is most useful when Z is a product of simple sets for which we have smooth lifts satisfying our desirable properties. For example, any set defined by k smooth equalities g i (x) = 0 and ℓ smooth inequalities h j (x) ≥ 0 can be written in this form by letting F (x) = (g 1 (x), . . . , g k (x), h 1 (x), . . . , h ℓ (x)) and Z = {0} k × R ℓ ≥0 . We can also incorporate semidefiniteness and rank constraints of smooth functions of x by taking Cartesian products of Z with R m×n ≤r or S n 0 . Suppose now that we have a smooth lift ψ : N → Z. We can use this lift of Z to construct a lift of X by taking the fiber product of F and ψ. Here M F,ψ is the (set-theoretic) fiber product of the maps F : E → E ′ and ψ : N → E ′ .
The following commutative diagram illustrates Definition 4.1. Its top horizontal arrow is the coordinate projection (x, y) → y.
The fiber product M F,ψ need not be a smooth manifold even when both F and ψ are smooth maps between smooth manifolds. For our purposes, we shall make a more restrictive assumption: We proceed to give some examples of the above construction. We then study fiber product lifts in general and instantiate our results on these examples.
Example 4.3 (Sphere to ball). Let X = {x ∈ R n : ‖x‖ 2 ≤ 1} be the unit Euclidean ball. Let Z = R ≥0 and F (x) = 1 − x ⊤ x so X = F −1 (Z). Let ψ : R → R ≥0 be the smooth lift ψ(y) = y 2 . Then M F,ψ = {(x, y) ∈ R n × R : 1 − x ⊤ x = y 2 }, which is just the unit sphere in R n+1 , and ϕ(x, y) = x is projection onto the first n coordinates. This lift is used in [39, §2.7] to apply a solver for quadratic programming over the sphere (Q) to quadratic programs over the ball (P). 0) where superscript ⊙2 denotes entrywise squaring. Then and ϕ(x, y) = x. It is easy to check that M F,ψ is a smooth manifold, and that ψ̃ : S n−1 → M F,ψ given by ψ̃(y) = (y ⊙2 , y) is a diffeomorphism from the unit sphere S n−1 to M F,ψ . By Proposition 2.42, the fiber product lift of the simplex is equivalent (for the purposes of checking our desirable properties) to the lift ϕ̃ = ϕ • ψ̃ sending a unit vector y ∈ S n−1 to y ⊙2 . Li et al. [35] study this particular lift for optimization of (P) through (Q) for convex f .
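The two lifts just described can be sanity-checked on random samples. The sketch below assumes the simplex is the standard one (nonnegative entries summing to one); the sample sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4

# Sphere to ball: M = {(x, y) : 1 - x^T x = y^2} is the unit sphere in R^{n+1},
# and phi(x, y) = x keeps the first n coordinates.
z = rng.standard_normal((1000, n + 1))
z /= np.linalg.norm(z, axis=1, keepdims=True)      # points on the unit sphere in R^{n+1}
x, y = z[:, :n], z[:, n]
fiber_residual = np.max(np.abs((1.0 - np.sum(x * x, axis=1)) - y**2))  # F(x) - psi(y)
max_ball_norm = np.max(np.linalg.norm(x, axis=1))  # images phi(x, y) = x lie in the ball

# Sphere to simplex: a unit vector y in S^{n-1} is sent to its entrywise square.
u = rng.standard_normal((1000, n))
u /= np.linalg.norm(u, axis=1, keepdims=True)
s = u**2
simplex_sum_gap = np.max(np.abs(s.sum(axis=1) - 1.0))
simplex_min_entry = s.min()

print(fiber_residual, max_ball_norm, simplex_sum_gap, simplex_min_entry)
```

Points with y = 0 in the first lift map to the boundary of the ball, and points with some zero coordinate in the second lift map to the relative boundary of the simplex.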

Assumption 4.2 guarantees that this is a smooth manifold. This assumption holds for generic
and also for a number of applications of interest [9]. It is used in [9] to prove that the nonconvexity in the Burer-Monteiro method is benign by (in our terminology) proving "2 ⇒ 1" for this lift and linear costs f . Now that we have seen several examples of fiber product lifts, we ask when desirable properties of the lift ψ : N → Z imply the corresponding properties for the fiber product lift ϕ : M F,ψ → X . This is answered by the next few propositions.
which is open in X as the intersection of two open sets. Thus, ϕ is open.
Note that the above proof, and hence the conclusion of Proposition 4.7, apply more generally whenever M F,ψ is endowed with the subspace topology induced from E × N (but is not necessarily a smooth manifold) and when all maps involved are continuous (but not necessarily smooth).
We now turn to studying "1 ⇒ 1" and "2 ⇒ 1". Along the way, we give another instance of the technique for finding tangent cones via lifts outlined in Remark 3.5. To do so, we begin by giving a superset of the tangent cone that is obtained from the fact that X is given by an inverse image [41,Thm. 6.31].
Lemma 4.8. The following inclusion always holds: Proof. If v ∈ T x X then by Definition 2.1 there exist a sequence (x i ) i≥1 ⊆ X converging to x and (τ i ) i≥1 ⊆ R >0 converging to zero satisfying v = lim i→∞  = (0, 0). If x = (0, 0) then the above set is all of R 2 while T x X is the nonnegative half of the x 1 -axis (see Figure 2). Proposition 4.9. Under Assumption 4.2, if ψ satisfies "1 ⇒ 1" at y ∈ N , then ϕ satisfies "1 ⇒ 1" at (x, y) ∈ M F,ψ , and equality holds in Lemma 4.8.

Using Lemma 4.8 and Proposition 2.33(b), we get the chain of inclusions
We conclude that all these sets are equal and that "1 ⇒ 1" holds for ϕ at (x, y). Proposition 4.10. Under Assumption 4.2, if ψ satisfies the sufficient condition A ψ y = T ψ(y) Z for "2 ⇒ 1" at y ∈ N , then ϕ satisfies the sufficient condition A ϕ (x,y) = T x X for "2 ⇒ 1" at (x, y) ∈ M F,ψ , and equality holds in Lemma 4.8.
Remark 4.11. We do not know whether ϕ satisfies "2 ⇒ 1" under the assumption that ψ satisfies one of the weaker sufficient conditions for "2 ⇒ 1" in Theorem 2.12.
We remark that other sufficient conditions for equality in Lemma 4.8 to be achieved are given in [41,Exer. 6.7,Thm. 6.31]. However, they do not apply to Example 4.6 (Z is not Clarke-regular and DF (X) may not be surjective). In contrast, our approach via lifts does apply to this example, and gives "2 ⇒ 1" and an expression for the tangent cones simultaneously, see Corollary 4.13 below.
As the examples in the beginning of this section illustrate, Z is often given by a product. It is therefore useful to note that a product of lifts satisfying our desirable properties also satisfies the same properties: . . , k are subsets admitting smooth lifts ψ i : N i → Z i . Let Z = Z 1 × · · · × Z k and ψ = ψ 1 × · · · × ψ k : N 1 × · · · × N k → Z, which is a smooth lift of Z. Then the following hold.
(b) ψ satisfies "local ⇒ local" at (y 1 , . . . , y k ) if and only if ψ i satisfies "local ⇒ local" at y i for all i.
(e) ψ satisfies the sufficient condition A ψ (y 1 ,...,y k ) = T (y 1 ,...,y k ) Z for "2 ⇒ 1" if ψ i satisfies the corresponding conditions A ψ i y i = T y i Z i for all i. The proof is given in Appendix C.2. Equality in Proposition 4.12(a) is achieved when each Z i is Clarke-regular at z i [41,Prop. 6.41]. By Remark 2.39, equality of Q ψ (y 1 ,...,y k ) and Q ψ 1 y 1 × · · · × Q ψ k y k modulo im L ψ (y 1 ,...,y k ) means that either one can be used to verify "2 ⇒ 1". We can now revisit the examples from the beginning of this section.
Corollary 4.13. The preceding lifts satisfy the following.
• The sphere to ball lift in Example 4.3 satisfies "local ⇒ local" everywhere, "1 ⇒ 1" at y if and only if y ≠ 0 (i.e., at preimages of the interior of the ball), and "2 ⇒ 1" everywhere.
• The sphere to simplex lift in Example 4.4 satisfies "local ⇒ local" everywhere, "1 ⇒ 1" at y if and only if y i ≠ 0 for all i (i.e., at preimages of points in the relative interior of the simplex), and "2 ⇒ 1" everywhere.
• The lift of the annulus in Example 4.5 satisfies "local ⇒ local" everywhere, "1 ⇒ 1" at y if and only if y 1 , y 2 ≠ 0 (i.e., at preimages of points in the interior of the annulus), and "2 ⇒ 1" everywhere.
• The Burer-Monteiro lift of Example 4.6 under the smoothness assumption satisfies "local ⇒ local" everywhere, "1 ⇒ 1" at Y if and only if rank(Y ) = r (i.e., at preimages of points of rank r), and "2 ⇒ 1" everywhere. Moreover, we get the following expression for the tangent cones to X : Note that an expression for T X S n 0 ∩ T X R n×n ≤r is derived in Proposition 3.4 (incidentally, also as a consequence of the sufficient condition for "2 ⇒ 1" used in Proposition 4.10). The expression (22) for the tangent cone to a smooth low-rank SDP domain appears to be new. Previously, it was only shown that a rank-deficient 2-critical point for the Burer-Monteiro lifted problem (Q) maps to a stationary point for the low-rank SDP (P) [27,Thm. 7]. The result of Corollary 4.13 shows that this is true for any 2-critical point.
Proof. For the first three bullet points, consider the lift ψ(y) = y 2 from N = R to Z = R ≥0 . Observe that it satisfies "local ⇒ local" everywhere, "1 ⇒ 1" at y ≠ 0 and satisfies the sufficient condition A y = T ψ(y) Z for "2 ⇒ 1" at y = 0. Indeed, at y ≠ 0 we have L y (ẏ) = 2yẏ which is an isomorphism of T y N = R and T y 2 Z = R, and at y = 0 we have L y = 0 and Q y (ẏ) = 2ẏ 2 by (8) so A y = Q y (ker L y ) + im L y = R ≥0 = T 0 Z. Propositions 4.7, 4.9, and 4.10 imply that the first three lifts satisfy "local ⇒ local" and "2 ⇒ 1" everywhere and give the claimed "if" directions for "1 ⇒ 1". The "only if" directions follow from Corollary 2.11.
For the Burer-Monteiro lift, consider the lift ψ(R) = RR ⊤ from N = R n×r to Z = S n 0 ∩R n×n ≤r . Proposition 3.4 shows that ψ satisfies "local ⇒local" and the sufficient condition A R = T RR ⊤ Z for "2 ⇒ 1" everywhere and "1 ⇒ 1" at points R of rank r. Therefore, Propositions 4.7, 4.9, and 4.10 imply that the Burer-Monteiro lift satisfies "local ⇒ local" and "2 ⇒ 1" everywhere and "1 ⇒ 1" at R if rank(R) = r. The "1 ⇒ 1" property does not hold at other points by Corollary 2.11. Proposition 4.10 gives the claimed expression for the tangent cones to X . Example 4.14. We can now revisit the example of computing the smallest eigenvalue of a symmetric matrix mentioned in Section 1. There, where U is the matrix of eigenvectors of a given symmetric matrix. Observe that ϕ(y) = (U ⊤ y) ⊙2 , which is the composition of the linear diffeomorphism y → U ⊤ y and the sphere to simplex lift from Example 4.4. We conclude that this lift satisfies "2 ⇒ 1" everywhere on M by Proposition 2.42 and Corollary 4.13. Therefore, any 2-critical point for (Q) maps to a stationary point for (P), for any cost f . If f is convex, then since X is also convex any stationary point for (P) is globally optimal. Thus, in this case any 2-critical point for (Q) is globally optimal and its nonconvexity is benign. This is well-known for the eigenvalue problem, which corresponds to the case of linear f .
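Example 4.14 can be illustrated numerically. The sketch below assumes that (Q) is the minimization of g(y) = y ⊤ Ay over the unit sphere and that (P) is the minimization of the linear cost f (x) = ⟨λ, x⟩ over the simplex, where λ collects the eigenvalues of A; these formulas, the random matrix A, and the variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
G = rng.standard_normal((n, n))
A = (G + G.T) / 2                        # an illustrative symmetric matrix
eigvals, U = np.linalg.eigh(A)           # A = U diag(eigvals) U^T, eigenvalues ascending

def phi(y):
    """Lift from the sphere to the simplex: phi(y) = (U^T y) squared entrywise."""
    return (U.T @ y) ** 2

def f(x):
    """Linear cost on the simplex (assumed form of (P))."""
    return eigvals @ x

def g(y):
    """Quadratic cost on the sphere (assumed form of (Q))."""
    return y @ A @ y

y = rng.standard_normal(n)
y /= np.linalg.norm(y)                   # a random point on the unit sphere
x = phi(y)

composition_gap = abs(g(y) - f(x))       # checks g = f o phi at y
simplex_gap = abs(x.sum() - 1.0)         # phi(y) has entries summing to one

# An eigenvector for the smallest eigenvalue maps to a vertex of the simplex,
# where the linear cost f attains its minimum.
y_star = U[:, 0]
optimal_gap = abs(g(y_star) - eigvals.min())

print(composition_gap, simplex_gap, optimal_gap)
```

Since f is linear here, its minimum over the simplex is attained at a vertex, matching the smallest eigenvalue of A.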

Conclusions and future work
For the pair of problems (Q) and (P), we characterized the properties the lift ϕ : M → X needs to satisfy in order to map desirable points of (Q) to desirable points of (P). We showed that global minima for (Q) always map to global minima for (P) (Theorem 2.6), and that local minima for (Q) map to local minima for (P) if and only if ϕ is open (Theorem 2.8).
We showed that 1-critical points for (Q) map to stationary points for (P) if and only if the differential of ϕ, viewed as a map from tangent spaces to M to tangent cones to X , is surjective (Theorem 2.10). This requires the tangent cones to X to be linear spaces. We then characterized when 2-critical points for (Q) map to stationary points for (P), and gave two sufficient conditions and a necessary condition that may be easier to check for some examples (Theorem 2.12). We explained several techniques to compute all quantities involved in these conditions in Section 2.5. Using our theory, we studied the above properties for a variety of lifts, including several lifts of low-rank matrices and tensors (Section 3) and the Burer-Monteiro lift for smooth SDPs (Corollary 4.13). We also proposed a systematic construction of lifts using fiber products that applies when X is given as a preimage under a smooth function (Section 4). We gave conditions under which it satisfies our desirable properties, assuming this construction yields a smooth manifold. In some cases, we can also obtain an expression for the tangent cone simultaneously with "2 ⇒ 1", as explained in Remark 3.5.
We end by listing several future directions suggested by this work.
(a) "k ⇒ 1" for general k: Several lifts of interest, notably tensor factorizations with more than two factors, do not satisfy "2 ⇒ 1". It would therefore be interesting to characterize "k ⇒ 1" for general k, i.e., when do k-critical points for (Q) map to stationary points for (P) for any k times differentiable cost f ? Do lifts that are multilinear in k arguments, such as order-k tensor lifts, satisfy "k ⇒ 1"?
What can be said about "k ⇒ ℓ" for ℓ > 1? Already for ℓ = 2, the second-order optimality conditions on X can be involved [42,Thm. 3.45]. On the positive side, if "1 ⇒ 1" holds at a preimage of a smooth point, then "k ⇒ k" holds there for all k ≥ 1, see Remark 2.25.
(b) Robust "k ⇒ 1": Algorithms run for finitely many iterations in practice, hence can only find approximate k-critical points for (Q). It is therefore important to characterize "robust" versions of "k ⇒ 1", guaranteeing that approximate k-critical points for (Q) map to approximate stationary points for (P). Note that if X lacks regularity, care is needed when defining approximate stationarity for (P), see [33].
(c) Obstructions to "local ⇒ local" and "k ⇒ 1": Can we find general obstructions to "local ⇒ local" and "k ⇒ 1"? More concretely, is there a lift for low-rank tensors satisfying "2 ⇒ 1"? Is there a lift for R^{m×n}_{≤r} satisfying "local ⇒ local"? Are there topological obstructions to the existence of lifts satisfying "local ⇒ local"? What properties of the singularities of X play a role?
(e) Sets X defined via lift: To verify "1 ⇒ 1" and "2 ⇒ 1" on concrete examples of X using the theory in this paper, we need to understand the tangent cones to X, which is often challenging. Many sets X encountered in applications are only defined implicitly via a lift ϕ : M → E. Examples of such sets X include the set of tensors admitting a certain factorization, the set of functions parameterized by a given neural network architecture, and the set of positions and orientations attainable by a robotic arm with a given joint configuration. Are there conditions for "k ⇒ 1" that can be checked using ϕ and M alone, without an explicit expression for the tangent cones to X?
(f) Dynamical systems on M and their image on X : This paper is focused on comparing properties of points on M and their images on X . In contrast, several applications are concerned with properties of entire trajectories of dynamical systems on M, and it may be interesting to compare these properties with their counterparts for the images of the trajectories on X . Examples of such comparisons include relating gradient flow on the weights of a neural network to gradient flow in function or measure spaces [6,5,26], and the "algorithmic equivalence" technique used in [3,21] to study mirror descent by showing that its continuous-time analogue is equivalent to gradient flow on a reparametrized problem.
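Direction (a) above notes that factorizations with more than two factors can fail "2 ⇒ 1". A toy scalar instance, assumed here purely for illustration and not taken from the paper, is the trilinear lift ϕ(y1, y2, y3) = y1·y2·y3 onto X = R with cost f(x) = x: the origin is 2-critical for g = f ∘ ϕ although x = 0 is not stationary for f. The sketch below checks this with finite differences.

```python
def phi(y):
    """Toy trilinear lift from R^3 onto X = R (illustrative assumption)."""
    return y[0] * y[1] * y[2]


def f(x):
    """Linear cost with f'(0) = 1, so x = 0 is not stationary on X = R."""
    return x


def g(y):
    return f(phi(y))


def gradient(fun, y, h=1e-3):
    """Central-difference gradient of fun at y."""
    grad = []
    for i in range(len(y)):
        yp, ym = list(y), list(y)
        yp[i] += h
        ym[i] -= h
        grad.append((fun(yp) - fun(ym)) / (2 * h))
    return grad


def hessian(fun, y, h=1e-3):
    """Central-difference Hessian of fun at y."""
    n = len(y)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for si in (1, -1):
                for sj in (1, -1):
                    z = list(y)
                    z[i] += si * h
                    z[j] += sj * h
                    total += si * sj * fun(z)
            H[i][j] = total / (4 * h * h)
    return H


origin = [0.0, 0.0, 0.0]
grad_g = gradient(g, origin)
hess_g = hessian(g, origin)
f_slope = (f(1e-3) - f(-1e-3)) / 2e-3  # derivative of f at x = 0
```

Both the gradient and the Hessian of g vanish at the origin, so second-order information cannot detect that f has a nonzero slope at x = 0; third-order information would be needed in this toy case.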

1. ϕ is open at y if ϕ(U) is a neighborhood of x in X for all neighborhoods U of y in M.
2. ϕ is approximately open at y if the closure of ϕ(U) in X is a neighborhood of x in X for all neighborhoods U of y in M.
3. ϕ satisfies the Subsequence Lifting Property (SLP) at y if for every sequence (x i ) i≥1 ⊆ X converging to x there exists a subsequence indexed by (i j ) j≥1 and a sequence (y i j ) j≥1 ⊆ M converging to y such that ϕ(y i j ) = x i j for all j ≥ 1.
4. ϕ satisfies the Approximate Subsequence Lifting Property (ASLP) at y if for every sequence (x_i)_{i≥1} ⊆ X converging to x and every sequence (ε_i)_{i≥1} ⊆ R_{>0} converging to 0 there exists a subsequence indexed by (i_j)_{j≥1} and a sequence (y_{i_j})_{j≥1} ⊆ M converging to y such that dist(ϕ(y_{i_j}), x_{i_j}) ≤ ε_{i_j} for all j ≥ 1.
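The Subsequence Lifting Property can be illustrated on the squaring lift ϕ(y) = y^2 from R onto R_{≥0} at y = 0. In the sketch below, the sequence x_i = 1/i and the lift y_i = sqrt(x_i) are illustrative choices; here the whole sequence lifts exactly, so no subsequence needs to be extracted.

```python
import math


def phi(y):
    """Squaring lift from R onto the nonnegative reals."""
    return y * y


def lift_sequence(xs):
    """Lift each x >= 0 to the preimage sqrt(x), which tends to 0 as x does."""
    return [math.sqrt(x) for x in xs]


# A sequence in X = R_{>=0} converging to x = phi(0) = 0.
xs = [1.0 / i for i in range(1, 1001)]
ys = lift_sequence(xs)

# Largest lifting error over the sequence, and the tail of the lifted sequence.
max_lift_error = max(abs(phi(y) - x) for x, y in zip(xs, ys))
tail = ys[-1]
```

The lifted sequence converges to y = 0 while reproducing the x_i up to rounding, which is the behavior SLP asks for at that point.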
Theorem A.2. If M is Hausdorff, second-countable and locally compact (all of which hold if M is a topological manifold), then the four properties of ϕ at y ∈ M in Definition A.1 are equivalent to each other and to the "local ⇒ local" property at y (Definition 2.7(a)).
Proof. We show that ASLP ⟹ approximate openness ⟹ openness ⟹ SLP ⟹ "local ⇒ local" ⟹ ASLP. Suppose ϕ satisfies ASLP at y. Suppose, for contradiction, that there exists a neighborhood U of y such that the closure cl(ϕ(U)) of ϕ(U) in X is not a neighborhood of x = ϕ(y). Then we can find a sequence (x_i)_{i≥1} ⊆ X \ cl(ϕ(U)) converging to x. Set ε_i = dist(x_i, cl(ϕ(U)))/2 > 0, which converges to 0 because x ∈ ϕ(U), and apply ASLP to find (after passing to a subsequence) a sequence (y_i)_{i≥1} ⊆ M such that y_i → y and dist(ϕ(y_i), x_i) ≤ ε_i. In particular, ϕ(y_i) ∉ ϕ(U), hence y_i ∉ U, for all i. However, because U is a neighborhood of y and y_i → y, we must have y_i ∈ U for all large i, a contradiction. Thus, cl(ϕ(U)) is a neighborhood of x, so ϕ is approximately open at y.
Suppose ϕ is approximately open at y, and let U be a neighborhood of y in M. Because M is locally compact, we can find a compact neighborhood V ⊆ U of y. Since ϕ is continuous and V is compact, we have that ϕ(V) is compact; since X is Hausdorff (it is a metric space), it follows that ϕ(V) is closed. Combining this with the fact that ϕ is approximately open at y, we deduce that ϕ(V), which equals its own closure, is a neighborhood of x. Since ϕ(U) ⊇ ϕ(V), we conclude that ϕ(U) is a neighborhood of x as well. Thus, ϕ is open at y.
Suppose ϕ is open at y, and (x_j)_{j≥1} ⊆ X converges to x = ϕ(y). Owing to the topological properties of M, there is a sequence of open neighborhoods U_i of y with compact closures such that cl(U_{i+1}) ⊆ U_i for all i and ∩_i U_i = {y}, where cl denotes closure. Moreover, because ϕ(U_i) is a neighborhood of x and x_j → x, there exists an index J(i) such that x_j ∈ ϕ(U_i) for all j ≥ J(i). After passing to a subsequence of (x_j), we may assume x_j ∈ ϕ(U_j) and pick y_j ∈ U_j satisfying x_j = ϕ(y_j). Because (y_j) is an infinite sequence contained in the compact set cl(U_1), after passing to a subsequence again we may assume that lim_j y_j exists. With i arbitrary, we have for all j > i that y_j ∈ U_j ⊆ U_{i+1}, hence that lim_j y_j ∈ cl(U_{i+1}) ⊆ U_i. This holds for all i, hence lim_j y_j ∈ ∩_i U_i = {y}. Thus, y = lim_i y_i and ϕ(y_i) = x_i, so ϕ satisfies SLP.
Suppose ϕ satisfies SLP at y. Let f : X → R be a cost function on X and g = f • ϕ. Suppose x = ϕ(y) is not a local minimum for f on X , that is, there exists a sequence (x i ) i≥1 ⊆ X converging to x such that f (x i ) < f (x) for all i. Applying SLP, after passing to a subsequence we can find a sequence (y i ) i≥1 ⊆ M converging to y such that ϕ(y i ) = x i . Since g(y i ) = f (x i ) < f (x) = g(y) and y i → y, we conclude that y is not a local minimum for g. By contrapositive, this shows that ϕ satisfies the "local ⇒ local" property at y.
For the last implication, we proceed by contrapositive once again. Suppose ϕ does not satisfy ASLP at y. Then, we can find sequences (x_i)_{i≥1} ⊆ X converging to x and (ε_i)_{i≥1} ⊆ R_{>0} converging to 0 such that no subsequence of (x_i) can be approximately lifted to M in the sense of ASLP. Let B̄(x, ε) = {x′ ∈ X : dist(x, x′) ≤ ε}. Notice that x = ϕ(y) ∉ B̄(x_i, ε_i) for all but finitely many indices i, as otherwise the constant sequence y_i ≡ y would give an approximate lift of a subsequence. Since x_i → x and ε_i → 0, after passing to a subsequence we may assume that the closed balls B̄(x_i, ε_i) are pairwise disjoint and none contain x. Define the following sum of smooth bump functions centered at the x_i otherwise. This is well defined because the balls B̄(x_i, ε_i) are disjoint. (As a side note, we remark that if X is a metric subspace of a Euclidean space E as in our general treatment, then f extends to a smooth function on E.) Note that x is not a local minimum for f since f(x_i) < 0 = f(x) for all i and x_i → x. However, y is a local minimum for g = f ∘ ϕ. Indeed, if there were a sequence (y_i) converging to y such that g(y_i) < g(y) = 0, then we would have ϕ(y_i) ∈ B̄(x_{n_i}, ε_{n_i}) for an infinite subsequence (n_i) with n_i → ∞ (since we must have ϕ(y_i) → x by continuity of ϕ), showing that (y_i) is an approximate lift of the subsequence (x_{n_i}): a contradiction to our assumptions about (x_i), (ε_i). Thus, ϕ does not satisfy the "local ⇒ local" property at y.
Because M is second-countable and locally compact, we can find a countable basis of open neighborhoods with compact closures {V_j}_{j≥1} for y. Since M is Hausdorff and {V_j} is a basis for y, we have ∩_{j=1}^∞ V_j = {y}. Indeed, if y′ ≠ y, then there exists a neighborhood of y not containing y′, and this neighborhood contains V_i for some i by definition of a local basis. By replacing V_i by ∩_{j=1}^i V_j (which preserves their intersection), we may assume V_{i+1} ⊆ V_i for all i. Let U_1 = V_1, which is an open neighborhood of y with compact closure by assumption. Having constructed U_1, . . . , U_i, use local compactness to find a compact neighborhood K_{i+1} ⊆ V_{i+1} ∩ U_i of y and let U_{i+1} be the interior of K_{i+1}. Then U_{i+1} is an open neighborhood of y by construction, and cl(U_{i+1}) ⊆ K_{i+1} ⊆ U_i, which also shows that the closure cl(U_{i+1}) is compact as a closed subset of the compact set K_{i+1}. Finally, we have {y} ⊆ ∩_i U_i ⊆ ∩_i V_i = {y}.

A y is a cone. The proof that B y is a cone is analogous.
Suppose w_j ∈ B_y and w_j → w. Let c j,i : We prove the last claim in part (b). Since 0 ∈ im L_y, we trivially have A_y + im L_y ⊇ A_y and similarly for B_y. Conversely, if w ∈ A_y, let c : To see that B_y ⊄ T_x X in general, consider B_y for the lift ϕ : given by ϕ(λ, u, v) = λuv^⊤ at y = (0, e_1, e_1) where e_1 = [1, 0]^⊤. We remark that "2 ⇒ 1" does not hold for this lift by Proposition 3.11. To see that T_x X ⊄ B_y in general, note that if T_x X ⊆ B_y then B_y^* ⊆ (T_x X)^* and hence "2 ⇒ 1" holds at y by Corollary 2.34(b). In particular, we have T_x X ⊄ B_y for the above example as well.

C Compositions and products of lifts

C.1 Compositions
Proof of Proposition 2.42. Recall that "local ⇒ local" is equivalent to openness by Theorem 2.8.
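(a) For the first statement, suppose ϕ ∘ ψ is open at z and let V ⊆ M be a neighborhood of y. Because ψ is continuous, ψ^{-1}(V) is a neighborhood of z, so (ϕ ∘ ψ)(ψ^{-1}(V)) is a neighborhood of ϕ(y). Since ϕ(V) ⊇ (ϕ ∘ ψ)(ψ^{-1}(V)), it follows that ϕ(V) is a neighborhood of ϕ(y) as well. This shows ϕ is open at y, and hence satisfies "local ⇒ local" there by Theorem 2.8.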
For the second statement, if U ⊆ N is a neighborhood of z, then ψ(U) ⊆ M is a neighborhood of y since ψ is open at z. If ϕ satisfies "local ⇒ local", or equivalently, is open at y, then ϕ(ψ(U)) ⊆ X is a neighborhood of ϕ(y), hence ϕ • ψ is open at z and satisfies "local ⇒ local" there by Theorem 2.8.
Conversely, suppose ψ is a submersion at z and ϕ satisfies "2 ⇒ 1" at y. Suppose z is 2-critical for f ∘ ϕ ∘ ψ. Because ψ is a submersion at z, for any curve c : I → M satisfying c(0) = y, there exists a potentially smaller interval I′ ⊆ I containing t = 0 in its interior and a curve c̃ : I′ → N such that ψ ∘ c̃ = c. For example, by [31, Thm. 4.26] there exists a smooth local section σ of ψ defined near y satisfying σ(y) = z, in which case we can set c̃ = σ ∘ c. Because z is 2-critical, it follows that y is 2-critical for f ∘ ϕ by (23). Since ϕ satisfies "2 ⇒ 1" at y, we conclude that ϕ(y) = ϕ(ψ(z)) is stationary for f, which shows ϕ ∘ ψ satisfies "2 ⇒ 1" at z.

k )) j≥1 ⊆ Z and τ j → 0 such that τ j > 0 satisfy For an example where the inclusion is strict, see [41, Prop. 6.41].
(b) Since products of open sets form a basis for the (product) topology on Z, we conclude that ψ is open at y = (y 1 , . . . , y k ) iff ψ(U 1 × · · · × U k ) = ψ 1 (U 1 ) × · · · × ψ k (U k ) is a neighborhood of y whenever U i ⊆ M i is a neighborhood of y i for all i. Because the interior of a product of sets in the product topology is the product of their interiors, we conclude that ψ 1 (U 1 ) × · · · × ψ k (U k ) is a neighborhood of y if and only if ψ i (U i ) is a neighborhood of y i for all i. This proves part (b) by Theorem 2.8.
(e) This follows from (a) and (d).

D Additional details and proofs for examples

D.1 Alternative computation of L and Q for desingularization lift
In Example 2.46, we computed L and Q for the desingularization lift of R m×n ≤r using charts. Alternatively, we can view M as a quotient manifold of the embedded submanifold with the quotient map π : M → M given by π(X, Y ) = (X, col(Y )), see [28, §2.3]. Since quotient maps are submersions, Proposition 2.42 shows that to check our desirable properties for ϕ at (X, S), it suffices to check them for ϕ = ϕ • π at any (X, Y ) such that π(X, Y ) = (X, S). Since M is an embedded submanifold of a linear space E = R m×n × R n×(n−r) , we can use the corresponding techniques from Section 2.5 to compute L and Q and check our properties. We now carry this out to illustrate the difference between the expressions obtained using different techniques. Therefore, we get from (11) that Both maps can be restricted to the horizontal space of π by Remark 2.43, which is shown in [28, §2.3] to be Indeed, the above map is clearly linear, maps H (X,Y ) to itself by definition, and can be easily verified to be self-adjoint. Unfortunately, orthogonal projection onto H (X,Y ) has a complicated expression [28,Thm. 5]. In contrast, we get a simple explicit formula using the chart-based formalism in (14). Incidentally, using the above expression for the Riemannian Hessian we can obtain a different expression for Q ϕ (X,Y ) , which is again somewhat more complicated than (13) obtained using charts: By Remark 2.39, the expression for Q ϕ (X,Y ) is only defined up to im L ϕ (X,Y ) . The difference between the two expressions lies in im L ϕ (X,Y ) by the Woodbury matrix identity. This example shows that verification of "2 ⇒ 1" is simplified by using the right approach.

D.2 Rank factorization lift
Proof of Proposition 3.6. For "local ⇒ local", we follow [32, Prop. 2.34] and show that SLP holds at every "balanced" factorization (L, R), i.e., one satisfying rank(L) = rank(R) = rank(LR^⊤). 8 Let X = LR^⊤ and suppose (X_i)_{i∈N} ⊆ X converges to X. Let X_i = U_i Σ_i V_i^⊤ be a size-r SVD of X_i, where Σ_i ∈ R^{r×r} is diagonal with the first r singular values of X_i (possibly including zeros) on the diagonal.
. Since X i are bounded, after passing to a subsequence we may assume that the limit (L ∞ , R ∞ ) = lim i (L i , R i ) exists. By continuity of ϕ, we must have This implies rank(L ∞ ) = rank(R ∞ ) = rank(X) by considering the polar decompositions of L ∞ , R ∞ , see [32,Lem. 2.33]. By [32,Lem. 2.32], there exists J ∈ GL(r) satisfying L = L ∞ J and R = R ∞ J −⊤ . Therefore, (L i J, R i J −⊤ ) converges to (L, R) and is a lift of X i , showing that SLP holds. We conclude that ϕ satisfies "local ⇒ local" at (L, R) such that rank(L) = rank(R) = rank(LR ⊤ ) by Theorem A.2.
Conversely, we show that if rank(L), rank(R) and rank(LR ⊤ ) are not all equal then "local ⇒ local" does not hold at (L, R) by constructing a sequence (X i ) converging to X = LR ⊤ no subsequence of which can be lifted to a sequence converging to (L, R). It always holds that rank(LR ⊤ ) ≤ min{rank(L), rank(R)}. Assume rank(X) < rank(L) (in particular, rank(X) < r). The case rank(X) < rank(R) is similar. Define for L ⊥ ∈ R m×r and R ⊥ ∈ R n×r satisfying L ⊤ ⊥ L = R ⊤ ⊥ R = 0 and rank(L ⊥ ) = rank(R ⊥ ) = r − rank(X). Note that rank(L i ) = rank(R i ) = r for all i. We construct L ⊥ , R ⊥ as follows. Let Proj col(X) L = U L Σ L V ⊤ L be a thin SVD for Proj col(X) L, whose rank is rank(X) (because X = LR ⊤ implies col(X) ⊆ col(L)). Since rank(X) < r, we can find W L ∈ St(m, r − rank(X)) and Q L ∈ St(r, r − rank(X)) such that W ⊤ L U L = Q ⊤ L V L = 0, and note that L ⊥ = W L Q ⊤ L satisfies the desired properties. The construction of R ⊥ is analogous. Define X i = L i R ⊤ i which converges to X as i → ∞, and suppose (L i j ,R i j ) is a lift of a subsequence of (X i ) converging to (L, R). Because rank(L i ) = rank(R i ) = r, we also have rank(X i ) = r, and there exist J i j ∈ GL(r) such that (L i j ,R i j ) = (L i j J i j , R i j J −⊤ i j ), see [32,Lem. 2.32]. Then whose rank is rank(L) > rank(X).
But we also have L ⊤ L i j J i j = L ⊤ Proj col(X) LJ i j , whose rank is rank(X), so rank(lim j L ⊤ L i j J i j ) ≤ rank(X) since bounded-rank matrices form a closed set. This is a contradiction. Thus, no subsequence of (X i ) can be lifted, showing that ϕ does not satisfy "local ⇒ local" at (L, R) by Theorem A.2. For an explicit example of a cost f and point (L, R) which is a local minimum for (Q) but such that LR ⊤ is not a local minimum for (P), see [32,Prop. 2.30].
For "1 ⇒ 1" and "2 ⇒ 1", note that M is a linear space, hence (8) gives for some permutation matrix Π ∈ R n×n . Proposition 2.42 implies that our desirable properties hold at (Z, W ) iff they hold at (X, S), so it suffices to consider the composed lift ϕ(Z, W ) = −ZW, Z Π =: X, defined on the linear space R m×r × R r×(n−r) . For this lift, we computed L (Z,W ) and Q (Z, W ) in (13): Suppose rank(X) = r. Note that col(X) = col(Z), so rank(Z) = r. If L (Z,W ) (Ż,Ẇ ) = 0, thenŻ = 0 and ZẆ = 0. This impliesẆ = 0 since Z has full column rank. Thus, L (Z,W ) is injective, but since its domain has dimension (m + n − r)r = dim R m×n =r , we conclude that it is an isomorphism. Thus, "1 ⇒ 1" holds at (Z, W ) by Theorem 2.10. If rank(X) < r then T X X is not a linear space [25, Thm. 2.2], hence "1 ⇒ 1" cannot hold for any lift by Corollary 2.11.
Suppose rank(X) < r. We show "2 ⇒ 1" holds at (Z, W ) by showing that B * (Z,W ) ⊆ (T X X ) * . To that end, note that rank(Z) = rank(X) < r, so there is a unit vector w ∈ R r satisfying Zw = 0. Let u ∈ R m and v ∈ R n−r be arbitrary. For any i ∈ N, We conclude that To characterize (im L (Z,W ) ) ⊥ , observe that V = V 1 , V 2 Π ∈ R m×n with V 1 ∈ R m×(n−r) satisfies V ∈ (im L (Z,W ) ) ⊥ iff the following holds for all (Ż,Ẇ ) ∈ R m×r × R r×(n−r) : This is equivalent to This shows "2 ⇒ 1" holds at (Z, W ).

D.4 SVD lift and its modification
Proof of Proposition 3.8. We first analyze the SVD lift, and then its modification. SVD lift: Considering the lift (17). Fix (U, σ, V ) ∈ M such that the entries of |σ| are all nonzero but not distinct. Choose k < ℓ such that |σ k | = |σ ℓ | and let X = ϕ(U, σ, V ) = Udiag(σ)V ⊤ . We show "local ⇒ local" does not hold at (U, σ, V ) by constructing a sequence converging to X such that no subsequence of it can be lifted to a sequence converging to (U, σ, V ). Choose α i ∈ R r converging to zero such that |σ + α i | has distinct entries which are all nonzero. Let Q (k,ℓ) ∈ O(r) be the Givens rotation matrix rotating by π/4 in (k, ℓ) plane, given explicitly by and let U (k,ℓ) = Usign(σ)Q (k,ℓ) sign(σ) ∈ St(m, r) and V (k,ℓ) = V Q (k,ℓ) ∈ St(n, r). Define whose limit is The third equality above follows since |σ k | = |σ ℓ | so the corresponding submatrix of diag(σ) is a multiple of the identity, and Q (k,ℓ) is orthogonal and acts by the identity outside of that submatrix. Suppose (U i j , σ i j , V i j ) is a lift of some subsequence of (X i ) converging to (U, σ, V ). The singular values of X i j are the entries of |σ i j |, which are also the entries of |σ +α i j |. Since all these singular values are distinct, U i j and V i j must contain the kth and ℓth columns of U (k,ℓ) and V (k,ℓ) up to sign, since the singular vectors are unique up to sign [48,Thm. 4.1]. But then it cannot happen that (U i j , V i j ) → (U, V ) by construction of Q (k,ℓ) , a contradiction. Thus, no subsequence of (X i ) can be lifted to a sequence on M converging to (U, σ, V ), showing that "local ⇒ local" does not hold there. Proposition 2.24 together with the fact that X smth = R m×n =r implies that "1 ⇒ 1" does not hold at such (U, σ, V ) either. Next, fix (U, σ, V ) ∈ M such that σ k = 0 for some k, and let X = ϕ(U, σ, V ). Since rank(X) < r < min(m, n), there exist unit vectors u ′ k ∈ R m and v ′ k ∈ R n such that U ⊤ u ′ k = 0 and V ⊤ v ′ k = 0. 
Let U (k) , V (k) be obtained from U, V by replacing their kth columns with u ′ k , v ′ k , respectively, and let α i ∈ R r be a sequence converging to zero such that |σ + α i | has distinct entries that are all nonzero. Define X i = ϕ(U (k) , σ + α i , V (k) ) which converge to X as i → ∞. If (U i j , σ i j , V i j ) is a lift of a subsequence of (X i ) converging to (U, σ, V ), then one of the columns of U i j , V i j must be u ′ k , v ′ k up to sign because |σ i j | has distinct entries which are the singular values of X i j . This contradicts U i j → U and V i j → V . Thus, "local ⇒ local" does not hold at such (U, σ, V ). Corollary 2.11 shows that "1 ⇒ 1" does not hold there either since T X X is not a linear space.
Finally, fix (U, σ, V) ∈ M such that all the entries of |σ| are nonzero and distinct. We verify that "1 ⇒ 1" holds there using Theorem 2.10 by showing im L_{(U,σ,V)} = T_X X. Since M is a product of embedded submanifolds of linear spaces, we have from (11) that where (U̇, σ̇, V̇) ∈ T_{(U,σ,V)} M. Let U_⊥ ∈ St(m, m−r) and V_⊥ ∈ St(n, n−r) satisfy U^⊤ U_⊥ = 0 and V^⊤ V_⊥ = 0. By [8, §7.3], we have U̇ ∈ T_U St(m, r) and V̇ ∈ T_V St(n, r) if and only if where Ω_u, Ω_v ∈ Skew(r) := {Ω ∈ R^{r×r} : Ω^⊤ + Ω = 0} and B_u ∈ R^{(m−r)×r}, B_v ∈ R^{(n−r)×r} are arbitrary. Using this parametrization, Since T_X X is a linear space and dim T_{(U,σ,V)} M = dim T_X X, we have im L_{(U,σ,V)} = T_X X if and only if L_{(U,σ,V)} is injective. Suppose therefore that L_{(U,σ,V)}(U̇, σ̇, V̇) = 0. By (26) this is equivalent to where σ̇ = 0 follows by considering the diagonal of the top left block in (26). The fourth equality in (27) together with the skew-symmetry of Ω_u, Ω_v gives for all i, j that Since |σ_i| ≠ |σ_j| whenever i ≠ j, we get (Ω_v)_{i,j} = 0 and (Ω_u)_{i,j} = 0 for all i ≠ j, hence Ω_u = Ω_v = 0. We conclude that (U̇, σ̇, V̇) = 0, so L_{(U,σ,V)} is injective and "1 ⇒ 1" holds. By Proposition 2.24, "local ⇒ local" holds there as well. All cases have been checked, so the first bullet in Proposition 3.8 is proved.

Modified SVD lift: We now consider the modified SVD lift St(m, r) × S^r × St(n, r) → R^{m×n}_{≤r} defined by ϕ(U, M, V) = UMV^⊤. Fix (U, M, V) ∈ M such that rank(M) = r and λ_k(M) + λ_ℓ(M) = 0 for some k < ℓ. We show "local ⇒ local" does not hold at (U, M, V) by constructing a sequence converging to X = ϕ(U, M, V) such that no subsequence of it can be lifted to a sequence converging to (U, M, V). Let α_i ∈ R^r be a sequence converging to zero such that |λ(M) + α_i| has distinct entries that are all nonzero. Let M = W diag(λ(M)) W^⊤ be an eigendecomposition of M, where W ∈ O(r).
Define U (k,ℓ) = UW T (k,ℓ) S (k,ℓ) W ⊤ ∈ St(m, r), V (k,ℓ) = V W T (k,ℓ) W ⊤ ∈ St(n, r), where T (k,ℓ) ∈ O(r) is the permutation interchanging the kth and ℓth entries of a vector while fixing all the others, and S (k,ℓ) ∈ O(r) flips the signs of these entries (so it's a diagonal matrix with all 1's on the diagonal except for the kth and ℓth entries which are −1). Note that T (k,ℓ) and S (k,ℓ) are symmetric and commute. Let where the first equality on the second line follows because S (k,ℓ) flips the signs of λ k (M) and of λ ℓ (M) = −λ k (M), and conjugation by T (k,ℓ) interchanges them again. Suppose (U i j , M i j , V i j ) is a lift of a subsequence of (X i ) converging to (U, M, V ). The singular values of X i j are the entries |λ(M i j )|, which are also the entries of |λ(M) + α i j | (possibly permuted). Combining this with λ(M i j ) → λ(M) (since M i j → M), α i j → 0, and the fact that λ(M) has no zero entries, we further get that the entries of λ(M i j ) are equal to the entries of λ(M) + α i j for all large j. After passing to a subsequence, we may assume that this equality holds for all j. Let M i j = W i j diag(λ(M) + α i j )W ⊤ i j be an eigendecomposition of M i j . Since O(r) is compact, we may also assume after further passing to a subsequence that lim exists. At this point, we have two SVDs of X i j , namely Because the singular values |λ(M) + α i j | of X i j are distinct, its singular vectors are unique up to sign [48,Thm. 4.1]. Specifically, there exists S i j ∈ diag({±1} r ) satisfying where in the first line we used the fact that diagonal matrices commute. Because S i j takes values in a finite set, after passing to a subsequence again we may assume S i j = S is fixed. Then U (k,ℓ) = U i j W i j SW ⊤ and V (k,ℓ) = V i j W i j SW ⊤ for all j, and taking j → ∞ we conclude that U (k,ℓ) = U W SW ⊤ and V (k,ℓ) = V W SW ⊤ . Equating this to (28), we obtain U W SW ⊤ = UW T (k,ℓ) S (k,ℓ) W ⊤ and V W SW ⊤ = V W T (k,ℓ) W ⊤ .
Rearranging gives W ⊤ W S = T (k,ℓ) S (k,ℓ) and W ⊤ W S = T (k,ℓ) so in fact, T (k,ℓ) S (k,ℓ) = T (k,ℓ) , a contradiction. Thus, no subsequence of (X i ) can be lifted, so the lift does not satisfy "local ⇒ local" at (U, M, V ). By Proposition 2.24 and the fact that X ∈ R m×n =r = X smth , we conclude that "1 ⇒ 1" is not satisfied at (U, M, V ) either. Now fix (U, M, V ) such that λ k (M) = 0 for some k, and let X, W be as above. Since r < min{m, n}, there exist unit vectors u ′ k ∈ R m and v ′ k ∈ R n such that (UW ) ⊤ u ′ k = 0 and (V W ) ⊤ v ′ k = 0. Let Y ∈ O(m) and Z ∈ O(n) send the kth columns of UW and V W to u ′ k and v ′ k , respectively, and act by the identity on their orthogonal complements. Let α i ∈ R r converge to zero such that |λ(M) + α i | are distinct and nonzero. Define X i = ϕ(Y U, M + α i , ZV ), which converge to Y UM(ZV ) ⊤ = X. Suppose (U i j , M i j , V i j ) is a lift of a subsequence of (X i ) converging to (U, M, V ). Let M i j = W i j diag(λ(M) + α i )W ⊤ i j be an eigendecomposition. After passing to a subsequence, we may assume W i j → W in O(r). Because the singular vectors of X i j are unique up to sign, there exists S i j ∈ diag({±1} r ) satisfying and similarly for ZV . Because S i j takes values in a finite set, after passing to a subsequence again we may assume S i j = S is fixed for all j, in which case we get This is a contradiction since col(Y U) = col(U) = col(U W SW ⊤ ) by construction of Y . Thus, no subsequence of (X i ) can be lifted so "local ⇒local" does not hold at such (U, M, V ). Since T X X is not a linear space, Corollary 2.11 shows that "1 ⇒ 1" does not hold there either. WritingU,V as in (25), we get similarly to (26) that Since rank(M) = r, we have col(U) = col(X) and col(V ) = col(X ⊤ ). By [43,Thm. 3.1], we have Since λ i (M) + λ j (M) = 0 for all i, j, for anyẊ 1 ∈ R r×r we can pick Ω ∈ Skew(r) such that ΩM + MΩ = skew(Ẋ 1 ) =Ẋ 1 −Ẋ ⊤
(b) Note that M is diffeomorphic to with the same diffeomorphism as in part (a). As in (a), it suffices to consider the composed lift mapping N → X . If r < n, let u ∈ span{y 1 , . . . , y r } ⊥ have unit norm. Then u ⊗d ∈ (span{y 1 , . . . , y r } ⊥ ) ⊗d = col(Y 1 ) ⊥ ⊗. . .⊗col(Y d ) ⊥ . However, u ⊗d / ∈ (T X X ) * if λ i = 0 for some i. Indeed, in that case we have X − tu ⊗d ∈ X for all t ≥ 0, so −u ⊗d ∈ T X X and u ⊗d , −u ⊗d = −1. Thus, the result follows from Proposition 3.11.