Abstract
This paper aims to propose an efficient numerical method for the most challenging problem, known as robust Euclidean embedding (REE), in the family of multi-dimensional scaling (MDS). The problem is notoriously nonsmooth and nonconvex, and its objective is non-Lipschitzian. We first explain why the semidefinite programming (SDP) relaxations and the Euclidean distance matrix (EDM) approach, popular for other types of problems in the MDS family, fail to provide a viable method for this problem. We then propose a penalized REE (PREE), which can be economically majorized. We show that the majorized problem is convex provided that the penalty parameter is above a certain threshold. Moreover, it has a closed-form solution, resulting in an efficient algorithm dubbed PREEEDM (for Penalized REE via EDM optimization). We prove, among other results, that PREEEDM converges to a stationary point of PREE, which is also an approximate critical point of REE. Finally, the efficiency of PREEEDM is compared with several state-of-the-art methods, including SDP and EDM solvers, on a large number of test problems from sensor network localization and molecular conformation.
1 Introduction
This paper aims to propose an efficient numerical method for the most challenging problem in the Multi-Dimensional Scaling (MDS) family, which has found many applications in social and engineering sciences [6, 10]. The problem is known as the Robust Euclidean Embedding, a term borrowed from [8]. In the following, we first describe the problem and its three variants. We then explain our approach and main contribution. We will postpone the relevant literature review to the next section in order to shorten the introduction.
1.1 Problem description
The problem can be described as follows. Suppose we are given some dissimilarity measurements (e.g., noisy distances), collectively denoted as \(\delta _{ij}\), for some pairs (i, j) among n items. The problem is to find a set of n points \(\mathbf{x}_i \in \mathfrak {R}^r\), \(i=1,\ldots , n\) such that
where \(\Vert \mathbf{x}\Vert \) is the Euclidean norm (i.e., the \(\ell _2\) norm) in \(\mathfrak {R}^r\) and \(\mathcal{E}\) is the set of pairs (i, j) whose dissimilarities \(\delta _{ij} >0\) are known (\(\mathcal{E}\) can be thought of as the edge set if we treat \(\delta _{ij}\) as a weighted edge distance between vertex i and vertex j, resulting in a weighted graph). Throughout, we use “\(:=\)” or “\(=:\)” to mean “define”. The space \(\mathfrak {R}^r\) is called an embedding space and is most interesting when r is small (e.g., \(r=2, 3\) for data visualization). One may also try to find a set of embedding points such that:
A great deal of effort has been made to seek the best approximation from (1) or (2). The most robust criterion to quantify the best approximation is the Robust Euclidean Embedding (REE) defined by
where \(W_{ij} >0\) if \(\delta _{ij} >0\) and \(W_{ij} \ge 0\) otherwise (\(W_{ij}\) can be treated as a weight for the importance of \(\delta _{ij}\)), and \(X := [\mathbf{x}_1, \ldots , \mathbf{x}_n]\) with each \(\mathbf{x}_i\) being a column vector. In [1, 8], Problem (3) was referred to as a robust variant of MDS and denoted rMDS. We will reserve rMDS for the Robust MDS problem:
The name rMDS is more appropriate for problem (4) because it involves the squared distances \(D_{ij}\), which are used by the classical MDS [22, 29, 43, 49, 53]. The preceding two problems are robust because of the robustness of the \(\ell _1\) norm used to quantify the errors [31, Sect. IV].
When the least-squares criterion is applied to (1), we have the popular model known as Kruskal’s stress [30] minimization:
Similarly, when the least-squares criterion is applied to (2), we obtain the so-called squared stress [6]:
In many applications such as molecular conformation [21], lower and upper bounds data on the distances can also be collected:
where \(0 \le L_{ij} \le U_{ij}\). In applications such as nonlinear dimensionality reduction [47] and sensor network localization [44, 54], the upper bounds \(U_{ij}\) can be computed by the shortest path distances and the lower bounds \(L_{ij}\) are simply set to zero.
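For concreteness, the shortest-path construction of the upper bounds \(U_{ij}\) can be sketched as follows. This is a minimal Floyd–Warshall implementation; the function name and the convention that a zero entry of `delta` means “dissimilarity unknown” are our own.

```python
import numpy as np

def shortest_path_upper_bounds(delta):
    """Floyd-Warshall shortest paths on the weighted graph whose edge set E
    consists of the pairs (i, j) with delta[i, j] > 0.  The graph distance
    U[i, j] is an upper bound on the embedding distance d_ij."""
    n = delta.shape[0]
    U = np.where(delta > 0, delta, np.inf)   # unknown pairs start at infinity
    np.fill_diagonal(U, 0.0)
    for k in range(n):                       # relax all paths through vertex k
        U = np.minimum(U, U[:, [k]] + U[[k], :])
    return U
```

For example, on a path graph with \(\delta _{01}=1\) and \(\delta _{12}=2\), the unknown pair (0, 2) receives the upper bound \(U_{02}=3\).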
According to [8, Sect. 5.1], all of those problems are NP-hard. However, some problems are computationally more “difficult” to solve than others. The most challenging one, which is also the main focus of this paper, is the problem (3) with/without the constraint (7). The difficulty comes from the nonsmooth \(\ell _1\) norm and the use of the (unsquared) distance terms \(d_{ij}\). All other problems either involve the squared distances \(D_{ij}\) or the squared \(\ell _2\) norm, which makes them “easier” to approximate. We will explain the reasons in the literature review part.
In contrast to the other three problems, efficient methods for the REE problem (3) are lacking. One of the earliest computational papers that discuss this problem is Heiser [23], which was followed up by [28], where the Huber smoothing function was used to approximate the \(\ell _1\) norm near zero with a majorization technique. It was emphasized in [28] that “the function is not differentiable at its minimum and is hard to majorize, leading to a degeneracy that makes the problem numerically unstable”. Another important method is the PlaceCenter (PC for short) algorithm studied in [1]. We will compare with it in the numerical part. The difficulty in solving (3) is also well illustrated by a sophisticated Semi-Definite Programming (SDP) approach in [35, Sect. IV] (see the literature review part). We now describe the approach proposed in this paper.
1.2 Our approach and main contributions
Our approach makes heavy use of the concept of a Euclidean Distance Matrix (EDM). We need some notation. Let \(\mathcal{S}^n\) denote the space of all \(n \times n\) symmetric matrices, endowed with the standard inner product. The induced norm is the Frobenius norm, denoted by \(\Vert A\Vert \) for \(A \in \mathcal{S}^n\). The (i, j)th element of \(A \in \mathcal{S}^n\) is often written as \(A_{ij}\). Let \(\mathcal{S}^n_+\) be the cone of positive semidefinite matrices in \(\mathcal{S}^n\); we write \(A \succeq 0\) for \(A \in \mathcal{S}^n_+\). A matrix \(D \in \mathcal{S}^n\) is called an EDM if there exists a set of points \(\mathbf{x}_i \in \mathfrak {R}^r\), \(i=1,2,\ldots , n\), such that the (i, j)th element of D is given by \(D_{ij} := \Vert \mathbf{x}_i - \mathbf{x}_j \Vert ^2\), \(i, j=1, \ldots , n\). The smallest such dimension r is called the embedding dimension of D and \(r={\mathrm {rank}}(JDJ)\), where \(J := I - \frac{1}{n} \mathbf{1}\mathbf{1}^T\) is known as the centring matrix, with I being the identity matrix in \(\mathcal{S}^n\) and \(\mathbf{1}\) being the vector of all ones in \(\mathfrak {R}^n\). We use \(\mathcal{D}^n\) to denote the set of all Euclidean distance matrices of size \(n \times n\).
A very useful characterization for \(D \in \mathcal{D}^n\) [22, 49] is
This result shows that \(\mathcal{D}^n\) is a closed and convex cone. Moreover, a set of embedding points are generated by the classical MDS method [22, 43, 49, 53]:
where the eigenvalues \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _r >0\) and the corresponding eigenvectors \(\mathbf{p}_1, \mathbf{p}_2, \ldots , \mathbf{p}_r\) are from the eigen-decomoposition:
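The classical MDS construction (9)–(10) is short in code. The following sketch (function name ours) recovers an r-dimensional point configuration from an EDM D via the eigen-decomposition of \(-JDJ/2\):

```python
import numpy as np

def classical_mds(D, r):
    """Classical MDS: given an EDM D (D[i, j] = ||x_i - x_j||^2), recover
    r-dimensional points from the eigen-decomposition of B = -J D J / 2."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centring matrix J
    B = -0.5 * J @ D @ J                          # doubly centred Gram matrix
    lam, P = np.linalg.eigh(B)                    # eigh returns ascending order
    lam, P = lam[::-1][:r], P[:, ::-1][:, :r]     # keep the r largest pairs
    return (P * np.sqrt(np.maximum(lam, 0.0))).T  # columns x_1, ..., x_n
```

The recovered points reproduce D up to a rigid motion, since pairwise distances are invariant under translations and rotations.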
with \(r = {\mathrm {rank}}(JDJ)\). Therefore, the REE problem (3) with the constraint (7) can be reformulated in terms of EDM as
where “\(\circ \)” is the Hadamard product for matrices (i.e., \(A \circ B = (A_{ij}B_{ij})\)), \(\sqrt{D}\) is the elementwise square root of D, \(\varDelta _{ij} := \delta _{ij}\), and \(\Vert \cdot \Vert _1\) is the \(\ell _1\) norm. Once we have obtained an optimal solution of (11), we use (9) and (10) to generate the required embedding points.
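In code, the objective of (11) is one line; the following sketch (function name ours) evaluates \(f(D) = \Vert W \circ (\sqrt{D} - \varDelta )\Vert _1\), with the \(\ell _1\) norm taken as the sum of absolute values of all entries:

```python
import numpy as np

def ree_objective(D, Delta, W):
    """f(D) = || W o (sqrt(D) - Delta) ||_1, with sqrt(D) the elementwise
    square root and the l1 norm summing absolute values of all entries."""
    return np.abs(W * (np.sqrt(D) - Delta)).sum()
```

For example, two points at distance 2 (so \(D_{12}=4\)) with \(\delta _{12}=1\) and unit weights give \(f(D) = 2\), one unit of error for each of the two off-diagonal entries.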
The reformulation captures well the four difficulties in solving the REE problem (3).
-
(i)
The objective function f(D) is not convex. The term \(| \sqrt{D_{ij}} - \delta _{ij} |\) is convex when \(\delta _{ij}^2 > D_{ij}\) and concave otherwise.
-
(ii)
The objective function is nonsmooth. It is not differentiable at certain points due to the \(\ell _1\) norm and the square root operation involved.
-
(iii)
The objective function is not Lipschitzian. The Lipschitz constant goes to infinity as \(D_{ij}\) goes to zero. The implication is that the subdifferential of the objective function [42, Def. 8.3] may be unbounded. This would create a major obstacle to establishing any convergence results of iterative algorithms for (11).
-
(iv)
The rank constraint is not convex and is hard to approximate. This is a common issue for any optimization problem with a rank constraint.
We note that no matter what reformulation one may use for (3), those four difficulties would appear in different forms and will not go away. We also note that the other three problems, when reformulated in terms of EDM, all have a convex objective function. This distinctive feature alone makes the problem (11) the most challenging one to solve.
Existing numerical experiments have shown that the MDS embedding (9) and (10) works well as long as D is close to a true EDM. A typical example is when the data sits on a lower-dimensional manifold [47]. Motivated by this, we are going to generate an approximate EDM instead of a true EDM in our algorithm. It follows from (8) that (also see [32, Theorem A]):
where \(\mathcal{K}^n_+\) is known to be the conditionally positive semidefinite cone:
and \(\mathbf{1}^\perp \) is the subspace in \(\mathfrak {R}^n\) orthogonal to \(\mathbf{1}\). The diagonal constraint in (12) can be integrated into the set \(\mathcal{B}\) with the choice \(L_{ii} = U_{ii} = 0\) for \(i=1,\ldots , n\). We combine \(\mathcal{K}^n_+\) with the rank constraint into the set \(\mathcal{K}^n_+(r)\):
We call it the conditionally positive semidefinite cone with the rank-r cut. Consequently, the constraints in (11) become \(-D \in \mathcal{K}^n_+(r)\) and \(D \in \mathcal{B}\).
Next, we quantify the feasibility of \(-D\) belonging to \(\mathcal{K}^n_+(r)\) as follows. Let \(\varPi _{{\mathcal{K}^n_+(r)}}^B(A)\) be the set of all nearest points in \({\mathcal{K}^n_+(r)}\) from a given matrix \(A \in \mathcal{S}^n\). That is
Since \({\mathcal{K}^n_+(r)}\) is not convex (unless \(r \ge n-1\)), the projection \(\varPi _{\mathcal{K}^n_+(r)}^B(A)\) is a set instead of a single point. We let \(\varPi _{\mathcal{K}^n_+(r)}(A)\) be any element in \(\varPi _{\mathcal{K}^n_+(r)}^B(A)\) and define the function
Since g(A) is just half of the squared distance from \((-A)\) to \({\mathcal{K}^n_+(r)}\), it does not depend on which element \(\varPi _{\mathcal{K}^n_+(r)}(A)\) is used. It is easy to see that
Hence, the problem (11) is equivalent to
This is a classical constrained optimization problem with an equality constraint and a simple box constraint. Therefore, the quadratic penalty method [34, Chp. 17] can be applied to get the following problem:
where \(\rho >0\) is the penalty parameter. We refer to this problem as the penalized REE problem (PREE).
The quadratic penalty method is often used in practice [34, P. 497]. In fact, it is particularly suitable for (11) because it overcomes all four difficulties discussed above. We will need two more important tools to help us efficiently solve the penalty problem (16). One is the majorization technique, which has recently become very popular in engineering sciences [46] (also see [6, Chp. 8] for its extensive use in MDS). Suppose we have the current iterate \(D^k\). We construct a majorization function \(g_m(D, D^k)\) for g(D) at \(D^k\) such that
The majorization is constructed in such a way that it is easier to solve the majorized problem:
It can be seen that
Hence, the algorithm generates a sequence \(\{D^k\}\) that is nonincreasing in \(f_\rho (D)\). Since \(f_\rho (D)\) is bounded below by 0, the sequence of function values \(\{f_\rho (D^k)\}\) converges. However, we are more concerned with whether, and to where, the iterate sequence \(\{ D^k\}\) converges. The second concern is how the subproblem (18) is to be solved. This brings in the second technique, which is to solve the following one-dimensional problem:
for given \(\delta >0\) and \(0 \le a \le b\). We will show that the solution of this problem leads to a closed-form solution of (18).
Since our method is for the Penalized REE by EDM optimization, we call it PREEEDM. The major contribution of this paper is to make the outlined solution procedure watertight. In particular, we will investigate the relationship between the PREE problem (16) and the original problem (11) in terms of \(\epsilon \)-optimality (Proposition 1). We will also show that the majorization function \(g_m(\cdot , \cdot )\) can be economically constructed (Sect. 3.2). Moreover, the majorized function \(f_\rho ^k(D)\) is guaranteed to be convex provided that the penalty parameter is above a certain threshold, and the subdifferentials along the generated sequence are bounded (Proposition 4). Furthermore, each majorization subproblem has a closed-form solution (Theorem 1). We are also able to prove that any accumulation point of the sequence generated by PREEEDM is a stationary point of (16), which is also an approximate stationary point of (11) (Theorem 2). Built on its solid convergence results and simple implementation, PREEEDM is demonstrated to be comparable to six state-of-the-art software packages in terms of solution quality, and to outperform them in terms of computational time, on a large number of test problems from sensor network localization and molecular conformation.
1.3 Organization of the paper
In the next section, we give a selective literature review, mainly on the Semi-Definite Programming (SDP) and EDM approaches. In Sect. 3, we introduce some necessary background and prove a key technical result (Lemma 1) that is crucial to the convexity of the majorization subproblem. We study the relationship between the penalized REE (16) and the original REE in Sect. 4, where the majorized subproblem is shown to have a closed-form solution. In Sect. 5, we provide a complete set of convergence results for the proposed PREEEDM algorithm. Numerical experiments are included in Sect. 6. The paper concludes in Sect. 7. All proofs except that of Theorem 2 can be found in the “Appendix”.
2 Literature review
One can find a thorough review of all four problems in [17] by France and Carroll, mainly from the perspective of applications. One can also find valuable discussion of some of those problems in [2]. The starting point of our review is therefore that those problems have their own reasons to be studied, and we are more concerned with how they can be efficiently solved.
Most existing algorithms can be put into three groups. The first group consists of alternating coordinate descent methods, whose main variables are \(\mathbf{x}_i\), \(i=1,\ldots ,n\). A famous representative of this group is the method SMACOF for the stress minimization (5) [13, 14]. The key idea is to alternately minimize the function \(f^{(d, 2)}\) with respect to each \(\mathbf{x}_i\), while keeping the other points \(\mathbf{x}_j\) (\(j \not =i\)) unchanged; each minimization problem is relatively easy to solve by employing the technique of majorization. SMACOF has been widely used, and the interested reader may refer to [6] for more references and to [54] for some critical comments on SMACOF when it is applied to the sensor network localization problem. The second and third groups consist, respectively, of the methods of SDP and EDM optimization. We will give a more detailed review of these two groups because of their close relevance to the method proposed in this paper. The main purpose of our review is to show that efficient numerical methods for the REE problem (3) are lacking.
2.1 On SDP approach
We note that each of the four objective functions involves either the Euclidean distance \(d_{ij}\) or its square \(D_{ij} = d_{ij}^2\). A crucial observation is that constraints on them often have SDP relaxations. For example, it is easy to see
where \(Y := X^TX \succeq 0\). Hence, the squared distance \(d^2_{ij}\) is a linear function of the positive semidefinite matrix Y. Consequently, the EDM cone \(\mathcal{D}^n\) can be described through linear transformations of positive semidefinite matrices. One can further relax the constraint \(Y=X^TX\) to \(Y \succeq X^TX\). By the Schur complement, one has
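The Schur complement relation just invoked, namely that \(Y \succeq X^TX\) exactly when the block matrix with blocks \(I\), X, \(X^T\), Y is positive semidefinite, can be checked numerically. The sketch below (helper names ours) assembles the block matrix and tests positive semidefiniteness via its smallest eigenvalue:

```python
import numpy as np

def schur_block(X, Y):
    """Assemble Z = [[I_r, X], [X^T, Y]].  Since I_r is positive definite,
    the Schur complement gives: Z >= 0 if and only if Y - X^T X >= 0."""
    r = X.shape[0]
    return np.block([[np.eye(r), X], [X.T, Y]])

def is_psd(A, tol=1e-9):
    """Positive semidefiniteness test via the smallest eigenvalue."""
    return np.linalg.eigvalsh(A).min() >= -tol
```

Taking \(Y = X^TX\) gives the boundary case (Z positive semidefinite but singular), while shrinking Y below \(X^TX\) breaks positive semidefiniteness.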
By dropping the rank constraint, the robust MDS problem (4) can be relaxed to an SDP, an approach initiated by Biswas and Ye [15].
For the Euclidean distance \(d_{ij}\), we introduce a new variable \(T_{ij}=d_{ij}\). One may relax this constraint to \(T_{ij} \le d_{ij}\), which has an SDP representation:
Combining (20), (21) and (22) leads to a large number of SDP relaxations. Typical examples, for the robust MDS problem (4), are the SDP relaxation method [5] and the edge-based SDP relaxation methods [38, 50] and [27], the last of which led to the comprehensive Matlab package SFSDP. For the squared stress (6), one may refer to [16, 25]. For the stress problem (5), a typical SDP relaxation can be found in [35, Problem (8)]. However, unlike the problems (4), (5) and (6), the REE problem (3) does not have a straightforward SDP relaxation. We use an attempt made in [35] to illustrate this point below.
First, it is noted that problem (3) can be written in terms of EDM:
The term \(| \sqrt{D_{ij}} - \delta _{ij} |\) is convex if \(\delta _{ij} > \sqrt{D_{ij}}\) and is concave otherwise. A major obstacle is how to efficiently deal with the concavity in the objective.
Secondly, by dropping the rank constraint and through a certain linear approximation of the concave term, an SDP problem is proposed for (3) (see [35, Eq. (20)]):
where the quantities \(a_{ij}\) and \(b_{ij}\) can be computed from \(\delta _{ij}\). We note that each quadratic constraint in (23) is equivalent to a positive semidefinite constraint on \(\mathcal{S}^2_+\), and \(D \in \mathcal{D}^n\) is a semidefinite constraint on \(\mathcal{S}^n_+\) by (8). Therefore, the total number of semidefinite constraints is \(|\mathcal{E}| + 1\). Finally, the optimal solution of (23) is then refined through a second-stage algorithm (see [35, Sect. IV(B)]). Both stages of the algorithmic scheme above would need sophisticated implementation skills, and its numerical efficiency and solution quality are yet to be confirmed due to the two relaxation steps (dropping the rank constraint and linearization). The lack of efficient algorithms for (3) motivated the research in this paper.
2.2 On EDM approach
A distinguishing feature from the SDP approach is that this approach treats the EDM D as the main variable, without having to rely on its SDP representation. This approach works because of the characterization (12) and the fact that the orthogonal projection onto \(\mathcal{K}^n_+\) has a closed-form formula [19, 20]. Several methods are based on this formula. The basic model for this approach is the so-called nearest EDM problem:
which is a convex relaxation of (6) with the special choice \(W_{ij} \equiv 1\). Here the elements of the matrix \(\varDelta ^{(2)}\) are given by \(\varDelta ^{(2)}_{ij} := \delta _{ij}^2\). The relaxation is obtained by dropping the rank constraint \({\mathrm {rank}}(JDJ) \le r\). Since the feasible set of (24) is the intersection of a subspace and a convex cone, the method of alternating projections was proposed in [19, 20], with applications to molecular conformation [21]. Newton’s method for (24) was developed in [39]. Extensions of Newton’s method to the model (24) with more constraints, including general weights \(W_{ij}\), the rank constraint \({\mathrm {rank}}(JDJ) \le r\), or the box constraints (7), can be found in [3, 11, 40]. A recent application of the model (24) with a regularization term to statistics is [55], where the problem is solved by an SDP, similar to that proposed by Toh [48].
There are two common features in this class of methods. One is that they require the objective function to be convex, which is true for the problems (4), (5) and (6) when formulated in EDM. The other is that the nonconvexity is caused only by the rank constraint. However, as already seen in Sect. 1.2, the REE problem (3) in terms of EDM has a nonconvex objective, coupled with the (unsquared) distance \(d_{ij}\) being used. This causes various difficulties in applying the existing EDM-based methods mentioned above to (3). A recent work [56] by the authors extended the EDM approach to the stress minimization problem (5) along a line similar to that outlined in Sect. 1.2. Once again, we emphasize that the key difference between the problems (3) and (5) is nonconvex versus convex objective, and non-differentiability versus differentiability. Hence, the problem (3) is significantly more difficult to solve than (5). Nevertheless, we will show that it can be efficiently solved by the proposed EDM optimization.
3 Background and technical lemmas
In this part, we introduce the necessary background on subgradients and on the positive roots of a special depressed cubic equation. In particular, we will prove a technical result about the composition of the absolute value and square root functions. This result (Lemma 1) is in the style of a Taylor expansion for differentiable functions.
3.1 Subgradients of functions
An important function appearing in our EDM reformulation (11) of the REE problem (3) is \(\phi _{\delta }(\cdot ): \mathfrak {R}_+ \mapsto \mathfrak {R}_+\) defined for a given constant \(\delta >0\) by
where \(\mathfrak {R}_+\) is the set of all nonnegative numbers. We will need to compute its subgradient in the sense of Rockafellar and Wets [42].
Definition 1
[42, Def. 8.3] Consider a function \(f: \mathfrak {R}^n \mapsto \mathfrak {R}\cup \{-\infty , + \infty \}\) and a point \(\bar{\mathbf{x}}\) with \(f(\bar{\mathbf{x}})\) finite. For a vector \(\mathbf{v}\in \mathfrak {R}^n\), one says that
-
(a)
\(\mathbf{v}\) is a regular subgradient of f at \(\bar{\mathbf{x}}\), written \(\mathbf{v}\in \widehat{\partial } {f}(\bar{\mathbf{x}})\), if
$$\begin{aligned} f(\mathbf{x}) \ge f(\bar{\mathbf{x}}) + \langle \mathbf{v}, \; \mathbf{x}- \bar{\mathbf{x}} \rangle + o( \Vert \bar{\mathbf{x}}- \mathbf{x}\Vert ), \end{aligned}$$where the little ‘o’ term is a short-hand notation for the one-sided limit condition:
$$\begin{aligned} \liminf _{ {\mathop {\mathbf{x}\not = \bar{\mathbf{x}} }\limits ^{ \mathbf{x}\rightarrow \bar{\mathbf{x}} }} } \frac{f(\mathbf{x}) - f(\bar{\mathbf{x}}) - \langle \mathbf{v}, \; \mathbf{x}- \bar{\mathbf{x}} \rangle }{\Vert \mathbf{x}- \bar{\mathbf{x}} \Vert } \ge 0; \end{aligned}$$ -
(b)
\(\mathbf{v}\) is a (general) subgradient of f at \(\bar{\mathbf{x}}\), written \(\mathbf{v}\in \partial {f}(\bar{\mathbf{x}})\), if there are sequences \(\mathbf{x}^\nu \rightarrow \bar{\mathbf{x}}\) with \(f(\mathbf{x}^\nu ) \rightarrow f(\bar{\mathbf{x}})\) and \(\mathbf{v}^\nu \in \widehat{\partial } f(\mathbf{x}^\nu )\) with \(\mathbf{v}^\nu \rightarrow \mathbf{v}\).
We call \(\partial f(\bar{\mathbf{x}})\) the subdifferential of f at \(\bar{\mathbf{x}}\). For a given number \(x \in \mathfrak {R}\), we define its sign by
Clearly, \(\phi _\delta (x)\) is continuous for \(x>0\), and its subdifferential at \(x>0\) is obtained by directly applying Definition 1 (note \(\delta >0\)):
We note that the subdifferential of \(\phi _{\delta } (x)\) at \(x=0\) is more complicated to describe. Fortunately, we will not need it in our analysis. We state our key lemma below.
Lemma 1
Let \(\delta >0\) be given. It holds
Compared with the defining inequality of a convex function, the inequality above has an extra negative quadratic term, so it is not sufficient for \(\phi _{\delta } (x)\) to be convex on \((0,+\infty )\) (in fact, \(\phi _{\delta } (x)\) is only quasi-convex). Fortunately, this property allows us to make the function \(\phi _{\delta }+\mu \varphi \) convex, provided that \(\varphi \) is strongly convex and \(\mu \) is chosen properly. This is one of the key ideas developed in this paper.
3.2 Construction of the majorization function
A major building block in our algorithm is the majorization function \(g_m(D, D^k)\) at a given point \(D^k\) for the function g(A) defined in (14). We construct it below.
Suppose \(A \in \mathcal{S}^n\) has the following eigenvalue-eigenvector decomposition:
where \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _n\) are the eigenvalues of A in non-increasing order, and \(\mathbf{p}_i\), \(i=1, \ldots , n\) are the corresponding orthonormal eigenvectors. We define a PCA-style matrix truncated at r:
Recall the definition of \(\varPi ^B_{{\mathcal{K}^n_+(r)}}(A)\) in (13). We let \(\varPi _{{\mathcal{K}^n_+(r)}}(A)\) be an element of \(\varPi ^B_{{\mathcal{K}^n_+(r)}}(A)\) and note that the function g(A) in (14) does not depend on the choice of \(\varPi _{{\mathcal{K}^n_+(r)}}(A)\). As seen from the known results below, one particular element \(\varPi _{{\mathcal{K}^n_+(r)}}(A)\) can be computed through \({\mathrm {PCA}}_r^+(A)\).
Lemma 2
Let \(A \in \mathcal{S}^n\) be a given matrix and \(r \le n\) an integer. The following results hold.
-
(i)
[40, Eq. (22), Proposition 3.3] One particular \(\varPi _{\mathcal{K}^n_+(r)}(A)\) can be computed through
$$\begin{aligned} \varPi _{\mathcal{K}^n_+(r)}(A) = {\mathrm {PCA}}_r^+(JAJ) + (A -JAJ) \end{aligned}$$(28) -
(ii)
[40, Eq. (26), Proposition 3.3] We have
$$\begin{aligned} \langle \varPi _{\mathcal{K}^n_+(r)}(A), \ A - \varPi _{\mathcal{K}^n_+(r)}(A) \rangle = 0. \end{aligned}$$(29) -
(iii)
[40, Proposition 3.4] The function
$$\begin{aligned} h(A) := \frac{1}{2} \Vert \varPi _{\mathcal{K}^n_+(r)}(A) \Vert ^2 \end{aligned}$$is well defined and is convex. Moreover,
$$\begin{aligned} \varPi _{\mathcal{K}^n_+(r)}(A) \in \partial h(A), \end{aligned}$$where \(\partial h(A)\) is the subdifferential of \(h(\cdot )\) at A.
-
(iv)
[56, Lemma 2.2] Let g(A) be defined in (14). We have for any \(A \in \mathcal{S}^n\)
$$\begin{aligned} g(A) = \frac{1}{2} \Vert A\Vert ^2 - h(-A) \ \ \hbox {and} \ \ \Vert \varPi _{\mathcal{K}^n_+(r)} (A) \Vert \le 2 \Vert A\Vert . \end{aligned}$$(30)
Since \(h(\cdot )\) is convex and \(\varPi _{\mathcal{K}^n_+(r)}(A) \in \partial h(A)\) (Lemma 2(iii)), we have
This, together with Lemma 2(iv), implies
It is straightforward to check that the function \(g_m(\cdot , \cdot )\) in (31) satisfies the majorization properties (17).
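The construction can also be verified numerically. Since the display (31) itself is omitted above, the sketch below (function names ours) assembles the majorization from the ingredients just listed: \({\mathrm {PCA}}_r^+\) from (27), the projection formula (28), the identity \(g(A) = \frac{1}{2}\Vert A\Vert ^2 - h(-A)\) from (30), and the subgradient inequality for the convex function h. Assembled this way, \(g_m(D^k, D^k) = g(D^k)\) and \(g_m(D, D^k) \ge g(D)\), as required by (17).

```python
import numpy as np

def pca_r_plus(A, r):
    """PCA-style truncation (27): keep the positive parts of the r largest
    eigenvalue terms of A."""
    lam, P = np.linalg.eigh(A)                 # ascending eigenvalues
    lam, P = lam[::-1][:r], P[:, ::-1][:, :r]  # the r largest pairs
    return (P * np.maximum(lam, 0.0)) @ P.T

def proj_Knr(A, r):
    """One particular projection onto K^n_+(r), via formula (28)."""
    n = A.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    JAJ = J @ A @ J
    return pca_r_plus(JAJ, r) + (A - JAJ)

def g(D, r):
    """g(D) = 0.5*||D||^2 - h(-D), cf. (30), with h(A) = 0.5*||Pi(A)||^2."""
    return 0.5 * np.linalg.norm(D) ** 2 - 0.5 * np.linalg.norm(proj_Knr(-D, r)) ** 2

def g_major(D, Dk, r):
    """Majorization of g at Dk from the subgradient inequality for h:
    g_m(D, Dk) = 0.5*||D||^2 - h(-Dk) + <Pi(-Dk), D - Dk>."""
    Pk = proj_Knr(-Dk, r)
    return (0.5 * np.linalg.norm(D) ** 2 - 0.5 * np.linalg.norm(Pk) ** 2
            + np.sum(Pk * (D - Dk)))
```

Random symmetric test matrices confirm the touching condition at \(D^k\), the majorization inequality, and the norm bound \(\Vert \varPi _{\mathcal{K}^n_+(r)}(A)\Vert \le 2\Vert A\Vert \) of (30).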
3.3 Positive roots of depressed cubic equations
In our algorithm, we will encounter the positive root of a depressed cubic equation [7, Chp. 7], which arises from the optimality condition of the following problem
where \(\nu >0\) and \(t \in \mathfrak {R}\) are given. A positive stationary point x must satisfy the optimality condition
Let \(y := \sqrt{x}\). The optimality condition above becomes
This is the classical form of the so-called depressed cubic equation [7, Chp. 7]. Its roots (complex or real) and their computational formulae have a long history with fascinating and entertaining stories. A comprehensive revisit of this subject can be found in Xing [51], and a successful application of the depressed cubic equation to compressed sensing can be found in [36, 52]. The following lemma says that, under certain conditions, Eq. (33) has two distinct positive roots; its proof is a specialization of [9, Lemma 2.1(iii)] with \(p =1/2\) therein.
Lemma 3
[9, Lemma 2.1(iii)] Consider the problem (32). Let
When \(t > \bar{t}\), s(x) has two distinct positive stationary points \(\hat{x}_1\) and \(\hat{x}_2\) satisfying
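Numerically, the roots of (33) are easy to compute. Since the displays for (32)–(33) are omitted above, the sketch below assumes the concrete form \(s(x) = \frac{1}{2}(x-t)^2 + \nu \sqrt{x}\), which is consistent with the pieces \(q_\pm \) appearing later in Sect. 4.2; with \(y = \sqrt{x}\), its stationarity condition becomes the depressed cubic \(2y^3 - 2ty + \nu = 0\):

```python
import numpy as np

def positive_stationary_points(t, nu):
    """Positive stationary points of s(x) = 0.5*(x - t)**2 + nu*sqrt(x)
    (an assumed concrete form of (32)).  Setting
        s'(x) = (x - t) + nu / (2*sqrt(x)) = 0
    and substituting y = sqrt(x) yields 2*y**3 - 2*t*y + nu = 0."""
    roots = np.roots([2.0, 0.0, -2.0 * t, nu])   # companion-matrix root finder
    ys = sorted(y.real for y in roots if abs(y.imag) < 1e-10 and y.real > 0)
    return [y * y for y in ys]                   # back-substitute x = y**2
```

For t sufficiently large relative to \(\nu \) (i.e., \(t > \bar{t}\)), exactly two positive stationary points appear, matching the conclusion of Lemma 3.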
4 Penalized REE model and its majorization subproblem
With the preparation above, we are ready to address our penalized REE problem (16) and its majorization subproblem (18). We first address the relationship between (16) and its original problem (11). We then show how the subproblem (18) is solved.
4.1 \(\epsilon \)-optimal solution and critical point
The classical results on penalty methods in [34] for the differentiable case (i.e., all functions involved are differentiable) are not applicable here. Recently, the penalty approach was studied by Gao in her PhD thesis [18] in the context of semidefinite programming, which motivated the investigation below. The main result is that (16) provides an \(\epsilon \)-optimal solution and an \(\epsilon \)-critical point for the original problem when the penalty parameter is above a certain threshold.
We first introduce the concept of a critical point of (11) [i.e., (15)], which encodes a first-order optimality condition. It is associated with the Lagrange function of (15) defined by
where \(\beta \in \mathfrak {R}\) is the Lagrangian multiplier and \(I_{\mathcal{B}}(D)\) is the indicator function taking value 0 if \(D\in \mathcal{B}\) and \(+\infty \) otherwise. We say \(\overline{D} \in \mathcal{S}^n\) is a critical point of (15) if it satisfies \(0\in \partial \mathcal L(\bar{D}, \bar{\beta })\) for some \(\bar{\beta } \in \mathfrak {R}\), namely, there exists \(\overline{\varGamma } \in \partial f(\overline{D})\) and \(\overline{C} \in \hbox {Conv}(\varPi _{\mathcal{K}^n_+(r)}^B(- \overline{D}))\) such that
where we used the following facts: \(\partial g(D)= D + \hbox {Conv}(\varPi _{\mathcal{K}^n_+(r)}^B(- D))\) (see [40, Proposition 3.4]), where \(\hbox {Conv}(\varOmega )\) denotes the convex hull of a given set \(\varOmega \); and the sub-differential of the indicator function of a convex set \(\varOmega \) is its normal cone, namely, \(\partial I_\varOmega (x)=N_{\varOmega }(x)=\{y:~\langle y, z-x\rangle \le 0,~\forall ~z\in \varOmega \}. \) We will see that the choice \(\overline{C} = \varPi _{\mathcal{K}^n_+(r)}(-\overline{D})\) is enough for our analysis, and hence it is used in our definition of an approximate critical point.
Definition 2
-
(i)
(\(\epsilon \)-Optimal solution) Suppose \(D^*\) is an optimal solution of (11). For a given error tolerance \(\epsilon >0\), a point \(\widehat{D}\) is called an \(\epsilon \)-optimal solution of (11) if it satisfies
$$\begin{aligned} \widehat{D} \in \mathcal{B}, \quad g(\widehat{D}) \le \epsilon \quad \hbox {and} \quad f(\widehat{D}) \le f(D^*). \end{aligned}$$ -
(ii)
(\(\epsilon \)-Critical point) A point \(\overline{D}\) is called an \(\epsilon \)-critical point of (11) if there exist \(\overline{\varGamma } \in \partial f(\overline{ D})\) and \(\bar{\beta } \in \mathfrak {R}\) such that
$$\begin{aligned} g({\overline{D}} )\le \epsilon ,~ \left\langle {\bar{\varGamma }} + {\bar{\beta }} {\overline{D}} + {\bar{\beta }} \varPi _{\mathcal{K}^n_+(r)}(-{\overline{D}}), \; D-{\overline{D}} \right\rangle \ge 0,~~\forall ~D\in \mathcal{B}. \end{aligned}$$(35)
Obviously, if \(\epsilon =0\), an \(\epsilon \)-optimal solution \(\widehat{D}\) would be an optimal solution of (11). We will show that the optimal solution of (16) is \(\epsilon \)-optimal as well as an \(\epsilon \)-critical point provided that \(\rho \) is large enough. Let \(D^*_\rho \) be an optimal solution of the penalized REE (16) and let \(D_r\) be any feasible solution of the original problem (11). If the lower bound matrix \(L \equiv 0\), we can simply choose \(D_r =0\). Define
We have the following result.
Proposition 1
For any \(\rho \ge \rho _{\epsilon }\), \(D^*_\rho \) must be an \(\epsilon \)-optimal solution as well as an \(\epsilon \)-critical point of (11). That is, there exists \(\varGamma ^* \in \partial f(D^*_\rho )\) such that
4.2 Solving the subproblem
Having constructed the majorization function in (31), we now focus on how to solve the majorization subproblem (18), which is equivalent to the following problem. Given the current iterate \(Z \in \mathcal{B}\), the majorization subproblem aims to compute an improved iterate, denoted \(Z^+\), by solving
where the matrix \(Z_K := - \varPi _{\mathcal{K}^n_+(r)}(-Z)\). This subproblem has a perfect separability property that allows it to be solved elementwise:
For the ease of our description, we denote the subproblem solution process by
Here, PREEEDM stands for Penalized REE via EDM optimization. We will show how (40) can be computed.
Let us consider a simplified one-dimensional optimization problem, whose solution will eventually give rise to PREEEDM. Let B denote the interval [a, b] in \(\mathfrak {R}\) with \(0 \le a \le b\). For given \(\omega \in \mathfrak {R}, \delta >0\) and \(\beta > 0\), we aim to compute
The acronym dcroot stands for the root of a depressed cubic equation, which will eventually give rise to the solution formula of (41). It suffices to consider the case that matters to us:
Before solving the above problem, we define some notation for convenience
Obviously, q(x) has a representation of two pieces:
It is noted that \(q_-(x)\) is convex, but \(q_+(x)\) may not necessarily be so. We will show that both pieces admit a closed-form formula for their respective minima.
Proposition 2
Consider the optimization problem:
Define
Then (43) has a unique solution \(x^*_-\) given by
Now we characterize the optimal solution of \(q_+(x)\) over \(B_+\).
Proposition 3
Assume that \(\beta < 4 \delta ^3\) and consider the optimization problem:
Define
Then \(q_+(x)\) is strictly convex over the interval \([\delta ^2, \infty )\) and
Putting together Propositions 2 and 3 gives rise to the optimal solution of (41). The optimal solution is either \(x^*_-\) or \(x^*_+\), whichever gives a lower functional value of q(x). This is the first result of our major theorem below. We note that both Propositions 2 and 3 make use of the convexity of \(q_-(x)\) and \(q_+(x)\) on the respective intervals \([a, \delta ^2]\) and \([\delta ^2, b]\). In fact, we can establish a stronger result: when the two pieces join together, the resulting function q(x) is still convex on the whole interval [a, b]. This result is very important to our convergence analysis in the next section and is the second result of the theorem below. A key tool for the proof is Lemma 1.
Theorem 1
Let B denote the interval [a, b] with \(0 \le a \le \delta ^2 \le b\). We assume \(0< \beta < 4 \delta ^3\). Then, the following hold.
(i)
The optimal solution of the problem (41) is given by
$$\begin{aligned} {\texttt {dcroot}} _B[\omega , \beta ,\delta ] = \left\{ \begin{array}{lll} x^*_-,&{}\mathrm{if}&{} q(x^*_-)\le q(x^*_+),\\ x^*_+,&{}\mathrm{if}&{} q(x^*_-)> q(x^*_+).\\ \end{array} \right. \end{aligned}$$
(ii)
The function q(x) is strictly convex on [a, b]. Consequently, there exists \(\xi \in \partial q({\texttt {dcroot}} _B[\omega , \beta ,\delta ] )\) such that
$$\begin{aligned} \xi (x-{\texttt {dcroot}} _B[\omega , \beta , \delta ] )\ge 0 ~~~~\text {for~any}~ x\in B. \end{aligned}$$
(iii)
Let \(\gamma _{\omega , \beta }\) be defined in (42), then \({\texttt {dcroot}} _B[\omega , \beta ,\delta ] \ge \min \{\delta ^2, b, 1, \gamma _{\omega , \beta }\}\). We view \({\texttt {dcroot}} _B[\omega , \beta ,\delta ]\) as a function of \(\omega \). Suppose \(C>0\) is an arbitrarily given constant. Then there exists a constant \(\kappa >0\) such that
$$\begin{aligned} {\texttt {dcroot}} _B[\omega , \beta ,\delta ] > \kappa \qquad \forall \ \omega \ \hbox {such that} \ | \omega | \le C. \end{aligned}$$
Comment: The optimal solution \({{\texttt {dcroot}}}_B[\omega , \beta ,\delta ]\) is unique, since q(x) is strictly convex over [a, b]. However, its location could be within the interval \([a, \delta ^2]\) or \([\delta ^2, b]\), depending on the magnitudes of the parameters (\(\omega , \beta \) and \(\delta \)) involved. The dependence is illustrated in Fig. 1. We also note that the function q(x) may not be convex if the condition \(\beta < 4 \delta ^3\) is violated. \(\square \)
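To make the two-piece strategy of Theorem 1(i) concrete, the following Python sketch minimizes each piece by bisection on its derivative and keeps the better of the two minimizers. It assumes the scalar objective takes the form \(q(x)=\frac{1}{2}(x-\omega )^2+\beta |\sqrt{x}-\delta |\) (a quadratic proximal term plus the \(\ell _1\)-type term in \(\sqrt{x}\)); an actual implementation would replace the bisections with the closed-form formulas of Propositions 2 and 3.

```python
import math

def dcroot(omega, beta, delta, a, b, tol=1e-12):
    """Minimize q(x) = 0.5*(x - omega)**2 + beta*|sqrt(x) - delta| over [a, b],
    assuming 0 <= a <= delta**2 <= b and 0 < beta < 4*delta**3 (Theorem 1).
    Each piece is convex on its interval, so bisection on the derivative works."""
    def q(x):
        return 0.5 * (x - omega) ** 2 + beta * abs(math.sqrt(x) - delta)

    def argmin_piece(lo, hi, dq):
        # minimizer of a convex piece: clamp at an end or bisect on dq = 0
        if dq(lo) >= 0:
            return lo
        if dq(hi) <= 0:
            return hi
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if dq(mid) > 0:
                hi = mid
            else:
                lo = mid
        return 0.5 * (lo + hi)

    # q_-(x) = 0.5*(x - omega)^2 + beta*(delta - sqrt(x)) on [a, delta^2]
    def dq_minus(x):
        return (x - omega) - beta / (2.0 * math.sqrt(x)) if x > 0 else -math.inf

    # q_+(x) = 0.5*(x - omega)^2 + beta*(sqrt(x) - delta) on [delta^2, b]
    def dq_plus(x):
        return (x - omega) + beta / (2.0 * math.sqrt(x))

    x_minus = argmin_piece(a, delta ** 2, dq_minus)
    x_plus = argmin_piece(delta ** 2, b, dq_plus)
    return x_minus if q(x_minus) <= q(x_plus) else x_plus
```

Under this form of q, the stationarity condition of \(q_+\), namely \((x-\omega )+\beta /(2\sqrt{x})=0\), becomes \(t^3-\omega t+\beta /2=0\) with \(t=\sqrt{x}\), a depressed cubic, which explains the name dcroot.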
It now follows from Theorem 1 that the optimal solution \(Z^+_{ij}\) in (39) can be computed by:
Consequently, \(Z^+ = {\texttt {PREEEDM}}_\mathcal{B}(Z_K, W/\rho , \varDelta )\) in (40) is well defined and its elements can be computed by (47).
5 Algorithm PREEEDM and its convergence
With the preparations above, we are ready to state our algorithm. Let \(D^k \in \mathcal{B}\) be the current iterate. We update it by solving the majorization subproblem of the type (38) with Z replaced by \(D^k\):
which can be computed by
In more detail, we have
where \(Z_K^k := - \varPi _{\mathcal{K}^n_+(r)} (-D^k)\), and the elements of \(D^{k+1}\) are computed as follows:
Our algorithm PREEEDM is formally stated as follows.
A major obstacle in analysing the convergence for the penalized EDM model (16) is the non-differentiability of the objective function. We need the following two reasonable assumptions:
Assumption 1
The constraint box \(\mathcal{B}\) is bounded.
Assumption 2
For \(W\), \(\varDelta \), \(L\) and \(U\), we require \(W_{ij} =0\) if \(\delta _{ij} =0\), and \(U_{ij}\ge \delta _{ij}^2 \ge L_{ij}\) if \(\delta _{ij} >0\).
Assumption 1 can be easily satisfied (e.g., setting the upper bound to be \(n^2 \max \{\delta _{ij}^2\}\)). Assumption 2 means that if \(\delta _{ij}=0\) (e.g., value missing), the corresponding weight \(W_{ij}\) should be 0. This is a common practice in applications. If \(\delta _{ij}>0\), then we require \(\delta _{ij}^2\) to be between \(L_{ij}\) and \(U_{ij}\). We further define a quantity that bounds our penalty parameter \(\rho \) from below:
Our first result in this section is about the boundedness of the subdifferential of \(f(\cdot )\) along the generated sequence \(\{ D^k\}\).
Proposition 4
Suppose Assumptions 1 and 2 hold. Let \(\rho > \rho _o \) and \(\{ D^k\}\) be the sequence generated by Algorithm 1. Then the following hold.
(i)
There exists a constant \(c_1 >0\) such that
$$\begin{aligned} D^k_{ij} \ge c_1 \qquad \hbox {for all} \ (i,j) \ \hbox {such that}\ W_{ij} >0 \ \hbox {and} \ k=1,2,\ldots . \end{aligned}$$
(ii)
Let \(\partial f(D)\) denote the subdifferential of \(f(D) = \Vert W \circ (\sqrt{D} - \varDelta ) \Vert _1\). Then there exists a constant \(c_2 >0\) such that
$$\begin{aligned} \Vert \varGamma \Vert \le c_2 \qquad \forall \ \varGamma \in \partial f(D^k), \ \ k=1,2,\ldots . \end{aligned}$$
(iii)
The function \(f_\rho ^k(D)\) is convex for all \(k=1, 2, \ldots \). Moreover, there exists \(\varGamma ^{k+1} \in \partial f(D^{k+1})\) such that the first-order optimality condition for (49) is
$$\begin{aligned} \left\langle \varGamma ^{k+1} + \rho D^{k+1} + \rho \varPi _{\mathcal{K}^n_+(r)} (-D^k), \ D - D^{k+1} \right\rangle \ge 0, \qquad \forall \ D \in \mathcal{B}. \end{aligned}$$(52)
Proposition 4(i) ensures that \(D^k_{ij} >0\) for all \(k=1,2,\ldots \). Hence, we can apply Lemma 1 to each function \(\phi _{\delta _{ij}}(\cdot )\) with \(x = D^{k+1}_{ij}\) and \(y= D^k_{ij}\). This yields, for any \(\zeta ^{k+1}_{ij} \in \partial \phi _{\delta _{ij}} (D^{k+1}_{ij})\),
Multiplying \(W_{ij}\) on both sides and adding those inequalities over (i, j), we get
where \(\varGamma ^{k+1}_{ij} :=W_{ij} \zeta ^{k+1}_{ij}\). We note that the inequality (53) holds for any \(\varGamma ^{k+1} \in \partial f(D^{k+1})\).
Theorem 2
Let \(\rho > \rho _o \) and \(\{ D^k\}\) be the sequence generated by Algorithm 1. Suppose Assumptions 1 and 2 hold.
(i)
We have
$$\begin{aligned} f_\rho (D^{k+1}) - f_\rho (D^k) \le -\frac{\rho - \rho _o }{2} \Vert D^{k+1} - D^k \Vert ^2~~\text {for~any}~~k=0,1,\ldots . \end{aligned}$$
Consequently, \(\Vert D^{k+1} - D^k \Vert \rightarrow 0\).
(ii)
Let \(\widehat{D}\) be an accumulation point of \(\{D^k\}\). Then there is \(\widehat{\varGamma }\in \partial f(\widehat{D})\) such that
$$\begin{aligned} \langle \widehat{\varGamma } + \rho \widehat{D} + \rho \varPi _{\mathcal{K}^n_+(r)}(-\widehat{D}), \; D - \widehat{D} \rangle \ge 0~~~~\text {for~any}~D \in \mathcal{B}. \end{aligned}$$(54)
That is, \(\widehat{D}\) is a critical point of the problem (16). Moreover, for a given \(\epsilon >0\), if \(D^0 \in \mathcal{K}^n_+(r)\cap \mathcal{B}\) and
$$\begin{aligned} \rho \ge \rho _\epsilon := \max \{\rho _o, f(D^0)/\epsilon \}, \end{aligned}$$
then \(\widehat{D}\) is an \(\epsilon \)-critical point of the original problem (11).
(iii)
If \(\widehat{D}\) is an isolated accumulation point of the sequence \(\{D^k\}\), then the whole sequence \(\{D^k\}\) converges to \(\widehat{D}\).
Proof
(i) We are going to use the following facts, stated in terms of \(D^{k+1}\) and \(D^k\). The first fact is the identity:
The second fact is due to the convexity of h(D) (see Lemma 2(ii)):
The last fact is that there exists \(\varGamma ^{k+1}\in \partial f(D^{k+1})\) such that (52) holds. These facts yield the following chain of inequalities:
This proves that the sequence \(\{ f_\rho (D^k)\}\) is non-increasing; it is also bounded below by 0 and hence convergent. Taking the limits on both sides then yields \(\Vert D^{k+1} - D^k \Vert \rightarrow 0\).
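The last step can also be seen via the standard telescoping argument: summing the sufficient-decrease inequality of part (i) over \(k=0,\ldots ,K\) gives

$$\begin{aligned} \frac{\rho -\rho _o}{2}\sum _{k=0}^{K}\Vert D^{k+1}-D^k\Vert ^2 \le f_\rho (D^0)-f_\rho (D^{K+1}) \le f_\rho (D^0), \end{aligned}$$

so \(\sum _{k}\Vert D^{k+1}-D^k\Vert ^2 < \infty \) and hence \(\Vert D^{k+1}-D^k\Vert \rightarrow 0\).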
(ii) Suppose \(\widehat{D}\) is the limit of a subsequence \(\{ D^{k_\ell }\}\), \(\ell =1,2,\ldots \). Since we have established in (i) that \((D^{k_{\ell }+1} - D^{k_\ell }) \rightarrow 0\), the sequence \(\{ D^{k_\ell +1}\}\) also converges to \(\widehat{D}\). Furthermore, there exists a sequence of \(\varGamma ^{k_\ell +1} \in \partial f(D^{k_\ell +1})\) such that (52) holds. Proposition 4(ii) ensures that there exists a constant \(c_2>0\) such that \(\Vert \varGamma ^{k_\ell +1} \Vert \le c_2\) for all \(k_\ell \). Hence, there exists a subsequence of \(\{k_\ell \}\) (still denoted by \(\{k_\ell \}\) for simplicity) such that \(\varGamma ^{k_\ell +1}\) converges to some \(\widehat{\varGamma } \in \partial f(\widehat{D})\). Now taking the limits on both sides of (52) along \(\{ k_\ell \}\), we reach the desired inequality (54). We now prove that \(\widehat{D}\) is an \(\epsilon \)-critical point of (11). Since we already have (54), we only need to show \(g(\widehat{D}) \le \epsilon \). It follows from \(D^0 \in {\mathcal{K}^n_+(r)}\cap \mathcal{B}\) that
Taking the limit on the right-hand side yields
where we used \(f(\widehat{D}) \ge 0\). Therefore, since \(\rho \ge \rho _\epsilon \), we have
(iii) We proved in (i) that \((D^{k+1} - D^k) \rightarrow 0\). The convergence of the whole sequence to \(\widehat{D}\) then follows from [26, Proposition 7]. \(\square \)
6 Numerical experiments
In this part, we conduct extensive numerical experiments with our algorithm PREEEDM, implemented in MATLAB (R2014a) and run on a desktop with 8GB of memory and an Intel(R) Core(TM) i5-4570 3.20GHz CPU, against six leading solvers on problems of sensor network localization (SNL) in \(\mathfrak {R}^2\) (\(r=2\)) and molecular conformation (MC) in \(\mathfrak {R}^3\) (\(r=3\)). The MATLAB package is available at DOI: https://doi.org/10.5281/zenodo.3343047. This section is split into the following parts. Our implementation of PREEEDM is described in Sect. 6.1. Sect. 6.2 briefly explains how the six benchmark methods were selected. Sect. 6.3 describes how the test data for SNL and MC were collected and generated, and reports extensive numerical comparisons.
6.1 Implementation
The PREEEDM Algorithm 1 is easy to implement. We first address the stopping criterion used in Step 3 of Algorithm 1. We monitor two quantities. One measures how close the current iterate \(D^k\) is to being Euclidean (i.e., belonging to \(\mathcal{K}^n_+(r)\)). This can be computed by using (28) as follows.
where \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _n\) are the eigenvalues of \((-JD^kJ)\). The smaller \(\texttt {Kprog}_k\) is, the closer \(D^k\) is to \(\mathcal{K}^n_+(r)\). The benefit of using \(\texttt {Kprog}\) over g(D) is that the former is independent of any scaling of D.
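As an illustration, one scale-invariant measure of this kind is the relative residual of the best rank-r PSD approximation of \(-JD^kJ\), where \(J = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\) is the centring matrix. The Python sketch below is a plausible stand-in for \(\texttt {Kprog}\), not the paper's exact formula:

```python
import numpy as np

def euclidean_residual(D, r):
    """Scale-invariant measure of how close D is to K^n_+(r): the relative
    distance of A = -J*D*J to its nearest PSD matrix of rank <= r, built from
    the top-r nonnegative eigenvalues of A (the PCA_r^+ operation of (27))."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    A = -J @ D @ J
    lam, P = np.linalg.eigh(A)                   # eigenvalues in ascending order
    lam_r = np.zeros_like(lam)
    lam_r[-r:] = np.maximum(lam[-r:], 0.0)       # keep the top-r nonnegative ones
    A_r = (P * lam_r) @ P.T                      # best rank-<=r PSD approximation
    return np.linalg.norm(A - A_r) / max(np.linalg.norm(A), 1e-15)
```

For a true squared-distance matrix of points in \(\mathfrak {R}^r\) the residual vanishes, and the measure is invariant under the scaling \(D \mapsto tD\), \(t>0\).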
The other quantity measures the progress in the functional values \(f_\rho (\cdot )\) made by the current iterate \(D^k\). In theory (see Theorem 2), we should require \(\rho >\rho _o\), which is defined in (51) and is potentially large. As with most penalty methods [34, Chp. 17], starting with a very large penalty parameter may degrade the performance of the method (e.g., causing ill-conditioning). We therefore adopt a dynamic updating rule for \(\rho \). In particular, we choose \(\rho _0 =\frac{\kappa \max \delta _{ij}}{n^{3/2}}\) and update it as
where
and \(\texttt {Ftol}=\ln (\kappa )\times 10^{-4}\) and \(\texttt {Ktol}=10^{-2}\) with \(\kappa \) being the number of non-zero elements of \(\varDelta \). We terminate PREEEDM when
Since the computation at each iteration is dominated by \(\varPi _{\mathcal{K}^n_+(r)} (-D)\) in the construction of the majorization function \(g_m(\cdot , \cdot )\) in (31), the computational complexity is about \(O(rn^2)\) (we used MATLAB's built-in function eigs.m to compute \({\mathrm {PCA}}_r^+(A)\) in (27)). The problem data \(\varDelta \), L and U will be described below. For the initial point, we follow the popular choice used in [44, 47]: \(\sqrt{D^0} := \widehat{\varDelta }\), where \(\widehat{\varDelta }\) is the matrix of shortest path distances computed from \(\varDelta \). If \(\varDelta \) has no missing values, then \(\widehat{\varDelta } = \varDelta \).
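The shortest-path completion \(\widehat{\varDelta }\) can be sketched as follows (pure Python, Floyd–Warshall on the graph whose edge weights are the available \(\delta _{ij}\); treating a zero off-diagonal entry as a missing measurement is an assumption of the sketch):

```python
import math

def shortest_path_complete(Delta):
    """Complete a partial dissimilarity matrix by all-pairs shortest paths
    (Floyd-Warshall). Delta[i][j] == 0 for i != j is treated as missing."""
    n = len(Delta)
    d = [[0.0 if i == j else (Delta[i][j] if Delta[i][j] > 0 else math.inf)
          for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

The initial iterate is then \(D^0_{ij} = \widehat{\varDelta }_{ij}^2\).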
6.2 Benchmark methods
We select six representative state-of-the-art methods for comparison. They are ADMMSNL [37], ARAP [54], EVEDM (short for EepVecEDM) [12], PC [1], PPAS (short for PPA Semismooth) [24] and SFSDP [27]. Those methods have been shown to be capable of returning satisfactory localization/embedding in many applications. We will compare our method PREEEDM with ADMMSNL, ARAP, EVEDM, PC and SFSDP for Sensor Network Localization (SNL, \(r=2\)) problems and with EVEDM, PC, PPAS and SFSDP for Molecular Conformation (MC, \(r=3\)) problems since the current implementations of ADMMSNL, ARAP do not support the embedding for \(r\ge 3\).
We note that ADMMSNL is motivated by [45] and aims to enhance the package diskRelax of [45] for the SNL problems (\(r=2\)). Both methods are based on the stress minimization (5). As mentioned before, SMACOF [13, 14] has been a very popular method for (5). However, we will not compare it with the other methods here since its performance demonstrated in [54, 56] was not very satisfactory (e.g., when compared with ARAP) for either SNL or MC problems. To the best of our knowledge, PC is the only viable method for the model (3) whose code is publicly available. We select SFSDP and PPAS because of their high reputation in the fields of SDP and quadratic SDP for returning quality localizations and conformations. We note that SFSDP is for the model (4), and the methods PPAS and EVEDM are proposed for the model (6). It is worth mentioning that the MADMM package in [29] is capable of solving the Robust MDS (4) as well as other nonsmooth optimization problems. However, MADMM does not include an implementation of its listed Robust MDS example, so we were not able to compare it with ours here. We also implemented the subgradient method of Cayton and Dasgupta [8] for their robust Euclidean embedding. Numerical experiments showed that its performance was similar to PC on our tested problems: it works well when a large number of the dissimilarities in \(\varDelta \) are available and often performs poorly otherwise. Hence, we omitted it from our reported results.
In our tests, we used all of their default parameters except one or two in order to achieve the best results. In particular, for PC, we terminated it when \(|f(D^{k-1})-f(D^{k})|<10^{-4}\times f(D^{k})\) and set its initial point to be the embedding produced by cMDS on \(\varDelta \). For SFSDP, which is a high-level MATLAB implementation of the SDP approach initiated in [50], we set pars.SDPsolver\(=\) “sedumi” because it returns the best overall performance, and pars.objSW\(=1\) when \(m>r+1\) and \(=3\) when \(m=0\). We also note that the parameter pars.minDegree controls the degree of a graph and thus enhances the strength of the SDP relaxation. Numerical experiments have shown that the larger it is, the more accurate the solutions generated by SFSDP might be. However, the computational time shoots up dramatically as it increases, even for small n. Our extensive experiments suggest that its default value (\(\mathtt{pars.minDegree} =r+2\)) is a balanced choice between solution quality and computation time for large n. Hence we used its default setting in our tests. For ARAP, in order to speed up termination, we let \(\texttt {tol} = 0.05\) and \(\texttt {IterNum} =20\) to compute its local neighbour patches. Numerical performance demonstrated that ARAP could yield satisfactory embeddings, but would take a very long time for some examples with large n.
6.3 Numerical comparison
To assess the embedding quality, we adopt a widely used measure RMSD (Root of the Mean Squared Deviation) defined by
where \(\mathbf{x}_i\)’s are the true positions of the sensors or atoms in our test problems and \(\widehat{\mathbf{x}}_i\)’s are their corresponding estimates. The \(\widehat{\mathbf{x}}_i\)’s were obtained by applying the classical MDS (cMDS) method to the final output distance matrix, followed by aligning them to the existing anchors through the well-known Procrustes procedure (see [54, 6, Chp. 20] or [41, Proposition 4.1] for more details). Furthermore, upon obtaining the \(\widehat{\mathbf{x}}_i\)’s, a heuristic gradient method, called the refinement step in [5], can be applied to improve their accuracy. We report rRMSD to highlight its contribution. As we will see, all tested methods benefit from this step, but to varying degrees.
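A minimal sketch of this alignment-then-RMSD computation (Python/NumPy, using the standard SVD solution of the orthogonal Procrustes problem). Note that it aligns and averages over all n points, whereas the procedure described above aligns to the anchors; this simplification is an assumption of the sketch:

```python
import numpy as np

def procrustes_rmsd(X_true, X_est):
    """Align X_est (n x r) to X_true by a translation plus an orthogonal
    transform (Procrustes), then return the root mean squared deviation."""
    Xc = X_true - X_true.mean(axis=0)
    Yc = X_est - X_est.mean(axis=0)
    U, _, Vt = np.linalg.svd(Yc.T @ Xc)
    Q = U @ Vt                        # solves min ||Yc Q - Xc||_F over Q'Q = I
    Y_aligned = Yc @ Q + X_true.mean(axis=0)
    return np.sqrt(np.mean(np.sum((Y_aligned - X_true) ** 2, axis=1)))
```

By construction the measure is zero whenever the estimate is a rigid (or reflected) copy of the truth.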
The general performance of each method can be better appreciated by visualizing four key indicators: RMSD, rRMSD, rTime (time for the refinement step) and CPU Time (in seconds), which is the total time including rTime. Hereafter, for all examples, we test 20 randomly generated instances for each case \((n,m,R,\texttt {nf})\) in SNL or each case \((n,R,\texttt {nf})\) in MC, and record the average results.
6.3.1 Comparison on SNL
SNL has been widely used to test the viability of many existing methods for stress minimization. In such a problem, we typically have m anchors (i.e., sensors with known locations) and the remaining sensors need to be located. We will test two types of SNL problems. One has a regular topological layout (Examples 1 and 2 below); the other has an irregular layout (Example 3).
Example 1
(Square Network with 4 fixed anchors) This example has been widely tested since its detailed study in [5]. In the square region \([-0.5, 0.5]^2\), 4 anchors \(\mathbf{x}_1 = \mathbf{a}_1, \ldots , \mathbf{x}_4 = \mathbf{a}_4\) (\(m=4\)) are placed at \((\pm 0.2, \pm 0.2)\). The remaining \((n-m)\) sensors (\(\mathbf{x}_{m+1}, \ldots , \mathbf{x}_n\)) are generated uniformly at random over the square region. The noisy \(\varDelta \) is generated as follows.
where R is known as the radio range, the \(\epsilon _{ij}\)’s are independent standard normal random variables, and nf is the noise factor (e.g., \(\texttt {nf} = 0.1\) was used and corresponds to a \(10\%\) noise level). In the literature (e.g., [5]), this type of perturbation in \(\delta _{ij}\) is known as multiplicative and follows the unit-ball rule in defining \(\mathcal {N}_x\) and \(\mathcal {N}_a\) (see [3, Sect. 3.1] for more detail). The corresponding weight matrix W and the lower and upper bound matrices L and U are given in the table below. Here, M is a large positive quantity. For example, \(M :=n\max _{ij}\varDelta _{ij}\) is an upper bound on the longest shortest path if the network is viewed as a graph.
(i, j) | \(W_{ij}\) | \(\varDelta _{ij}\) | \(L_{ij}\) | \(U_{ij}\) |
---|---|---|---|---|
\(i=j\) | 0 | 0 | 0 | 0 |
\(i,j\le m\) | 0 | 0 | \(\Vert \mathbf{a}_i - \mathbf{a}_j\Vert ^2\) | \(\Vert \mathbf{a}_i - \mathbf{a}_j\Vert ^2\) |
\((i,j)\in \mathcal {N}\) | 1 | \(\delta _{ij}\) | 0 | \(R^2\) |
otherwise | 0 | 0 | \(R^2\) | \(M^2\) |
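The data generation of Example 1 can be sketched as follows (Python/NumPy). The multiplicative rule \(\delta _{ij} = \Vert \mathbf{x}_i - \mathbf{x}_j\Vert \cdot |1 + \texttt {nf}\cdot \epsilon _{ij}|\) for pairs within range R is our reading of the construction in [5]; the absolute value, which keeps \(\delta _{ij}\) nonnegative, is an assumption of the sketch:

```python
import numpy as np

def generate_square_network(n, R=0.3, nf=0.1, seed=0):
    """Example 1: 4 anchors at (+/-0.2, +/-0.2) and n-4 sensors drawn uniformly
    from [-0.5, 0.5]^2; a pair is observed iff its true distance is <= R."""
    rng = np.random.RandomState(seed)
    anchors = np.array([[0.2, 0.2], [0.2, -0.2], [-0.2, 0.2], [-0.2, -0.2]])
    sensors = rng.uniform(-0.5, 0.5, size=(n - 4, 2))
    X = np.vstack([anchors, sensors])
    dist = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
    mask = (dist <= R) & (dist > 0)              # unit-ball rule, no self-pairs
    eps = rng.randn(n, n)
    eps = np.triu(eps, 1) + np.triu(eps, 1).T    # symmetric noise
    Delta = np.where(mask, dist * np.abs(1.0 + nf * eps), 0.0)
    W = mask.astype(float)
    W[:4, :4] = 0.0                              # anchor-anchor pairs: W_ij = 0
    Delta[:4, :4] = 0.0                          # ... and Delta_ij = 0 (table)
    return X, Delta, W
```

The bound matrices L and U then follow row by row from the table above.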
Example 2
(Square Network with m random anchors) This example, also tested in [5], is similar to Example 1 but with randomly generated anchors. The n points are generated uniformly at random over the square region \([-0.5, 0.5]^2\). The first m points are chosen to be anchors and the last \((n-m)\) points to be sensors. The rest of the data generation is the same as in Example 1.
Example 3
(EDM word network) This problem has an irregular topology and was first used in [3] to challenge existing methods. In this example, n points are randomly generated in a region whose shape resembles the letters “E”, “D” and “M”. The ground truth network is depicted in Fig. 2. We choose the first m points to be the anchors. The rest of the data generation is the same as in Example 1.
(a) Effect of the radio range R The radio range R determines the amount of missing dissimilarities among the elements of \(\varDelta \): the smaller R is, the fewer \(\delta _{ij}\) are available, leading to more challenging problems. Therefore, we first demonstrate the sensitivity of each method to the radio range R. For Example 1, we fix \(n=200,m=4\), nf\(=0.1\), and vary the radio range R over \(\{0.2,0.4,\ldots ,1.4\}\). The average results are shown in Fig. 3. It can be seen that ARAP and PREEEDM were joint winners in terms of both RMSD and rRMSD. However, the time used by ARAP was the longest. When R became bigger than 0.6, ADMMSNL, SFSDP and EVEDM produced similar rRMSD to ARAP and PREEEDM, while the time consumed by ADMMSNL was significantly larger than that by SFSDP, EVEDM and PREEEDM. By contrast, PC only worked well when \(R\ge 1\).
Next we test a number of instances with larger sizes \(n\in \{300, 500,1000,2000\}\). For Example 1, the average results are recorded in Table 1. When \(R=\sqrt{2}\), under which no dissimilarities were missing because Example 1 was generated in a unit region, PC, ARAP and PREEEDM produced the better RMSD (almost in the order of \(10^{-3}\)). But with the refinement step, all methods led to similar rRMSD. This means that SFSDP and EVEDM benefited greatly from the refinement step. For computational speed, PREEEDM outperformed the others, followed by PC, EVEDM and SFSDP. By contrast, ARAP consumed too much time even for \(n=500\). When \(R=0.2\), the picture was significantly different since a large number of dissimilarities in \(\varDelta \) were unavailable. Basically, ADMMSNL, PC and SFSDP failed to localize even with the refinement, due to undesirable RMSD and rRMSD (both in the order of \(10^{-1}\)). Clearly, ARAP and PREEEDM produced the best RMSD and rRMSD, and EVEDM achieved comparable rRMSD but inaccurate RMSD. In terms of computational speed, EVEDM and PREEEDM were very fast, consuming about 30 s to solve problems with \(n=2000\) nodes. By contrast, ARAP was still the slowest, followed by ADMMSNL and PC.
Now we test those methods on the irregular network in Example 3. The average results are recorded in Table 2. We note that this example was generated in the region \([0,1]\times [0,0.5]\) as presented in Fig. 2. This implies that no dissimilarities in \(\varDelta \) were missing when \(R=\sqrt{1.25}\), while a large number of dissimilarities in \(\varDelta \) were missing when \(R=0.1\). When \(R=\sqrt{1.25}\), it can be clearly seen that SFSDP and EVEDM failed to localize before the refinement step due to their large RMSD (in the order of \(10^{-1}\)), whilst the other four methods succeeded. However, they all achieved a similar rRMSD after the refinement, except for EVEDM in the case \(n=500\). Still, PREEEDM ran the fastest and ARAP came last (5.13 s vs. 2556.3 s when \(n=500\)). Their performances for the case \(R=0.1\) were in sharp contrast. PREEEDM generated the most accurate RMSD and rRMSD (in the order of \(10^{-3}\)), whilst the results of the other methods were only in the order of \(10^{-2}\). Obviously, ADMMSNL, PC and EVEDM failed to localize. Compared with the other methods, EVEDM and PREEEDM were joint winners in terms of computational speed, using only about 30 s when \(n=2000\) (a larger scale network); but we should mention that EVEDM failed to localize.
(b) Effect of the number of anchors m As one would expect, more anchors lead to more available information, and hence to easier localization. In this part, we demonstrate the effect of varying the number of anchors on the six methods. For Example 2, we fix \(n=200,R=0.2\), nf\(=0.1\) and choose m from \(\{5,10,\ldots ,40\}\). As illustrated in Fig. 4, ARAP and PREEEDM were again joint winners in terms of RMSD and rRMSD, and the rRMSD produced by the other methods declined rapidly as more anchors were used. Moreover, PREEEDM was the fastest, followed by EVEDM, PC and SFSDP, whilst ADMMSNL and ARAP were quite slow.
For Example 3 with fixed \(n=500, R=0.1\), nf\(=0.1\), we test \(m\in \{10,30,50\}\). As depicted in Fig. 5, ARAP and PREEEDM were always capable of capturing the shape of the letters ‘E’, ‘D’ and ‘M’, similar to Fig. 2. By contrast, SFSDP and EVEDM produced a desirable outline of the three letters only when \(m=50\), and the localization quality of both ADMMSNL and PC improved as m increased, but still with a deformed letter ‘M’.
Finally, we test a number of instances of Example 2 with \(n\in \{300, 500, 1000, 2000\}\) and \(m \in \{10, 50\}\). The average results are recorded in Table 3. When \(m=10\), ADMMSNL and PC produced undesirable RMSD and rRMSD (both in the order of \(10^{-1}\)). SFSDP benefited greatly from the refinement because it generated relatively inaccurate RMSD. By contrast, the other three methods achieved successful localization, except for EVEDM in the case \(n=300\). With regard to computational speed, EVEDM and PREEEDM were the fastest, followed by SFSDP, PC, ADMMSNL and ARAP. When \(m=50\), more information was available and the results improved, especially for ADMMSNL and PC. But PC still relied heavily on the refinement step to obtain a satisfactory localization. The other five methods produced satisfactory localizations with varying degrees of accuracy. It is encouraging to see that PREEEDM produced the most accurate rRMSD for all cases. The comparison of computational speed is similar to the case \(m=10\). We repeated the test for Example 3 and the average results are recorded in Table 4, where we observed a similar performance of the six methods as for Example 2. We omit the details.
(c) Effect of the noise factor nf To see the dependence of the performance of each method on the noise factor, we first test Example 3, fixing \(n=200, m=10, R=0.3\) and varying the noise factor \(\texttt {nf}\in \{0.1,0.2,\ldots ,0.7\}\). As shown in Fig. 6, in terms of RMSD, ARAP obtained the smallest values, whilst EVEDM and PC obtained the worst. The curve of ADMMSNL dropped over \(0.1\le \texttt {nf}\le 0.3\) and then ascended. By contrast, the curve of PREEEDM peaked at \(\texttt {nf}=0.3\) but declined afterwards and gradually approached the RMSD of ARAP. However, after the refinement step, ARAP, SFSDP and PREEEDM all achieved a similar rRMSD, while the other three methods produced undesirable ones. Apparently, EVEDM was indeed the fastest (yet with the worst rRMSD), followed by PC, SFSDP and PREEEDM. Again, ARAP and ADMMSNL were quite slow.
Next, we test Example 2 with a moderate size (for the visualization purpose of Fig. 7) \(n=200, m=4\), \(R=0.3\), varying \(\texttt {nf}\in \{0.1,0.3,0.5\}\). The actual embedding by each method is shown in Fig. 7, where the four anchors are plotted as green squares and each estimate \(\widehat{\mathbf{x}}_i\) (pink point) is joined to its ground truth location (blue circle). It can be clearly seen that ARAP and PREEEDM were quite robust to the noise factor since their localizations matched the ground truth well. EVEDM failed to localize when \(\texttt {nf}=0.5\). By contrast, SFSDP generated worse results as nf got bigger, and ADMMSNL and PC failed to localize in all cases.
Finally, we test Example 1 with larger sizes \(n\in \{300, 500, 1000, 2000\}\) and fixed \(m=4, R=0.3\). The average results are recorded in Table 5. When \(\texttt {nf}=0.1\), ADMMSNL and PC failed to render accurate embeddings. Compared with ARAP, EVEDM and PREEEDM, SFSDP generated larger RMSD and rRMSD. Again, EVEDM and PREEEDM ran faster than ARAP. When \(\texttt {nf}=0.7\), the results were different. ARAP and PREEEDM were still able to produce high-quality RMSD and rRMSD; however, the former took an extremely long time (16,617 s vs. 83 s). By contrast, ADMMSNL and PC again failed to reconstruct the network. Furthermore, EVEDM produced large RMSD but comparable rRMSD when \(n\le 1000\), and it failed when \(n=2000\).
6.3.2 Comparison on MC
MC has long been an important application of EDM optimization [2, 21, 33]. We will test two types of MC problems, respectively from an artificial data set and a real data set from the Protein Data Bank (PDB) [4]. For the former, we adopt the rule for generating data from [2, 33]. For the latter, we used the real data of 12 molecules derived from 12 protein structures in the PDB: 1GM2, 304D, 1PBM, 2MSJ, 1AU6, 1LFB, 104D, 1PHT, 1POA, 1AX8, 1RGS, 2CLJ. They provide a good set of test problems in terms of the size n, which ranges from a few hundred to a few thousand (the smallest \(n=166\) for 1GM2 and the largest \(n=4189\) for 2CLJ). The distance information was obtained in a realistic way, as done in [24].
Example 4
(Artificial data) As described in [2, 33], the artificial molecule has \(n = s^3\) atoms \((\mathbf{x}_{1}, \ldots , \mathbf{x}_n)\) located in the three-dimensional lattice
for some integer \(s \ge 1\), i.e., \(\mathbf{x}_{i}=(i_1, i_2, i_3)^T\). We define \(\mathcal {N}_{x}\), the index set on which the \(\delta _{ij}\) are available, as:
where \(p(\mathbf{x}_{i}):=1+(1,s,s^2)^T\mathbf{x}_{i}=1+i_1+si_2+s^2i_3\) and R is a given constant (e.g., \(R=s^2\)). The corresponding dissimilarity matrix \(\varDelta \), weight matrix W and the lower and upper bound matrices L and U are given as in the table below. Here the generation of \(\delta _{ij}\) is the same as Example 1.
(i, j) | \(W_{ij}\) | \(\varDelta _{ij}\) | \(L_{ij}\) | \(U_{ij}\) |
---|---|---|---|---|
\(i=j\) | 0 | 0 | 0 | 0 |
\((i,j)\in \mathcal {N}_{x}\) | 1 | \(\delta _{ij}\) | 1 | \(\max _{(i,j)\in \mathcal {N}_{x}} || \mathbf{x}_{i} -\mathbf{x}_{j} ||^2\) |
otherwise | 0 | 0 | 1 | \(3(s-1)^2\) |
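The lattice and the index set \(\mathcal {N}_x\) of Example 4 can be generated as follows (pure Python); with the ordering chosen below, \(p(\mathbf{x}_i)\) equals the atom's 1-based index:

```python
def artificial_molecule(s, R=None):
    """Example 4: n = s^3 atoms on the lattice {0,...,s-1}^3; a pair (i,j) is
    observed when |p(x_i) - p(x_j)| <= R, with p(x) = 1 + x1 + s*x2 + s^2*x3."""
    if R is None:
        R = s * s                                 # the suggested choice R = s^2
    atoms = [(i1, i2, i3) for i3 in range(s) for i2 in range(s)
             for i1 in range(s)]
    p = [1 + x[0] + s * x[1] + s * s * x[2] for x in atoms]
    Nx = [(i, j) for i in range(len(atoms)) for j in range(i + 1, len(atoms))
          if abs(p[i] - p[j]) <= R]
    return atoms, Nx
```

The dissimilarities \(\delta _{ij}\) on \(\mathcal {N}_x\) are then generated exactly as in Example 1.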
Example 5
(Real PDB data) Each molecule comprises n atoms \(\{\mathbf{x}_1,\ldots , \mathbf{x}_n\}\) in \(\mathfrak {R}^3\) and its distance information is collected as follows. If the Euclidean distance between two of the atoms is less than R, the distance is chosen; otherwise no distance information about this pair is known. For example, \(R= 6{\AA }~(1{\AA } = 10^{-8}\)cm) is nearly the maximal distance that a nuclear magnetic resonance (NMR) experiment can measure between two atoms. For realistic molecular conformation problems, not all the distances below R are known from NMR experiments, so one may obtain only \(c\%\) (e.g., \(c=50\)) of all the distances below R. Denote by \(\mathcal {N}_{x}\) the set formed by the indices of those measured distances. Moreover, the distances in \(\mathcal {N}_{x}\) cannot be measured exactly. Instead, only lower bounds \(\ell _{ij}\) and upper bounds \(u_{ij}\) are provided; that is, for \((i,j)\in \mathcal {N}_{x},\)
where \(\epsilon _{ij},\varepsilon _{ij} \sim N(0, \texttt {nf}^2 \times \pi /2)\) are independent normal random variables. In our test, we set the noise factor \(\texttt {nf}=0.1\) and the parameters \(W,\varDelta , L, U \in \mathcal{S}^n\) are given as in the table below, where \(M>0\) is the upper bound (e.g., \(M :=n\max _{ij}\varDelta _{ij}\)).
(i, j) | \(W_{ij}\) | \(\varDelta _{ij}\) | \(L_{ij}\) | \(U_{ij}\) |
---|---|---|---|---|
\(i=j\) | 0 | 0 | 0 | 0 |
\((i,j)\in \mathcal {N}_{x}\) | 1 | \((\ell _{ij}+ u_{ij})/2\) | \(\ell ^2_{ij}\) | \(u^2_{ij}\) |
otherwise | 0 | 0 | 0 | \(M^2\) |
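The interval data of Example 5 can be sketched as follows (Python/NumPy). The specific rule \(\ell _{ij} = \max (0, 1-|\epsilon _{ij}|)\, d_{ij}\) and \(u_{ij} = (1+|\varepsilon _{ij}|)\, d_{ij}\) is a common construction in the MC literature and an assumption of this sketch; note that with variance \(\texttt {nf}^2\times \pi /2\), the expected value of \(|\epsilon _{ij}|\) is exactly nf, so the bounds deviate from \(d_{ij}\) by about \(\texttt {nf}\cdot d_{ij}\) on average:

```python
import numpy as np

def noisy_bounds(dist, nf=0.1, seed=0):
    """Example 5-style interval data: for each measured distance d, produce
    l <= d <= u with l = max(0, 1 - |eps|)*d and u = (1 + |eps'|)*d, where
    eps, eps' ~ N(0, nf^2 * pi / 2).  (Clipping l at 0 is an assumption.)"""
    rng = np.random.RandomState(seed)
    sigma = nf * np.sqrt(np.pi / 2.0)
    eps_l = np.abs(sigma * rng.randn(*np.shape(dist)))
    eps_u = np.abs(sigma * rng.randn(*np.shape(dist)))
    lower = np.maximum(0.0, 1.0 - eps_l) * dist
    upper = (1.0 + eps_u) * dist
    return lower, upper
```

Per the table above, the working dissimilarity is then \(\varDelta _{ij} = (\ell _{ij} + u_{ij})/2\) with \(L_{ij} = \ell _{ij}^2\) and \(U_{ij} = u_{ij}^2\).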
As mentioned before, the current implementations of ADMMSNL and ARAP do not support embedding for \(r\ge 3\) and are thus excluded from the following comparison, where the method PPAS is added. The main reason for adding PPAS is that it is particularly suitable and credible for MC problems [24, 25].
(d) Test on Example 4. To see the performance of each method on this problem, we first fix \(s=6 \; (n=6^3)\) and \(\texttt {nf}=0.1\) while varying \(R\in \{36,38,\ldots ,48\}\). We note that the percentage of available dissimilarities increased from 32.47% to 39.87% as R increased from 36 to 48, making the problem 'easier' for conformation. The average results are recorded in Fig. 8. Clearly, PREEEDM and PPAS outperformed the other three methods in terms of RMSD and rRMSD. The former generated the best RMSD when \(R\ge 42\) while the latter got the best RMSD when \(R\le 42\), and both obtained similar rRMSD. As for computational speed, PREEEDM ran much faster than PPAS. By contrast, the other three methods failed to produce accurate embeddings, as indicated by their worse RMSD and rRMSD. Notice that the refinement step does not always improve the final results: for instance, the rRMSD yielded by SFSDP was bigger than its RMSD for each R.
We then test the example with \(s=6\; (n=6^3)\) and \(R=s^2\) fixed while varying \(\texttt {nf}\in \{0.1,0.2,\ldots ,0.5\}\). As illustrated in Fig. 9, PREEEDM and PPAS were the joint winners in terms of RMSD and rRMSD. In particular, our method produced the best RMSD when \(\texttt {nf}\ge 0.2\) and also ran much faster than PPAS. The other three methods again failed to obtain desirable RMSD and rRMSD, regardless of the time they used.
Finally, for larger problems with \(n=s^3\) and \(s\in \{7,8,\ldots ,13\}\), the average results are presented in Fig. 10, where we omitted the results of PPAS for \(s>10\) because it took too much time to terminate. It is worth mentioning that the percentage of available dissimilarities over all elements of \(\varDelta \) decreases from \(26.78\%\) to \(14.83\%\) as s increases from 7 to 13, making the problems increasingly challenging. Clearly, PC, SFSDP and EVEDM failed to locate all atoms in \(\mathfrak {R}^3\). PPAS produced the most accurate RMSD when \(s\le 10\), whilst PREEEDM achieved the most accurate RMSD when \(s >10\) and the most accurate rRMSD in all cases. Equally important, PREEEDM spent less than 50 s on every tested case, while PPAS took much longer to terminate (e.g., consuming over 2000 s when \(s\ge 10\)).
(e) Test on Example 5. For the 12 collected real datasets, we fixed \(R = 6\), \(c = 50\%\) and \(\texttt {nf}= 0.1\). The embeddings generated by the five methods for the three molecules 1GM2, 1AU6 and 1LFB are shown in Fig. 11, where the true and estimated positions of the atoms are plotted as blue circles and pink stars, respectively, and each pink star is linked to its corresponding blue circle by a pink line. For these three datasets, PREEEDM and PPAS almost recovered the shape of the original data, whereas the other three methods clearly failed to do so. The complete numerical results for the 12 problems are reported in Table 6. PREEEDM and PPAS performed significantly better in terms of RMSD and rRMSD than the other methods. What is more impressive is that PREEEDM used only a small fraction of the time required by PPAS, which in general took a relatively long time to terminate. For example, PREEEDM used only 22.64 s for 2CLJ, a very large dataset with \(n = 4189\); in contrast, we had to omit the result of PPAS for this instance (as well as for several other tested instances; missing results are indicated by "–" in Table 6) because it took too long to terminate.
6.4 Robustness of PREEEDM
The excellent performance of PREEEDM reported above is largely due to its robustness to noise. The previous examples all used Gaussian noise. We now demonstrate that PREEEDM works much better than the other methods when the noise comes from a heavy-tailed distribution, for instance a t-distribution with a small degree of freedom. We also take this opportunity to test our own SQREDM solver [56], which likewise uses penalty, majorization and minimization techniques, but for the least-squares problem (5). We will see that PREEEDM outperforms SQREDM under both types of noise (Gaussian and t-distributed).
To shorten the presentation, we restrict our numerical tests to two representative examples: Example 1 with \(n=100, R=0.3\) and Example 4 with \(s=5, R=s^2\). For each example, we generate 20 instances under two types of noise: the standard normal distribution and the Student-t distribution with 1 degree of freedom. We vary nf over \(\{0.1,0.2,\ldots ,0.9\}\) and \(\{0.01,0.02,\ldots ,0.09\}\) for the Gaussian and Student-t noise, respectively. The average RMSD is recorded in Fig. 12. We have the following observations.
(i)
PREEEDM is competitive under Gaussian noise. For Example 1, Fig. 12a shows that ARAP yielded the best RMSD, followed by PREEEDM and SQREDM. For Example 4, Fig. 12c shows that PREEEDM produced the smallest RMSD in most cases, followed by PPAS and SQREDM (note that the current implementation of ARAP only supports \(r=2\) and hence is not applicable to this example).
In particular, when nf exceeds 0.5, PPAS and PREEEDM closely follow each other. The behaviour of these methods under Gaussian noise is expected, as the least-squares formulation is then equivalent to the maximum-likelihood criterion. On the one hand, least squares favours large distances; on the other hand, under Gaussian (thin-tailed) noise the number of large distance errors is relatively small and hence does not cause significant distortion in locating the unknown sensors.
(ii)
PREEEDM performs best under heavy-tailed noise (from the Student \(t_1\) distribution). For Example 1, both PREEEDM and SQREDM behaved much better than the other methods; see Fig. 12b. For Example 4, PREEEDM stood out as the best method when nf exceeds 0.02 and was much better than SQREDM; see Fig. 12d. The test data now contain more large distance errors than under the Gaussian distribution, and the absolute-value criterion alleviates the tendency to favour large distances. Therefore, PREEEDM yielded the best performance in these situations.
We conclude from these two examples that PREEEDM, based on the model (3), is robust to noise.
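The two noise regimes used in this subsection can be sketched as follows. The multiplicative form \(\delta = |d(1 + \texttt{nf}\cdot e)|\) is an illustrative assumption, not the paper's exact perturbation formula; only the noise distributions (standard normal vs. Student-t with 1 degree of freedom, scaled by nf) follow the text.

```python
import numpy as np

def noisy_dissimilarity(d, nf, noise="gaussian", rng=None):
    """Perturb true distances d by noise scaled with the noise factor nf.

    noise = "gaussian":  e ~ standard normal (thin-tailed);
    noise = "student-t": e ~ Student-t with 1 degree of freedom (heavy-tailed),
    as in Sect. 6.4. The multiplicative model below is an assumption here.
    """
    rng = np.random.default_rng(rng)
    if noise == "gaussian":
        e = rng.standard_normal(np.shape(d))
    else:
        e = rng.standard_t(1, np.shape(d))       # heavy tails: frequent outliers
    return np.abs(np.asarray(d) * (1.0 + nf * e))
```

Under the Student-t model, even a small nf produces occasional very large distance errors, which is exactly the regime where the \(\ell _1\)-based REE model is expected to outperform least-squares formulations.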
7 Conclusion
The purpose of this paper is to develop an efficient method for one of the most challenging distance embedding problems in a low-dimensional space, a problem that has been widely studied under the framework of multi-dimensional scaling. The problem employs the \(\ell _1\) norm to quantify the embedding errors. Hence, the resulting model (3) is robust to outliers and is referred to as the robust Euclidean embedding (REE) model.
To the best of the authors' knowledge, the only viable method for REE whose MATLAB code is publicly available is the PlaceCenter (PC) algorithm proposed in [1]. Our extensive numerical results on the SNL and MC test problems convincingly demonstrate that the proposed PREEEDM method outperforms PC in terms of both embedding quality and CPU time. Moreover, PREEEDM is comparable to several state-of-the-art methods for other embedding models in terms of embedding quality, but is far more efficient in terms of CPU time. This advantage becomes even more pronounced as the problem size grows.
The novelty of the proposed PREEEDM lies in its creative use of the Euclidean distance matrix and a computationally efficient majorization technique to derive its subproblem, which has a closed-form solution closely related to the positive root of the classical depressed cubic equation. Furthermore, a great deal of effort has been devoted to its convergence analysis, which well justifies the numerical performance of PREEEDM. We believe that PREEEDM will become a very competitive embedding method in the fields of SNL and MC and expect its wide use in other visualization problems.
References
Agarwal, A., Phillips, J.M., Venkatasubramanian, S.: Universal multi-dimensional scaling, In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1149–1158, ACM (2010)
An, L.T.H., Tao, P.D.: Large-scale molecular optimization from distance matrices by a dc optimization approach. SIAM J. Optim. 14, 77–114 (2003)
Bai, S., Qi, H.-D.: Tackling the flip ambiguity in wireless sensor network localization and beyond. Digit. Signal Process. 55, 85–97 (2016)
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The protein data bank. Nucleic Acids Res. 28, 235–242 (2000)
Biswas, P., Liang, T.-C., Toh, K.-C., Wang, T.-C., Ye, Y.: Semidefinite programming approaches for sensor network localization with noisy distance measurements. IEEE Trans. Auto. Sci. Eng. 3, 360–371 (2006)
Borg, I., Groenen, P.J.F.: Modern Multidimensional Scaling: Theory and Applications. Springer Series in Statistics, 2nd edn. Springer, Berlin (2005)
Burton, D.M.: The History of Mathematics, 7th edn. McGraw-Hill, New York City (2011)
Cayton, L., Dasgupta, S.: Robust Euclidean embedding. In: Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, pp. 169–176 (2006)
Chen, Y.Q., Xiu, N.H., Peng, D.T.: Global solutions of non-Lipschitz \(S_{2}-S_{p}\) minimization over the positive semidefinite cone. Optim. Lett. 8, 2053–2064 (2014)
Cox, T.F., Cox, M.A.A.: Multidimensional Scaling, 2nd edn. Chapman and Hall/CRC, Boca Raton (2001)
Ding, C., Qi, H.-D.: Convex optimization learning of faithful Euclidean distance representations in nonlinear dimensionality reduction. Math. Program. 164, 341–381 (2017)
Drusvyatskiy, D., Krislock, N., Voronin, Y.L., Wolkowicz, H.: Noisy Euclidean distance realization: robust facial reduction and the Pareto frontier. SIAM J. Optim. 27(4), 2301–2331 (2017)
de Leeuw, J.: Applications of Convex analysis to multidimensional scaling. In: Barra, J., Brodeau, F., Romier, G., van Cutsem, B. (eds.) Recent Developments in Statistics, pp. 133–145. North Holland Publishing Company, Amsterdam, The Netherlands (1977)
de Leeuw, J., Mair, P.: Multidimensional scaling using majorization: Smacof in R. J. Stat. Softw. 31, 1–30 (2009)
Biswas, P., Ye, Y.: Semidefinite programming for ad hoc wireless sensor network localization. In: Proceedings of the 3rd IPSN, pp. 46–54. Berkeley, CA (2004)
Drusvyatskiy, D., Krislock, N., Voronin, Y.-L., Wolkowicz, H.: Noisy Euclidean distance realization: robust facial reduction and the Pareto frontier. SIAM J. Optim. 27, 2301–2331 (2017)
France, S.L., Carroll, J.D.: Two-way multidimensional scaling: a review. IEEE Trans. Syst. Man Cyber. Part C 41, 644–661 (2011)
Gao, Y.: Structured Low Rank Matrix Optimization Problems: a Penalty Approach, PhD Thesis, National University of Singapore (2010)
Gaffke, N., Mathar, R.: A cyclic projection algorithm via duality. Metrika 36, 29–54 (1989)
Glunt, W., Hayden, T.L., Hong, S., Wells, J.: An alternating projection algorithm for computing the nearest Euclidean distance matrix. SIAM J. Matrix Anal. Appl. 11, 589–600 (1990)
Glunt, W., Hayden, T.L., Raydan, R.: Molecular conformations from distance matrices. J. Comput. Chem. 14, 114–120 (1993)
Gower, J.C.: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325–338 (1966)
Heiser, W.J.: Multidimensional scaling with least absolute residuals. In: Proceedings of the First Conference of the International Federation of Classification Societies (IFCS), pp. 455–462. Germany, Aachen (1987)
Jiang, K.F., Sun, D.F., Toh, K.C.: Solving Nuclear Norm Regularized and Semidefinite Matrix Least Squares Problems with Linear Equality Constraints, Discrete Geometry and Optimization, pp. 133–162. Springer International Publishing, Berlin (2013)
Jiang, K.F., Sun, D.F., Toh, K.-C.: A partial proximal point algorithm for nuclear norm regularized matrix least squares problems. Math. Progr. Comput. 6, 281–325 (2014)
Kanzow, C., Qi, H.-D.: A QP-free constrained Newton-type method for variational inequality problems. Math. Progr. 85, 81–106 (1999)
Kim, S., Kojima, M., Waki, H., Yamashita, M.: Algorithm 920: SFSDP: a sparse version of full semidefinite programming relaxation for sensor network localization problems. ACM Trans. Math. Softw. 38(4), 27:1–27:19 (2012)
Korkmaz, S., Van der Veen, A.J.: Robust localization in sensor networks with iterative majorization techniques. In: ICASSP, pp. 2049–2052 (2009)
Kovnatsky, A., Glashoff, K., Bronstein M.M.: MADMM: a generic algorithm for non-smooth optimization on manifolds. In: European Conference on Computer Vision, Springer, Cham., pp. 680–696 (2016)
Kruskal, J.B.: Nonmetric multidimensional scaling: a numerical method. Psychometrika 29, 115–129 (1964)
Mandanas, F.D., Kotropoulos, C.L.: Robust multidimensional scaling using a maximum correntropy criterion. IEEE Trans. Signal Process. 65, 919–932 (2017)
Micchelli, C.A.: Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constr. Approx. 2, 11–22 (1986)
Moré, J.J., Wu, Z.: Global continuation for distance geometry problems. SIAM J. Optim. 7, 814–836 (1997)
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, Berlin (2006)
Oğuz-Ekim, P., Gomes, J.P., Xavier, J., Oliveira, P.: Robust localization of nodes and time-recursive tracking in sensor networks using noisy range measurements. IEEE Trans. Signal Process. 59, 3930–3942 (2011)
Peng, D.T., Xiu, N.H., Yu, J.: \(S_{1/2}\) regularization methods and fixed point algorithms for affine rank minimization problems. Comput. Optim. Appl. 67, 543–569 (2017)
Piovesan, N., Erseghe, T.: Cooperative localization in WSNs: a hybrid convex/non-convex solution. IEEE Trans. Signal and Information Processing over Networks. https://doi.org/10.1109/TSIPN.2016.2639442. (IEEE early access article, 2016)
Pong, T.K.: Edge-based semidefinite programming relaxation of sensor network localization with lower bound constraints. Comput. Optim. Appl. 53, 23–44 (2012)
Qi, H.-D.: A semismooth Newton method for the nearest Euclidean distance matrix problem. SIAM J. Matrix Anal. Appl. 34, 67–93 (2013)
Qi, H.-D., Yuan, X.M.: Computing the nearest Euclidean distance matrix with low embedding dimensions. Math. Progr. 147, 351–389 (2014)
Qi, H.-D., Xiu, N.H., Yuan, X.M.: A Lagrangian dual approach to the single source localization problem. IEEE Trans. Signal Process. 61, 3815–3826 (2013)
Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis, 3rd edn. Springer, Berlin (2009)
Schoenberg, I.J.: Remarks to Maurice Fréchet's article "Sur la définition axiomatique d'une classe d'espaces vectoriels distanciés applicables vectoriellement sur l'espace de Hilbert". Ann. Math. 36, 724–732 (1935)
Shang, Y., Ruml, W., Zhang, Y., Fromherz, M.P.J.: Localization from mere connectivity. In: Proceedings of the 4th ACM International Symposium on Mobile Ad Hoc Networking & Computing, MobiHoc 03, ACM, New York, NY, USA, pp. 201–212 (2003)
Soares, C., Xavier, J., Gomes, J.: Simple and fast convex relaxation method for cooperative localization in sensor networks using range measurements. IEEE Trans. Signal Process. 63(17), 4532–4543 (2015)
Sun, Y., Babu, P., Palomar, D.P.: Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Trans. Signal Process. 65, 794–816 (2017)
Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000)
Toh, K.C.: An inexact path-following algorithm for convex quadratic SDP. Math. Progr. 112, 221–254 (2008)
Torgerson, W.S.: Multidimensional scaling: I. Theory and method. Psychometrika 17, 401–419 (1952)
Wang, Z., Zheng, S., Ye, Y., Boyd, S.: Further relaxations of the semidefinite programming approach to sensor network localization. SIAM J. Optim. 19, 655–673 (2008)
Xing, F.C.: Investigation on solutions of cubic equations with one unknown. J. Central Univ. Nat. (Natural Sci. Ed.) 12, 207–218 (2003)
Xu, Z., Chang, X., Xu, F., Zhang, H.: \(\text{ L }_{1/2}\) regularization: a thresholding representation theory and a fast solver. IEEE Trans. Neural Netw. Learn. Sys. 23, 1013–1027 (2012)
Young, G., Householder, A.S.: Discussion of a set of points in terms of their mutual distances. Psychometrika 3, 19–22 (1938)
Zhang, L., Liu, L., Gotsman, C., Gortler, S.J.: An as-rigid-as-possible approach to sensor network localization. ACM Trans. Sen. Netw. 6, 35:1–35:21 (2010)
Zhang, L., Wahba, G., Yuan, M.: Distance shrinkage and Euclidean embedding via regularized kernel estimation. J. R. Stat. Soc.: Ser. B 78, 849–867 (2016)
Zhou, S.L., Xiu, N.H., Qi, H.D.: A fast matrix majorization-projection method for penalized stress minimization with box constraints. IEEE Trans. Signal Process. 66, 4331–4346 (2018)
Acknowledgements
The authors are grateful to the three anonymous referees for their constructive comments, which have helped to improve the quality of the paper. We would also like to thank Mr Nicola Piovesan and Prof Nathan Krislock respectively for sharing with us their excellent packages ADMMSNL and EepVecEDM. In particular, we thank Prof Kim-Chuan Toh for his encouragement over the past years in algorithmic implementation, acquiring PDB data and for sharing with us his well-structured package PPAS. This work is partly supported by the National Natural Science Foundation of China (11728101, 71611130218).
Appendix
1.1 Proof of Lemma 1
Proof
We prove the result by considering three cases: Case 1: \(0<x < \delta ^2\); Case 2: \(x > \delta ^2\); and Case 3: \(x=\delta ^2\). For simplicity, we write \(\phi (x)\) for \(\phi _{\delta }(x)\) throughout the proof. Let \(\zeta :=\eta /(2\sqrt{x})\); then \(\zeta \in \partial \phi (x)\) is equivalent to \(\eta \in {\mathrm {sign}}(\sqrt{x}-\delta )\).
Case 1: \(0<x<\delta ^2\). For this case, \({\mathrm {sign}}(\sqrt{x} - \delta ) = \{-1\}\) and \(\eta =-1\). We note that \(\phi (x)=\delta -\sqrt{x}\) is convex and differentiable for \(0<x< \delta ^2\). Thus,
For \(y\ge \delta ^2\), we have the following chain of inequalities
Hence, we proved the conclusion for this case.
Case 2: \(x>\delta ^2\). For this case, \({\mathrm {sign}}(\sqrt{x} - \delta ) = \{1\}\) and \(\eta =1\). By defining \(\varPhi (\theta , \mu ) := \theta (\theta ^2 - \mu ^2)^2 - 4\delta ^3(\theta + \mu )^2 + 16 \theta \delta ^4\) with \(\theta >\delta \) and \(0<\mu <\delta \), we have
which indicates that \(\varPhi (\theta , \mu )\) is non-increasing with respect to \(\mu \) and thus
For \(0<y<\delta ^2\), we have
For \(y\ge \delta ^2\), we have the following chain of inequalities
Hence, we proved the claim for this case.
Case 3: \(x=\delta ^2\). For this case, \({\mathrm {sign}}(\sqrt{x} - \delta ) = [-1, 1]\) and \(-1 \le \eta \le 1\). For \(0<y<\delta ^2\), we have
where the first and last inequalities hold due to \(y<\delta ^2=x\) and \(|\eta |\le 1\). For \(y\ge \delta ^2\), similar to obtaining (60), we have
where the last inequality is due to \(|\eta |\le 1\) and \(x-y \le 0\).
Having verified the claim in all three cases, we complete the proof. \(\square \)
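The subgradient used throughout the proof of Lemma 1 can be made concrete. Below is a minimal sketch, grounded in the proof's notation: \(\phi _{\delta }(x) = |\sqrt{x}-\delta |\) for \(x>0\), with subgradients \(\zeta = \eta /(2\sqrt{x})\), \(\eta \in {\mathrm {sign}}(\sqrt{x}-\delta )\); at \(x=\delta ^2\) any \(\eta \in [-1,1]\) is valid, and the choice \(\eta =0\) here is one arbitrary selection.

```python
def phi(x, delta):
    """phi_delta(x) = |sqrt(x) - delta| for x > 0."""
    return abs(x ** 0.5 - delta)

def phi_subgradient(x, delta):
    """One element of the subdifferential of phi_delta at x > 0:
    zeta = eta / (2*sqrt(x)) with eta in sign(sqrt(x) - delta).
    At x = delta^2 (Case 3), any eta in [-1, 1] works; we pick eta = 0."""
    s = x ** 0.5
    eta = 0.0 if s == delta else (1.0 if s > delta else -1.0)  # Cases 2, 1
    return eta / (2.0 * s)
```

For instance, at \(x=4, \delta =1\) (Case 2) the function is \(\sqrt{x}-1\) and the subgradient reduces to the classical derivative \(1/(2\sqrt{x})=0.25\).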
1.2 Proof of Proposition 1
Proof
Since \(D^*_\rho \) is an optimal solution of (16), we have \(D^*_\rho \in \mathcal{B}\). For any feasible solution D of (11) (i.e., \(g(D)=0, D\in \mathcal{B}\) in (15)), the following chain of inequalities holds.
which, together with the feasibility of \(D_r\) for (11), yields
and the feasibility of \(D^*\) for (11) gives
It is well known from [42, Theorem 6.12] that a necessary condition for \(D^*_\rho \) being (locally) optimal of (16) is \(0\in \partial (f(D^*_\rho ) + \rho g(D^*_\rho )+I_{\mathcal{B}}(D^*_\rho )),\) which in turn implies (37). This together with \(g(D^*_\rho ) \le \epsilon \) completes our proof. \(\square \)
1.3 Proof of Proposition 2
Proof
For notational simplicity, denote \(z:=x^-_{\omega , \beta }\). Let us consider
By noticing that the second derivative \(q''_-(x) = 1 +(\beta /4) x^{-3/2} > 1\) for all \(x >0\), \(q_-(x)\) is strongly convex over \((0, \infty )\). It has been proved in [56, Proposition 3.1] that \(z>0\) is the optimal solution of (61). Since \(q_-(x)\) is a univariate convex function, its minimizer over \(B_-\) is the projection of z onto \(B_-\), i.e., \(x^*_- = \varPi _{B_-}(x^-_{\omega , \beta })\).
Note that z is the optimal solution of (61) and \(z>0\); we must therefore have \(q'_-(z)= z-\omega -\beta /(2\sqrt{z})=0\). If \(z\le 1\), then \(\sqrt{z} \ge z\), implying \(\sqrt{z}-\omega - \beta /(2\sqrt{z})\ge q'_-(z)=0\), which is equivalent to \({z}\ge \gamma _{\omega , \beta }>0\). Thus \({z}\ge \min \{1,\gamma _{\omega , \beta }\}\) and it holds that \( x_-^* =\varPi _{B_-}(z) = \min \{\delta ^2,\max \{a,\; z\}\} \ge \min \{\delta ^2,\; 1, \; \gamma _{\omega , \beta } \}, \) which is the claimed lower bound for \(x^*_-\). \(\square \)
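The construction in this proof can be sketched numerically. From the stationarity condition \(q'_-(z)= z-\omega -\beta /(2\sqrt{z})=0\), the objective is, up to an additive constant, \(q_-(x) = x^2/2 - \omega x - \beta \sqrt{x}\); the bisection root-finder below is an implementation choice, not part of the paper's closed-form derivation.

```python
def prop2_minimizer(omega, beta, a, delta, tol=1e-12):
    """Minimize q_-(x) = x^2/2 - omega*x - beta*sqrt(x) over B_- = [a, delta^2].

    Since q''_-(x) = 1 + (beta/4) x^{-3/2} > 1, q_- is strongly convex on
    (0, inf); per Proposition 2 the minimizer over B_- is the projection onto
    B_- of the unique positive root z of q'_-(z) = z - omega - beta/(2*sqrt(z)).
    The root is located by bisection (requires beta > 0)."""
    dq = lambda z: z - omega - beta / (2.0 * z ** 0.5)
    lo, hi = 1e-16, max(omega, 0.0) + beta + 1.0   # dq(lo) < 0 < dq(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dq(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    z = 0.5 * (lo + hi)
    return min(delta ** 2, max(a, z))              # projection onto [a, delta^2]
```

When the unconstrained root z lies above \(\delta ^2\), the returned minimizer is clipped to the endpoint \(\delta ^2\), exactly as \(\varPi _{B_-}\) prescribes.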
1.4 Proof of Proposition 3
Proof
The first and the second derivatives of \(q_+(x)\) are
It is easy to verify that for \(x \ge \delta ^2\) and \(\beta < 4 \delta ^3\)
which implies that \(q_+(x)\) is strictly convex on \([\delta ^2, \infty )\).
We consider two cases. Case 1: \(\tau \ge 0\). This implies \(\omega \le 3 u^{2/3}\). It follows that for \(x>0\)
This implies that \(q_+(x)\) is non-decreasing and hence \(x_+^* = \delta ^2\).
Case 2: \(\tau <0\), which implies \(\omega > 3 u^{2/3}\). Consider the problem:
We will apply Lemma 3 to the problem (62) and show that exactly one of its two positive stationary points falls within the interval \([\delta ^2, \infty )\). We will further show that this stationary point is defined by (46) for the case \(\tau <0\). Since \(q_+(x)\) is convex over this interval, the optimal solution of the problem (45) is just the projection of this stationary point onto the interval \(B_+ =[\delta ^2, b]\). This would complete the proof.
Comparing the problem (62) with the problem (32), the corresponding quantities are
It is obvious that \(t = \omega > 3 u^{2/3} = 3\bar{x}\) (so the condition of Lemma 3 is satisfied). Lemma 3 implies that the problem (62) has two positive stationary points, which must satisfy the optimality condition \(q'_+(\hat{x}) = 0\), leading to
Letting \(\hat{y} := \sqrt{\hat{x}}\), we then have
This is the well-known depressed cubic equation, whose solution (i.e., the Cardan formula) has a long history [7, Chp. 7].
Since \(\omega > 3 u^{2/3}\), it follows from the Cardan formula (in terms of the trigonometric functions, see [51, Sect. 3]) that (63) has three real roots, namely
with \(\cos (\theta ) = -uv^{-3/2}\). Moreover, the three roots satisfy \(\hat{y}_1\ge \hat{y}_2 \ge \hat{y}_3\). According to Lemma 3, two of them are positive; that is, \(\hat{y}_1 >0\), \(\hat{y}_2 >0\) and
Since \(\beta < 4 \delta ^3\), we have
Therefore, \( \hat{y}_1^2\) is the only stationary point that falls within the interval \([\delta ^2, \infty )\). Since \(q_+(x)\) is strictly convex there, the minimizer of the problem (45) must be the projection of \(\hat{y}_1^2\) onto the interval \(B_+\). Hence, for Case 2, we must have \( x^*_+ = \varPi _{B_+} ( \hat{y}_1^2). \) The proof is completed by noting that \(\hat{y}_1^2\) is exactly \(x^+_{\omega , \beta }\) defined in (46) for the case \(\tau <0\). \(\square \)
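The trigonometric form of the Cardan formula invoked above can be sketched as follows. We assume the depressed cubic is \(\hat{y}^3 - \omega \hat{y} + \beta /2 = 0\) (consistent with \(q'_+(\hat{x}) = \hat{x}-\omega +\beta /(2\sqrt{\hat{x}}) = 0\) and \(\hat{y}=\sqrt{\hat{x}}\)) and that \(u = \beta /4\), \(v = \omega /3\), so that \(\cos (\theta ) = -uv^{-3/2}\) as in the proof; these identifications are inferred from the surrounding text, not quoted from the elided display.

```python
import math

def cubic_roots_trig(omega, beta):
    """Three real roots of y^3 - omega*y + beta/2 = 0 via the trigonometric
    Cardan formula, assuming u = beta/4 and v = omega/3.
    Requires omega > 3*u**(2/3), which guarantees |u * v**(-1.5)| < 1
    and hence three real roots."""
    u, v = beta / 4.0, omega / 3.0
    theta = math.acos(-u * v ** (-1.5))
    roots = [2.0 * math.sqrt(v) * math.cos((theta - 2.0 * math.pi * k) / 3.0)
             for k in range(3)]
    return sorted(roots, reverse=True)          # y1 >= y2 >= y3
```

One can check that the roots sum to zero (the cubic has no quadratic term) and that, as the proof asserts, exactly two of them are positive, with \(\hat{y}_1^2\) giving the stationary point of interest.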
1.5 Proof of Theorem 1
Proof
(i) is a direct consequence of Propositions 2 and 3. We now prove (ii). For any \(x,y>0\) and any \(\xi _x \in \partial q(x)\), it follows that
and
where the first inequality above used Lemma 1 and the last inequality used the fact \(4\delta ^3>\beta >0\). Swapping the role of x and y above yields
Therefore, we have
This, together with Theorem 12.17 of [42], proves that q(x) is strictly convex over [a, b]. The rest of (ii) is just the first-order optimality condition of the convex optimization problem (41), since we have just proved the convexity of q(x) over [a, b]. Finally, we prove (iii). It follows from (42) that
and from Propositions 2 and 3 that
Therefore,
This finishes the proof. \(\square \)
1.6 Proof of Proposition 4
Proof
(i) Let us pick a pair (i, j) such that \(W_{ij} > 0\), which implies \(\delta _{ij}>0\) (Assumption 2). It follows from (50) that
where \(Z_K^{k-1} := - \varPi _{\mathcal{K}^n_+(r)} (-D^{k-1}) \). Since \(\mathcal{B}\) is bounded (Assumption 1) and \(D^k \in \mathcal{B}\), the sequence \(\{D^k\}\) is bounded. Lemma 2 implies
which further implies \(| (Z_K^{k-1})_{ij} | \le 2\Vert U\Vert \) for all \(k=1,2,\ldots \). Let \(\beta _{ij} := W_{ij}/\rho \). Then
owing to \(\rho > \rho _o (W,\varDelta )\). It follows from Theorem 1(iii) that there exists \(\kappa _{ij}>0\) such that \(D^k_{ij} \ge \kappa _{ij}\) for all \(k=1,2,\ldots \). The choice of \(c_1\) given by
satisfies the bound in (i).
(ii) We write f(D) in terms of \(D_{ij}\):
We let \(\partial _{ij} f(D)\) denote the subdifferential of f with respect to its (i, j)th element \(D_{ij}\). We consider two cases. Case 1: \(W_{ij}=0\). This implies that f(D) is a constant function (\(\equiv 0\)) of \(D_{ij}\) and hence f(D) is continuously differentiable with respect to \(D_{ij}\). Consequently, \(\partial _{ij} f(D^k) = \left\{ 0 \right\} \).
Case 2: \(W_{ij}>0\), which implies \(\delta _{ij} >0\) (Assumption 2). It follows from (i) that there exists \(c_1>0\) such that \(D^k_{ij} \ge c_1\) for all \(k=1,2,\ldots \). Equations (64) and (25) yield
which implies that for any \(\xi ^k_{ij} \in \partial _{ij} f(D^k)\) there exists \(\zeta ^k_{ij} \in {\mathrm {sign}}((D^k_{ij})^{1/2}-\delta _{ij})\) such that
In other words, \(\partial _{ij} f(D^k)\) is bounded by \(W_{ij}/\sqrt{c_1}\), which is independent of the index k. It follows directly from the definition of subdifferential [42, Chp. 8.3] that
in the sense that for any \(\varGamma ^k \in \partial f(D^k)\), there exist \(\xi ^k_{ij} \in \partial _{ij} f(D^k)\) such that
Consequently, we have for all \(k=1,2,\ldots \),
This completes the proof for (ii).
(iii) Since \(\rho > \rho _o\), for each pair (i, j) we have \(\beta _{ij} := W_{ij}/\rho < 4 \delta _{ij}^3\). It then follows from Theorem 1(ii) that each separable function \(f^k_{ij}(D_{ij})\) is convex and hence the function \(f^k_\rho (D)\) is convex over \(D \in \mathcal{B}\). Consequently, subproblem (48) is convex. The first-order necessary and sufficient optimality condition is just (52). \(\square \)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Zhou, S., Xiu, N. & Qi, HD. Robust Euclidean embedding via EDM optimization. Math. Prog. Comp. 12, 337–387 (2020). https://doi.org/10.1007/s12532-019-00168-0