1 Introduction

This paper aims to propose an efficient numerical method for the most challenging problem in the Multi-Dimensional Scaling (MDS) family, which has found many applications in social and engineering sciences [6, 10]. The problem is known as the Robust Euclidean Embedding, a term borrowed from [8]. In the following, we first describe the problem and its three variants. We then explain our approach and main contribution. We will postpone the relevant literature review to the next section in order to shorten the introduction.

1.1 Problem description

The problem can be described as follows. Suppose we are given some dissimilarity measurements (e.g., noisy distances), collectively denoted as \(\delta _{ij}\), for some pairs (i, j) among n items. The problem is to find a set of n points \(\mathbf{x}_i \in \mathfrak {R}^r\), \(i=1,\ldots , n\) such that

$$\begin{aligned} d_{ij} := \Vert \mathbf{x}_i - \mathbf{x}_j \Vert \ \approx \ \delta _{ij}\quad \ (i, j) \in \mathcal{E}, \end{aligned}$$
(1)

where \(\Vert \mathbf{x}\Vert \) is the Euclidean norm (i.e., \(\ell _2\) norm) in \(\mathfrak {R}^r\) and \(\mathcal{E}\) is the set of the pairs (i, j) whose dissimilarities \(\delta _{ij} >0\) are known (\(\mathcal{E}\) can be thought of as the edge set if we treat \(\delta _{ij}\) as a weighted edge distance between vertex i and vertex j, resulting in a weighted graph). Throughout, we use “\(:=\)” or “\(=:\)” to mean “define”. The space \(\mathfrak {R}^r\) is called an embedding space and it is most interesting when r is small (e.g., \(r=2, 3\) for data visualization). One may also try to find a set of embedding points such that:

$$\begin{aligned} D_{ij} := \Vert \mathbf{x}_i - \mathbf{x}_j \Vert ^2 \ \approx \ \delta ^2_{ij} \quad \ (i, j) \in \mathcal{E}. \end{aligned}$$
(2)

A great deal of effort has been made to seek the best approximation from (1) or (2). The most robust criterion to quantify the best approximation is the Robust Euclidean Embedding (REE) defined by

$$\begin{aligned} \min _{X} \ f^{(d,1)}(\mathbf{x}_1, \ldots , \mathbf{x}_n) := \sum _{i,j=1}^n W_{ij} |d_{ij} - \delta _{ij} |, \end{aligned}$$
(3)

where \(W_{ij} >0\) if \(\delta _{ij} >0\) and \(W_{ij} \ge 0\) otherwise (\(W_{ij}\) can be treated as a weight for the importance of \(\delta _{ij}\)), and \(X := [\mathbf{x}_1, \ldots , \mathbf{x}_n]\) with each \(\mathbf{x}_i\) being a column vector. In [1, 8], Problem (3) was referred to as a robust variant of MDS and is denoted as rMDS. We will reserve rMDS for the Robust MDS problem:

$$\begin{aligned} \min _{X} \ f^{(D,1)}(\mathbf{x}_1, \ldots , \mathbf{x}_n) := \sum _{i,j=1}^n W_{ij} |D_{ij} - \delta ^2_{ij} |. \end{aligned}$$
(4)

The reference rMDS for the problem (4) is more appropriate because it involves the squared distances \(D_{ij}\), which are used by the classical MDS [22, 29, 43, 49, 53]. The preceding two problems are robust because of the robustness of the \(\ell _1\) norm used to quantify the errors [31, Sect. IV].

When the least-squares criterion is applied to (1), we have the popular model known as Kruskal’s stress minimization [30]:

$$\begin{aligned} \min _{X} \ f^{(d,2)}(\mathbf{x}_1, \ldots , \mathbf{x}_n) := \sum _{i,j=1}^n W_{ij} \Big ( d_{ij} - \delta _{ij} \Big )^2, \end{aligned}$$
(5)

Similarly, when the least-squares criterion is applied to (2), we get the so-called squared stress [6]:

$$\begin{aligned} \min _{X} \ f^{(D,2)}(\mathbf{x}_1, \ldots , \mathbf{x}_n) := \sum _{i,j=1}^n W_{ij} \Big ( D_{ij} - \delta ^2_{ij} \Big )^2, \end{aligned}$$
(6)

In many applications such as molecular conformation [21], lower and upper bound data on the distances can also be collected:

$$\begin{aligned} L_{ij} \le D_{ij} \le U_{ij}, \ \ \forall \ (i,j), \end{aligned}$$
(7)

where \(0 \le L_{ij} \le U_{ij}\). In applications such as nonlinear dimensionality reduction [47] and sensor network localization [44, 54], upper bounds \(U_{ij}\) can be computed by the shortest path distances and \(L_{ij}\) are simply set to be zero.

According to [8, Sect. 5.1], all of those problems are NP-hard. However, some problems are computationally more “difficult” to solve than others. The most challenging one, which is also the main focus of this paper, is the problem (3) with/without the constraint (7). The difficulty comes from the nonsmooth \(\ell _1\) norm and from the use of the (non-squared) distance terms \(d_{ij}\). All other problems either involve the squared distances \(D_{ij}\) or the squared \(\ell _2\) norm, which make them “easier” to approximate. We will explain the reasons in the literature review part.

In contrast to the other three problems, efficient methods for the REE problem (3) are lacking. One of the earliest computational papers that discusses this problem is Heiser [23], which was followed up by [28], where the Huber smoothing function was used to approximate the \(\ell _1\) norm near zero with a majorization technique. It was emphasized in [28] that “the function is not differentiable at its minimum and is hard to majorize, leading to a degeneracy that makes the problem numerically unstable”. Another important method is the PlaceCenter (PC for short) algorithm studied in [1]. We will compare with it in the numerical part. The difficulty in solving (3) is also well illustrated by a sophisticated Semi-definite Programming (SDP) approach in [35, Sect. IV] (see the literature review part). We now describe our approach proposed in this paper.

1.2 Our approach and main contributions

Our approach heavily makes use of the concept of Euclidean Distance Matrix (EDM). We need some notation. Let \(\mathcal{S}^n\) denote the space of all \(n \times n\) symmetric matrices, endowed with the standard inner product. The induced norm is the Frobenius norm, denoted by \(\Vert A\Vert \) for \(A \in \mathcal{S}^n\). The (ij)th element of \(A \in \mathcal{S}^n\) is often written as \(A_{ij}\). Let \(\mathcal{S}^n_+\) be the cone of positive semidefinite matrices in \(\mathcal{S}^n\) and we write \(A \succeq 0\) for \(A \in \mathcal{S}^n_+\). A matrix \(D \in \mathcal{S}^n\) is called an EDM if there exists a set of points \(\mathbf{x}_i \in \mathfrak {R}^r\), \(i=1,2\ldots , n\) such that the (ij)th element of D is given by \(D_{ij} := \Vert \mathbf{x}_i - \mathbf{x}_j \Vert ^2\), \(i, j=1, \ldots , n\). The smallest dimension r is called the embedding dimension of D and \(r={\mathrm {rank}}(JDJ)\), where \(J := I - \frac{1}{n} \mathbf{1}\mathbf{1}^T\) is known as the centring matrix with I being the identity matrix in \(\mathcal{S}^n\) and \(\mathbf{1}\) being the vector of all ones in \(\mathfrak {R}^n\). We use \(\mathcal{D}^n\) to denote the set of all Euclidean distance matrices of size \(n \times n\).

A very useful characterization for \(D \in \mathcal{D}^n\) [22, 49] is

$$\begin{aligned} {\mathrm {diag}}(D) = 0 \qquad \hbox {and} \qquad -(JDJ) \succeq 0. \end{aligned}$$
(8)

This result shows that \(\mathcal{D}^n\) is a closed and convex cone. Moreover, a set of embedding points can be generated by the classical MDS method [22, 43, 49, 53]:

$$\begin{aligned} \left[ \mathbf{x}_1, \mathbf{x}_2, \ldots , \mathbf{x}_n \right] = {\mathrm {diag}}(\sqrt{\lambda _1}, \sqrt{\lambda _2}, \ldots , \sqrt{\lambda _r}) \left[ \mathbf{p}_1, \mathbf{p}_2, \ldots , \mathbf{p}_r \right] ^T, \end{aligned}$$
(9)

where the eigenvalues \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _r >0\) and the corresponding eigenvectors \(\mathbf{p}_1, \mathbf{p}_2, \ldots , \mathbf{p}_r\) are from the eigen-decomposition:

$$\begin{aligned} - \frac{1}{2} (JDJ) = \left[ \mathbf{p}_1, \mathbf{p}_2, \ldots , \mathbf{p}_r \right] {\mathrm {diag}}( \lambda _1, \lambda _2, \ldots , \lambda _r ) \left[ \mathbf{p}_1, \mathbf{p}_2, \ldots , \mathbf{p}_r \right] ^T \end{aligned}$$
(10)

with \(r = {\mathrm {rank}}(JDJ)\). Therefore, the REE problem (3) with the constraint (7) can be reformulated in terms of EDM as

$$\begin{aligned} \begin{array}{ll} \min _{D} &{} f(D) := \sum _{i,j=1}^n W_{ij} | \sqrt{D_{ij}} - \delta _{ij} | = \Vert W \circ ( \sqrt{D} - \varDelta ) \Vert _1 \\ \hbox {s.t.} &{} D \in \mathcal{D}^n, \ \ {\mathrm {rank}}(JDJ) \le r \\ &{} D \in \mathcal{B}:= \left\{ A \ | \ L \le A \le U \right\} , \end{array} \end{aligned}$$
(11)

where “\(\circ \)” is the Hadamard product for matrices (i.e., \(A \circ B = (A_{ij}B_{ij})\)), \(\sqrt{D}\) is the elementwise square root of D, \(\varDelta _{ij} := \delta _{ij}\), and \(\Vert \cdot \Vert _1\) is the \(\ell _1\) norm. Once we obtain an optimal solution of (11), we use (9) and (10) to generate the required embedding points.
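For readers who prefer code, the following NumPy sketch (illustrative only; the authors' package is in MATLAB) shows how (9) and (10) recover embedding points from a given, possibly only approximately Euclidean, distance matrix D. The function name cmds_embed is ours.

```python
import numpy as np

def cmds_embed(D, r):
    """Classical MDS: recover n points in R^r from an (approximate) EDM D
    via the eigen-decomposition (10) and the embedding formula (9)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centring matrix J = I - (1/n) 1 1^T
    B = -0.5 * (J @ D @ J)                        # -(1/2) JDJ
    lam, P = np.linalg.eigh(B)                    # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:r]               # indices of the r largest eigenvalues
    lam_r = np.clip(lam[idx], 0.0, None)          # keep only the nonnegative part
    return np.diag(np.sqrt(lam_r)) @ P[:, idx].T  # (9): column i is the point x_i
```

When D is only approximately Euclidean, clipping the negative eigenvalues corresponds to the PCA-style truncation used later in (27).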

The reformulation captures well the four difficulties in solving the REE problem (3).

  1. (i)

    The objective function f(D) is not convex. The term \(| \sqrt{D_{ij}} - \delta _{ij} |\) is convex when \(\delta _{ij}^2 > D_{ij}\) and concave otherwise.

  2. (ii)

    The objective function is nonsmooth. It is not differentiable at certain points due to the \(\ell _1\) norm and the square root operation involved.

  3. (iii)

    The objective function is not Lipschitzian. The Lipschitz constant goes to infinity as \(D_{ij}\) goes to zero. The implication is that the subdifferential of the objective function [42, Def. 8.3] may be unbounded. This would create a huge obstacle in establishing any convergence results of iterative algorithms for (11).

  4. (iv)

    The rank constraint is not convex and is hard to approximate. This is a common issue for any optimization problem with a rank constraint.

We note that no matter what reformulations one may use for (3), those four difficulties would appear in different forms and won’t go away. We also note that the other three problems, when reformulated in terms of EDM, have a convex objective function. This distinctive feature alone makes the problem (11) the most challenging one to solve.

Existing numerical experiments have shown that the MDS embedding (9) and (10) works well as long as D is close to a true EDM. A typical example is when the data sits on a lower-dimensional manifold [47]. Motivated by this, we are going to generate an approximate EDM instead of a true EDM in our algorithm. It follows from (8) that (also see [32, Theorem A]):

$$\begin{aligned} D \in \mathcal{D}^n \Longleftrightarrow {\mathrm {diag}}(D) = 0 \quad \hbox {and} \quad -D \in \mathcal{K}^n_+, \end{aligned}$$
(12)

where \(\mathcal{K}^n_+\) is known to be the conditionally positive semidefinite cone:

$$\begin{aligned} \mathcal{K}^n_+ := \left\{ A \in \mathcal{S}^n \ | \ \mathbf{v}^T A \mathbf{v}\ge 0, \ \forall \ \mathbf{v}\in \mathbf{1}^\perp \right\} \end{aligned}$$

and \(\mathbf{1}^\perp \) is the subspace in \(\mathfrak {R}^n\) orthogonal to \(\mathbf{1}\). The diagonal constraint in (12) can be integrated into the set \(\mathcal{B}\) with the choice \(L_{ii} = U_{ii} = 0\) for \(i=1,\ldots , n\). We combine \(\mathcal{K}^n_+\) with the rank constraint into the set \(\mathcal{K}^n_+(r)\):

$$\begin{aligned} \mathcal{K}^n_+(r) := \mathcal{K}^n_+ \cap \left\{ A \in \mathcal{S}^n \ | \ {\mathrm {rank}}(JAJ) \le r \right\} . \end{aligned}$$

We call it the conditionally positive semidefinite cone with the rank-r cut. Consequently, the constraints in (11) become \(-D \in \mathcal{K}^n_+(r)\) and \(D \in \mathcal{B}\).

Next, we quantify the feasibility of \(-D\) belonging to \(\mathcal{K}^n_+(r)\) as follows. Let \(\varPi _{{\mathcal{K}^n_+(r)}}^B(A)\) be the set of all nearest points in \({\mathcal{K}^n_+(r)}\) from a given matrix \(A \in \mathcal{S}^n\). That is

$$\begin{aligned} \varPi _{\mathcal{K}^n_+(r)}^B(A) : = {{\,\mathrm{argmin}\,}}\left\{ \Vert A - Y \Vert \ | \ Y \in {\mathcal{K}^n_+(r)}\right\} . \end{aligned}$$
(13)

Since \({\mathcal{K}^n_+(r)}\) is not convex (unless \(r \ge n-1\)), the projection \(\varPi _{\mathcal{K}^n_+(r)}^B(A)\) is a set instead of a single point. We let \(\varPi _{\mathcal{K}^n_+(r)}(A)\) be any element in \(\varPi _{\mathcal{K}^n_+(r)}^B(A)\) and define the function

$$\begin{aligned} g(A) := \frac{1}{2} \Vert A + \varPi _{\mathcal{K}^n_+(r)}(-A) \Vert ^2. \end{aligned}$$
(14)

Since g(A) is just half of the squared distance from \((-A)\) to \({\mathcal{K}^n_+(r)}\), it does not depend on which element \(\varPi _{\mathcal{K}^n_+(r)}(-A)\) is being used. It is easy to see that

$$\begin{aligned} -D \in {\mathcal{K}^n_+(r)}\qquad \hbox {if and only if} \qquad g(D) = 0. \end{aligned}$$

Hence, the problem (11) is equivalent to

$$\begin{aligned} \begin{array}{ll} \min _D &{} f(D) = \Vert W \circ ( \sqrt{D} - \varDelta ) \Vert _1 \\ \hbox {s.t.} &{} g(D) = 0, \ \ D \in \mathcal{B}. \end{array} \end{aligned}$$
(15)

This is a classical constrained optimization problem with an equality constraint and a simple box constraint. Therefore, the quadratic penalty method [34, Chp. 17] can be applied to get the following problem:

$$\begin{aligned} \begin{array}{ll} \min _D&f_\rho (D) := f(D) + \rho g(D), \qquad \hbox {s.t.} \ \ D \in \mathcal{B}, \end{array} \end{aligned}$$
(16)

where \(\rho >0\) is the penalty parameter. We refer to this problem as the penalized REE problem (PREE).

The quadratic penalty method is used often in practice [34, P. 497]. In fact, it is particularly suitable for (11) because it overcomes all four difficulties discussed above. We will need two more important tools to help us efficiently solve the penalty problem (16). One is the majorization technique that has recently become very popular in engineering sciences [46] (also see [6, Chp. 8] for its extensive use in MDS). Suppose we have the current iterate \(D^k\). We construct a majorization function \(g_m(D, D^k)\) for g(D) at \(D^k\) such that

$$\begin{aligned} g_m(D^k, D^k) = g(D^k) \qquad \hbox {and} \qquad g_m(D, D^k) \ge g(D) \quad \forall \ D \in \mathcal{S}^n. \end{aligned}$$
(17)

The majorization is constructed in such a way that it is easier to solve the majorized problem:

$$\begin{aligned} D^{k+1} = {{\,\mathrm{argmin}\,}}\left\{ f_\rho ^k (D) := f(D) + \rho g_m(D, D^k), \ \ D \in \mathcal{B}. \right\} \end{aligned}$$
(18)

It can be seen that

$$\begin{aligned} f_\rho (D^{k+1})&= f(D^{k+1}) + \rho g(D^{k+1}) \\&{\mathop {\le }\limits ^{\text {(17)}}} f(D^{k+1}) + \rho g_m(D^{k+1}, D^k) = f_\rho ^k (D^{k+1}) \\&{\mathop {\le }\limits ^{\text {(18)}}} f_\rho ^k (D^k) = f(D^k) + \rho g_m(D^k, D^k) = f(D^k) + \rho g(D^k) = f_\rho (D^k). \end{aligned}$$

Hence, the algorithm generates a sequence \(\{D^k\}\) that is nonincreasing in \(f_\rho (D)\). Since \(f_\rho (D)\) is bounded below by 0, the functional sequence \(\{f_\rho (D^k)\}\) converges. However, we are more concerned with where the iterate sequence \(\{ D^k\}\) converges. The second concern is how the subproblem (18) can be solved. This brings in the second technique, which is to solve the following one-dimensional problem:

$$\begin{aligned} \min _{x \in \mathfrak {R}} \left\{ q(x) := (1/2) (x-\omega )^2 + \beta |\sqrt{x}-\delta | \ \ | \ \ a \le x \le b \right\} , \end{aligned}$$
(19)

for given \(\delta >0\) and \(0 \le a \le b\). We will show that the solution of this problem will lead to a closed-form solution of (18).

Since our method is for the Penalized REE by EDM optimization, we call it PREEEDM. The major contribution of this paper is to make the outlined solution procedure water-tight. In particular, we will investigate the relationship between the PREE problem (16) and the original problem (11) in terms of the \(\epsilon \)-optimality (Proposition 1). We will also show that the majorization function \(g_m(\cdot , \cdot )\) can be economically constructed (Sect. 3.2). Moreover, the majorized function \(f_\rho ^k(D)\) is guaranteed to be convex provided that the penalty parameter is above a certain threshold, and the subdifferentials along the generated sequence are bounded (Proposition 4). Furthermore, each majorization subproblem has a closed-form solution (Theorem 1). We are also able to prove that any accumulation point of the sequence generated by PREEEDM is a stationary point of (16), which is also an approximate stationary point of (11) (Theorem 2). Built upon its solid convergence results and simple implementation, PREEEDM is demonstrated to be comparable to six state-of-the-art software packages in terms of solution quality and to outperform them in terms of computational time on a large number of test problems from sensor network localization and molecular conformation.

1.3 Organization of the paper

In the next section, we give a selective literature review mainly on the Semi-Definite Programming (SDP) and EDM approaches. In Sect. 3, we introduce some necessary background and prove a key technical result (Lemma 1) that is crucial to the convexity of the majorization subproblem. We study the relationship between the penalized REE (16) and the original REE in Sect. 4, where the majorized subproblem is shown to have a closed-form solution. In Sect. 5, we provide a complete set of convergence results for the proposed PREEEDM algorithm. Numerical experiments are included in Sect. 6. The paper concludes in Sect. 7. All proofs except that of Theorem 2 can be found in “Appendix”.

2 Literature review

One can find a thorough review on all of the four problems in [17] by France and Carroll, mainly from the perspective of applications. One can also find valuable discussion on some of those problems in [2]. So the starting point of our review is that those problems have their own reasons to be studied, and we are more concerned with how they can be efficiently solved.

Most existing algorithms can be put into three groups. The first group consists of alternating coordinate descent methods, whose main variables are \(\mathbf{x}_i\), \(i=1,\ldots ,n\). A famous representative in this group is the method of SMACOF for the stress minimization (5) [13, 14]. The key idea is to alternately minimize the function \(f^{(d, 2)}\) with respect to each \(\mathbf{x}_i\), while keeping the other points \(\mathbf{x}_j\) (\(j \not =i\)) unchanged; each minimization problem is relatively easier to solve by employing the technique of majorization. SMACOF has been widely used and the interested reader can refer to [6] for more references and to [54] for some critical comments on SMACOF when it is applied to the sensor network localization problem. The second and third groups consist, respectively, of SDP methods and EDM optimization methods. We will give a more detailed review of these two groups because of their close relevance to the method proposed in this paper. The main purpose of our review is to show that efficient numerical methods for the REE problem (3) are lacking.

2.1 On SDP approach

We note that each of the four objective functions either involves the Euclidean distance \(d_{ij}\) or its square \(D_{ij} = d_{ij}^2\). A crucial observation is that constraints on them often have SDP relaxations. For example, it is easy to see

$$\begin{aligned} D_{ij} = d_{ij}^2 = \Vert \mathbf{x}_i - \mathbf{x}_j \Vert ^2= & {} \Vert \mathbf{x}_i \Vert ^2 + \Vert \mathbf{x}_j \Vert ^2 - 2 \mathbf{x}_i^T \mathbf{x}_j \nonumber \\= & {} Y_{ii} + Y_{jj} - 2 Y_{ij}, \end{aligned}$$
(20)

where \(Y := X^TX \succeq 0\). Hence, the squared distance \(d^2_{ij}\) is a linear function of the positive semidefinite matrix Y. Consequently, the EDM cone \(\mathcal{D}^n\) can be described through linear transformations of positive semidefinite matrices. One can further relax the constraint \(Y=X^TX\) to \(Y \succeq X^TX\). By the Schur-complement, one has

$$\begin{aligned} Z := \left[ \begin{array}{ll} Y &{} X^T \\ X &{} I_r \end{array} \right] \succeq 0 \ \ \hbox {has rank} \ r \Longleftrightarrow Y = X^TX. \end{aligned}$$
(21)

By dropping the rank constraint, the robust MDS problem (4) can be relaxed to an SDP, which was initiated by Biswas and Ye [15].

For the Euclidean distance \(d_{ij}\), we introduce a new variable \(T_{ij}=d_{ij}\). One may relax this constraint to \(T_{ij} \le d_{ij}\), which has an SDP representation:

$$\begin{aligned} T_{ij}^2 \le d_{ij}^2 = D_{ij} \quad \Longleftrightarrow \quad \left[ \begin{array}{ll} 1 &{} \ T_{ij} \\ T_{ij} &{}\ D_{ij} \end{array} \right] \succeq 0. \end{aligned}$$
(22)

Combining (20), (21) and (22) leads to a large number of SDP relaxations. Typical examples, for the robust MDS problem (4), are the SDP relaxation method [5] and the edge-based SDP relaxation methods [38, 50] and [27], the latter leading to the comprehensive Matlab package SFSDP. For the squared stress (6), one may refer to [16, 25]. For the stress problem (5), a typical SDP relaxation can be found in [35, Problem (8)]. However, unlike the problems (4), (5) and (6), the REE problem (3) does not have a straightforward SDP relaxation. We use an attempt made in [35] to illustrate this point below.

First, it is noted that problem (3) can be written in terms of EDM:

$$\begin{aligned} \begin{array}{ll} \min &{} \sum \nolimits _{i,j=1}^n W_{ij} | \sqrt{D_{ij}} - \delta _{ij} | \\ \hbox {s.t.} &{} D \in \mathcal{D}^n, \ \ {\mathrm {rank}}(JDJ) \le r. \end{array} \end{aligned}$$

The term \(| \sqrt{D_{ij}} - \delta _{ij} |\) is convex if \(\delta _{ij} > \sqrt{D_{ij}}\) and is concave otherwise. A major obstacle is how to efficiently deal with the concavity in the objective.

Secondly, by dropping the rank constraint and through certain linear approximation to the concave term, an SDP problem is proposed for (3) (see [35, Eq. (20)]):

$$\begin{aligned} \begin{array}{ll} \min _{D, T \in \mathcal{S}^n} &{} \langle W, \; T \rangle \\ \hbox {s.t.} &{} (\delta _{ij} - T_{ij})^2 \le D_{ij}, \qquad (i,j) \in \mathcal{E}\\ &{} a_{ij} D_{ij} + b_{ij} \le T_{ij}, \qquad (i,j) \in \mathcal{E}\\ &{} D \in \mathcal{D}^n, \end{array} \end{aligned}$$
(23)

where the quantities \(a_{ij}\) and \(b_{ij}\) can be computed from \(\delta _{ij}\). We note that each quadratic constraint in (23) is equivalent to a positive semidefinite constraint on \(\mathcal{S}^2_+\) and \(D \in \mathcal{D}^n\) is a semidefinite constraint on \(\mathcal{S}^n_+\) by (8). Therefore, the total number of the semidefinite constraints is \(|\mathcal{E}| + 1\). Finally, the optimal solution of (23) is then refined through a second-stage algorithm (see [35, Sect. IV(B)]). Both stages of the algorithmic scheme above would need sophisticated implementation skills, and its numerical efficiency and solution quality are yet to be confirmed due to the two relaxations employed (dropping the rank constraint and the linearization). The lack of efficient algorithms for (3) motivated our research in this paper.

2.2 On EDM approach

A distinguishing feature from the SDP approach is that the EDM approach treats the EDM D as the main variable, without having to rely on its SDP representation. This approach works because of the characterization (12) and the fact that the orthogonal projection onto \(\mathcal{K}^n_+\) has a closed-form formula [19, 20]. Several methods are based on this formula. The basic model for this approach is the so-called nearest EDM problem:

$$\begin{aligned} \min _{D \in \mathcal{S}^n} \ \Vert D - \varDelta ^{(2)} \Vert ^2 \qquad \hbox {s.t.} \ \ {\mathrm {diag}}(D) = 0 \ \ \hbox {and} \ \ -D \in \mathcal{K}^n_+, \end{aligned}$$
(24)

which is a convex relaxation of (6) with the special choice \(W_{ij} \equiv 1\). Here the elements of the matrix \(\varDelta ^{(2)}\) are given by \(\varDelta ^{(2)}_{ij} := \delta _{ij}^2\). The relaxation is obtained by dropping the rank constraint \({\mathrm {rank}}(JDJ) \le r\). Since the constraints of (24) are the intersection of a subspace and a convex cone, the method of alternating projections was proposed in [19, 20] with applications to molecular conformation [21]. Newton’s method for (24) was developed in [39]. Extensions of Newton’s method for the model (24) with more constraints, including general weights \(W_{ij}\), the rank constraint \({\mathrm {rank}}(JDJ) \le r\) or the box constraints (7), can be found in [3, 11, 40]. A recent application of the model (24) with a regularization term to Statistics is [55], where the problem is solved by an SDP, similar to that proposed by Toh [48].

There are two common features in this class of methods. One is that they require the objective function to be convex, which is true for the problems (4), (5) and (6) when formulated in EDM. The second feature is that the nonconvexity is only caused by the rank constraint. However, as already seen in Sect. 1.2, the REE problem (3) in terms of EDM has a nonconvex objective coupled with the distances \(d_{ij}\) (not the squared distances) being used. This has caused various difficulties in applying the existing EDM-based methods mentioned above to (3). A recent work [56] by the authors has tried to extend the EDM approach to the stress minimization problem (5) along a similar line as outlined in Sect. 1.2. Once again, we emphasize that the key difference between the problems (3) and (5) is about nonconvex objective vs convex objective and non-differentiability vs differentiability. Hence, the problem (3) is significantly more difficult to solve than (5). Nevertheless, we will show that it can be efficiently solved by the proposed EDM optimization.

3 Background and technical lemmas

In this part, we introduce the necessary background about subgradients and positive roots of a special depressed cubic equation. In particular, we will prove a technical result about a composite function between the absolute value and the square root functions. This result (Lemma 1) is in the style of a Taylor expansion for differentiable functions.

3.1 Subgradients of functions

An important function appearing in our EDM reformulation (11) of the REE problem (3) is \(\phi _{\delta }(\cdot ): \mathfrak {R}_+ \mapsto \mathfrak {R}_+\) defined for a given constant \(\delta >0\) by

$$\begin{aligned} \phi _{\delta }(x) := | \sqrt{x} - \delta |, \qquad \forall \ x \ge 0, \end{aligned}$$

where \(\mathfrak {R}_+\) is the set of all nonnegative numbers. We will need to compute its subgradient in the sense of Rockafellar and Wets [42].

Definition 1

[42, Def. 8.3] Consider a function \(f: \mathfrak {R}^n \mapsto \mathfrak {R}\cup \{-\infty , + \infty \}\) and a point \(\bar{\mathbf{x}}\) with \(f(\bar{\mathbf{x}})\) finite. For a vector \(\mathbf{v}\in \mathfrak {R}^n\), one says that

  1. (a)

    \(\mathbf{v}\) is a regular subgradient of f at \(\bar{\mathbf{x}}\), written \(\mathbf{v}\in \widehat{\partial } {f}(\bar{\mathbf{x}})\), if

    $$\begin{aligned} f(\mathbf{x}) \ge f(\bar{\mathbf{x}}) + \langle \mathbf{v}, \; \mathbf{x}- \bar{\mathbf{x}} \rangle + o( \Vert \bar{\mathbf{x}}- \mathbf{x}\Vert ), \end{aligned}$$

    where the little ‘o’ term is a short-hand notation for the one-sided limit condition:

    $$\begin{aligned} \liminf _{ {\mathop {\mathbf{x}\not = \bar{\mathbf{x}} }\limits ^{ \mathbf{x}\rightarrow \bar{\mathbf{x}} }} } \frac{f(\mathbf{x}) - f(\bar{\mathbf{x}}) - \langle \mathbf{v}, \; \mathbf{x}- \bar{\mathbf{x}} \rangle }{\Vert \mathbf{x}- \bar{\mathbf{x}} \Vert } \ge 0; \end{aligned}$$
  2. (b)

    \(\mathbf{v}\) is a (general) subgradient of f at \(\bar{\mathbf{x}}\), written \(\mathbf{v}\in \partial {f}(\bar{\mathbf{x}})\), if there are sequences \(\mathbf{x}^\nu \rightarrow \bar{\mathbf{x}}\) with \(f(\mathbf{x}^\nu ) \rightarrow f(\bar{\mathbf{x}})\) and \(\mathbf{v}^\nu \in \widehat{\partial } f(\mathbf{x}^\nu )\) with \(\mathbf{v}^\nu \rightarrow \mathbf{v}\).

We call \(\partial f(\bar{\mathbf{x}})\) the subdifferential of f at \(\bar{\mathbf{x}}\). For a given number \(x \in \mathfrak {R}\), we define its sign by

$$\begin{aligned} {\mathrm {sign}}(x) := \left\{ \begin{array}{ll} \{1 \} &{} \ \hbox {if} \ x > 0 \\ { [-1, 1] } &{} \ \hbox {if} \ x = 0 \\ \{-1\} &{} \ \hbox {if} \ x < 0. \end{array} \right. \end{aligned}$$

Apparently, \(\phi _\delta (x)\) is continuous for \(x>0\) and its subdifferential at \(x>0\) is given by directly applying Def. 1 (note \(\delta >0\))

$$\begin{aligned} \partial \phi _{\delta }(x) = \frac{{\mathrm {sign}}(\sqrt{x} - \delta )}{2\sqrt{x}} \qquad \hbox {for} \ x >0. \end{aligned}$$
(25)

We note that the subdifferential of \(\phi _{\delta } (x)\) at \(x=0\) is more complicated to describe. Fortunately, we won’t need it in our analysis. We state our key lemma below.

Lemma 1

Let \(\delta >0\) be given. It holds

$$\begin{aligned} \phi _\delta (y)\ge \phi _\delta (x)+\zeta (y-x) -\frac{(x - y)^2}{8\delta ^3} , \quad \forall \ x>0,\ y>0, \ \zeta \in \partial \phi _{\delta }(x). \end{aligned}$$

Compared with the definition of a classical convex function, the inequality above has an extra negative quadratic term, which is not sufficient for \(\phi _{\delta } (x)\) to be convex on \((0,+\infty )\) (actually, \(\phi _{\delta } (x)\) is quasi-convex). Fortunately, this property allows us to make the function \(\phi _{\delta }+\mu \varphi \) convex provided \(\varphi \) is strongly convex and \(\mu \) is chosen properly. And this is one of the key ideas developed in this paper.
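A small numerical check of Lemma 1 and of the subgradient formula (25), on randomly drawn positive points (a sketch; the constant delta and the sampling range are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 1.5
phi = lambda x: abs(np.sqrt(x) - delta)                   # phi_delta(x) = |sqrt(x) - delta|
for _ in range(10000):
    x, y = rng.uniform(0.01, 10.0, size=2)
    zeta = np.sign(np.sqrt(x) - delta) / (2*np.sqrt(x))   # an element of (25)
    # Lemma 1: phi(y) >= phi(x) + zeta*(y - x) - (x - y)^2 / (8*delta^3)
    assert phi(y) >= phi(x) + zeta*(y - x) - (x - y)**2/(8*delta**3) - 1e-12
```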

3.2 Construction of the majorization function

A major building block in our algorithm is the majorization function \(g_m(D, D^k)\) at a given point \(D^k\) for the function g(A) defined in (14). We construct it below.

Suppose \(A \in \mathcal{S}^n\) has the following eigenvalue-eigenvector decomposition:

$$\begin{aligned} A = \lambda _1 \mathbf{p}_1 \mathbf{p}_1^T + \lambda _2 \mathbf{p}_2 \mathbf{p}_2^T + \cdots + \lambda _n \mathbf{p}_n \mathbf{p}_n^T, \end{aligned}$$
(26)

where \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _n\) are the eigenvalues of A in non-increasing order, and \(\mathbf{p}_i\), \(i=1, \ldots , n\) are the corresponding orthonormal eigenvectors. We define a PCA-style matrix truncated at r:

$$\begin{aligned} {\mathrm {PCA}}_r^+ (A) := \sum _{i=1}^r \max \{0, \lambda _i\} \mathbf{p}_i \mathbf{p}_i^T. \end{aligned}$$
(27)

Recall the definition of \(\varPi ^B_{{\mathcal{K}^n_+(r)}}(A)\) in (13). We let \(\varPi _{{\mathcal{K}^n_+(r)}}(A)\) be an element of \(\varPi ^B_{{\mathcal{K}^n_+(r)}}(A)\) and note that the function g(A) in (14) does not depend on the choice of \(\varPi _{{\mathcal{K}^n_+(r)}}(A)\). As seen from the known results below, one particular element \(\varPi _{{\mathcal{K}^n_+(r)}}(A)\) can be computed through \({\mathrm {PCA}}_r^+(A)\).

Lemma 2

For a given matrix \(A \in \mathcal{S}^n\) and an integer \(r \le n\), the following results hold.

  1. (i)

    [40, Eq. (22), Proposition 3.3] One particular \(\varPi _{\mathcal{K}^n_+(r)}(A)\) can be computed through

    $$\begin{aligned} \varPi _{\mathcal{K}^n_+(r)}(A) = {\mathrm {PCA}}_r^+(JAJ) + (A -JAJ) \end{aligned}$$
    (28)
  2. (ii)

    [40, Eq. (26), Proposition 3.3] We have

    $$\begin{aligned} \langle \varPi _{\mathcal{K}^n_+(r)}(A), \ A - \varPi _{\mathcal{K}^n_+(r)}(A) \rangle = 0. \end{aligned}$$
    (29)
  3. (iii)

    [40, Proposition 3.4] The function

    $$\begin{aligned} h(A) := \frac{1}{2} \Vert \varPi _{\mathcal{K}^n_+(r)}(A) \Vert ^2 \end{aligned}$$

    is well defined and is convex. Moreover,

    $$\begin{aligned} \varPi _{\mathcal{K}^n_+(r)}(A) \in \partial h(A), \end{aligned}$$

    where \(\partial h(A)\) is the subdifferential of \(h(\cdot )\) at A.

  4. (iv)

    [56, Lemma 2.2] Let g(A) be defined in (14). We have for any \(A \in \mathcal{S}^n\)

    $$\begin{aligned} g(A) = \frac{1}{2} \Vert A\Vert ^2 - h(-A) \ \ \hbox {and} \ \ \Vert \varPi _{\mathcal{K}^n_+(r)} (A) \Vert \le 2 \Vert A\Vert . \end{aligned}$$
    (30)

Since \(h(\cdot )\) is convex and \(\varPi _{\mathcal{K}^n_+(r)}(A) \in \partial h(A)\) (Lemma 2(iii)), we have

$$\begin{aligned} h(-D) \ge h(-Z) + \langle \varPi _{\mathcal{K}^n_+(r)}(-Z), \; -D + Z \rangle \qquad \forall \ D, Z \in \mathcal{S}^n. \end{aligned}$$

This, with Lemma 2(iv), implies

$$\begin{aligned} g(D)&= (1/2) \Vert D\Vert ^2 - h(-D) \nonumber \\&\le (1/2) \Vert D\Vert ^2 - h(-Z) + \langle \varPi _{\mathcal{K}^n_+(r)}(-Z), \; D - Z \rangle \nonumber \\&= (1/2) \Vert D+\varPi _{\mathcal{K}^n_+(r)}(-Z)\Vert ^2+\langle \varPi _{\mathcal{K}^n_+(r)}(-Z),-Z- \varPi _{\mathcal{K}^n_+(r)}(-Z)\rangle \nonumber \\&\overset{\text {(29)}}{=} (1/2)\Vert D+\varPi _{\mathcal{K}^n_+(r)}(-Z)\Vert ^2 \nonumber \\&=: g_m(D, Z). \end{aligned}$$
(31)

It is straightforward to check that the function \(g_m(\cdot , \cdot )\) in (31) satisfies the majorization properties (17).
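To make the construction concrete, here is a NumPy sketch of the truncation (27), one particular projection (28), the function g in (14) and the majorization g_m in (31); the helper names (pca_r_plus, proj_Knr) are ours, not part of the PREEEDM package.

```python
import numpy as np

def pca_r_plus(A, r):
    """PCA-style truncation (27): positive part of the r largest eigenvalues."""
    lam, P = np.linalg.eigh(A)
    idx = np.argsort(lam)[::-1][:r]
    lam_r = np.clip(lam[idx], 0.0, None)
    return (P[:, idx] * lam_r) @ P[:, idx].T

def proj_Knr(A, r):
    """One particular element of the projection onto K^n_+(r), formula (28)."""
    n = A.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    JAJ = J @ A @ J
    return pca_r_plus(JAJ, r) + (A - JAJ)

def g(D, r):
    """g(D) = 0.5 * || D + Pi_{K^n_+(r)}(-D) ||^2, see (14)."""
    return 0.5 * np.linalg.norm(D + proj_Knr(-D, r))**2

def g_m(D, Z, r):
    """Majorization of g at Z, formula (31): 0.5 * || D + Pi_{K^n_+(r)}(-Z) ||^2."""
    return 0.5 * np.linalg.norm(D + proj_Knr(-Z, r))**2
```

One can verify numerically that g_m(D, Z, r) is at least g(D, r) with equality at D = Z, which is exactly (17).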

3.3 Positive roots of depressed cubic equations

In our algorithm, we will encounter the positive root of a depressed cubic equation [7, Chp. 7], which arises from the optimality condition of the following problem

$$\begin{aligned} \min _{x \ge 0} \ \ s(x) := (x-t)^2 + \nu \sqrt{x}, \end{aligned}$$
(32)

where \(\nu >0\) and \(t \in \mathfrak {R}\) are given. A positive stationary point x must satisfy the optimality condition

$$\begin{aligned} 0 = s'(x) = 2 (x-t) + \frac{\nu }{2 \sqrt{x}}. \end{aligned}$$
(33)

Let \(y := \sqrt{x}\). The optimality condition above becomes

$$\begin{aligned} 4y^3 - 4ty + \nu = 0. \end{aligned}$$

This is in the classical form of the so-called depressed cubic equation [7, Chp. 7]. Its roots (complex or real) and their computational formulae have a long history with fascinating and entertaining stories. A comprehensive revisit of this subject can be found in Xing [51] and a successful application of the depressed cubic equation to compressed sensing can be found in [36, 52]. The following lemma says that, under certain conditions, the Eq. (33) has two distinct positive roots and its proof is a specialization of [9, Lemma 2.1(iii)] when \(p =1/2\) therein.

Lemma 3

[9, Lemma 2.1(iii)] Consider the problem (32). Let

$$\begin{aligned} \bar{x} = (\nu /8)^{2/3} \qquad \hbox {and} \qquad \bar{t} = 3 \bar{x}. \end{aligned}$$

When \(t > \bar{t}\), s(x) has two different positive stationary points \(\hat{x}_1\) and \(\hat{x}_2\) satisfying

$$\begin{aligned} s'(x) = 0 \qquad \hbox {and} \qquad \hat{x}_1< \bar{x} < \hat{x}_2. \end{aligned}$$
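As a quick numerical illustration of Lemma 3 (a sketch; the values \(\nu = 1\) and \(t = 2\) are arbitrary, and here \(\bar{t} = 3(\nu /8)^{2/3} \approx 0.75 < t\)):

```python
import numpy as np

nu, t = 1.0, 2.0
coeffs = [4.0, 0.0, -4.0 * t, nu]              # 4 y^3 - 4 t y + nu = 0, with y = sqrt(x)
y_pos = sorted(r.real for r in np.roots(coeffs)
               if abs(r.imag) < 1e-10 and r.real > 0)
x_hat = [y**2 for y in y_pos]                  # two positive stationary points of s(x)
x_bar = (nu / 8.0)**(2.0 / 3.0)
assert len(x_hat) == 2 and x_hat[0] < x_bar < x_hat[1]
```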

4 Penalized REE model and its majorization subproblem

With the preparation above, we are ready to address our penalized REE problem (16) and its majorization subproblem (18). We first address the relationship between (16) and its original problem (11). We then show how the subproblem (18) is solved.

4.1 \(\epsilon \)-optimal solution and critical point

The classical results on penalty methods in [34] for the differentiable case (i.e., all functions involved are differentiable) are not applicable here. Recently, the penalty approach was studied by Gao in her PhD thesis [18] in the context of semidefinite programming, which motivated our investigation below. The main result is that (16) provides an \(\epsilon \)-optimal solution and an \(\epsilon \)-critical point for the original problem when the penalty parameter is above a certain threshold.

We first introduce the concept of critical point of (11) [i.e., (15)], which is a first order optimality condition. It is associated with the Lagrange function of (15) defined by

$$\begin{aligned} \mathcal L(D,\beta ):=f(D) + \beta g(D)+I_{\mathcal{B}}(D), \end{aligned}$$
(34)

where \(\beta \in \mathfrak {R}\) is the Lagrangian multiplier and \(I_{\mathcal{B}}(D)\) is the indicator function taking value 0 if \(D\in \mathcal{B}\) and \(+\infty \) otherwise. We say \(\overline{D} \in \mathcal{S}^n\) is a critical point of (15) if it satisfies \(0\in \partial \mathcal L(\bar{D}, \bar{\beta })\) for some \(\bar{\beta } \in \mathfrak {R}\), namely, there exist \(\overline{\varGamma } \in \partial f(\overline{D})\) and \(\overline{C} \in \hbox {Conv}(\varPi _{\mathcal{K}^n_+(r)}^B(- \overline{D}))\) such that

$$\begin{aligned} g(\overline{D} ) = 0 \qquad \hbox {and} \qquad \left\langle \overline{ \varGamma } + {\bar{\beta }} \overline{D} + {\bar{\beta }} \overline{C},\ D-\overline{D} \right\rangle \ge 0,~~\forall ~D\in \mathcal{B}\end{aligned}$$

where we used the following facts: \(\partial g(D)= D + \hbox {Conv}(\varPi _{\mathcal{K}^n_+(r)}^B(- D))\) (see [40, Proposition 3.4]; \(\hbox {Conv}(\varOmega )\) denotes the convex hull of a given set \(\varOmega \)), and the sub-differential of the indicator function of a convex set \(\varOmega \) is its normal cone, namely \(\partial I_\varOmega (x)=N_{\varOmega }(x)=\{y:~\langle y, z-x\rangle \le 0,~\forall ~z\in \varOmega \}. \) We will see that the choice \(\overline{C} = \varPi _{\mathcal{K}^n_+(r)}(-\overline{D})\) is enough for our analysis and hence it is used in our definition of approximate critical point.

Definition 2

  1. (i)

    (\(\epsilon \)-Optimal solution) Suppose \(D^*\) is an optimal solution of (11). For a given error tolerance \(\epsilon >0\), a point \(\widehat{D}\) is called an \(\epsilon \)-optimal solution of (11) if it satisfies

    $$\begin{aligned} \widehat{D} \in \mathcal{B}, \quad g(\widehat{D}) \le \epsilon \quad \hbox {and} \quad f(\widehat{D}) \le f(D^*). \end{aligned}$$
  2. (ii)

    (\(\epsilon \)-Critical point) A point \(\overline{D}\) is called an \(\epsilon \)-critical point of (11) if there exist \(\overline{\varGamma } \in \partial f(\overline{ D})\) and \(\bar{\beta } \in \mathfrak {R}\) such that

    $$\begin{aligned} g({\overline{D}} )\le \epsilon ,~ \left\langle {\bar{\varGamma }} + {\bar{\beta }} {\overline{D}} + {\bar{\beta }} \varPi _{\mathcal{K}^n_+(r)}(-{\overline{D}}), \; D-{\overline{D}} \right\rangle \ge 0,~~\forall ~D\in \mathcal{B}. \end{aligned}$$
    (35)

Obviously, if \(\epsilon =0\), an \(\epsilon \)-optimal solution \(\widehat{D}\) would be an optimal solution of (11). We will show that the optimal solution of (16) is \(\epsilon \)-optimal as well as an \(\epsilon \)-critical point provided that \(\rho \) is large enough. Let \(D^*_\rho \) be an optimal solution of the penalized REE (16) and \(D_r\) be any feasible solution of the original problem (11). If the lower bound matrix \(L \equiv 0\), then we can simply choose \(D_r =0\). Define

$$\begin{aligned} \rho _{\epsilon } := f(D_r)/\epsilon . \end{aligned}$$

We have the following result.

Proposition 1

For any \(\rho \ge \rho _{\epsilon }\), \(D^*_\rho \) must be an \(\epsilon \)-optimal solution as well as an \(\epsilon \)-critical point of (11). That is, there exists \(\varGamma ^* \in \partial f(D^*_\rho )\) such that

$$\begin{aligned} D^*_\rho \in \mathcal{B}, \qquad f(D^*_\rho ) \le f(D^*), \qquad g(D^*_\rho )\le & {} \epsilon \end{aligned}$$
(36)
$$\begin{aligned} \left\langle \varGamma ^*+ \rho D^*_\rho + \rho \varPi _{\mathcal{K}^n_+(r)}(-D^*_\rho ), D-D^*_\rho \right\rangle\ge & {} 0,~~\forall ~D\in \mathcal{B}. \end{aligned}$$
(37)

4.2 Solving the subproblem

Having constructed the majorization function in (31), we now focus on how to solve the majorization subproblem (18), which is equivalent to the solution of the following problem. Given the current iterate \(Z \in \mathcal{B}\), the majorization subproblem aims to compute an improved iterate, denoted by \(Z^+\), by solving

$$\begin{aligned} Z^+= & {} \arg \min _{D \in \mathcal{B}} \; f(D) + \rho g_m(D, Z) \nonumber \\= & {} \arg \min _{D \in \mathcal{B}} \sum _{i,j=1}^n W_{ij} | \sqrt{D_{ij}} - \delta _{ij} | + \frac{\rho }{2} \Vert D + \varPi _{\mathcal{K}^n_+(r)}(-Z) \Vert ^2 \nonumber \\= & {} \arg \min _{D\in \mathcal{B}} \; \sum _{i,j=1}^n W_{ij} | \sqrt{D_{ij}} - \delta _{ij} | + \frac{\rho }{2}\Vert D - Z_K \Vert ^2, \end{aligned}$$
(38)

where the matrix \(Z_K := - \varPi _{\mathcal{K}^n_+(r)}(-Z)\). This subproblem has a perfect separability property that allows it to be computed elementwise:

$$\begin{aligned} Z_{ij}^+= & {} \underset{L_{ij} \le D_{ij}\le U_{ij}}{\arg \min }~\frac{\rho }{2}\left[ D_{ij} - (Z_K)_{ij} \right] ^2 + W_{ij} |\sqrt{D_{ij}}-\delta _{ij}| \nonumber \\= & {} \underset{L_{ij} \le D_{ij}\le U_{ij}}{\arg \min }~\frac{1}{2}\left[ D_{ij} - (Z_K)_{ij} \right] ^2 + \frac{W_{ij}}{\rho } |\sqrt{D_{ij}}-\delta _{ij}| . \end{aligned}$$
(39)

For ease of description, we denote the subproblem solution process by

$$\begin{aligned} Z^+ = {\texttt {PREEEDM}}_{\mathcal{B}} (Z_K,\; W/\rho , \; \varDelta ). \end{aligned}$$
(40)

Here, PREEEDM stands for the Penalized REE by EDM optimization. We will show how (40) can be computed.

Let us consider a simplified one-dimensional optimization problem, whose solution will eventually give rise to PREEEDM. Let B denote the interval [a, b] in \(\mathfrak {R}\) with \(0 \le a \le b\). For given \(\omega \in \mathfrak {R}, \delta >0\) and \(\beta > 0\), we aim to compute

$$\begin{aligned} {\texttt {dcroot}}_B[\omega , \beta ,\delta ]:=\arg \min _{a \le x \le b}\ q(x) := \frac{1}{2} (x-\omega )^2 + \beta |\sqrt{x}-\delta |. \end{aligned}$$
(41)

The acronym dcroot stands for the root of a depressed cubic equation, which will eventually give rise to the solution formula of (41). It suffices to consider the case that matters to us:

$$\begin{aligned} \beta>0, \quad \delta >0 \quad \hbox {and} \quad a \le \delta ^2 \le b. \end{aligned}$$

Before solving the above problem, we define some notation for convenience

$$\begin{aligned} \left\{ \begin{array}{l} \gamma _{\omega , \beta }:=\frac{\left[ \omega +\sqrt{\omega ^2+2\beta }\right] ^2}{4}, \\ u := \frac{\beta }{4}, \ \ v := \frac{\omega }{3} \ \ \hbox {and} \ \ \tau := u^2 - v^3 \\ B_- := [a, \; \delta ^2] \qquad \hbox {and} \qquad B_+ :=[\delta ^2, \; b]. \end{array} \right. \end{aligned}$$
(42)

Obviously, q(x) has a representation of two pieces:

$$\begin{aligned} q(x) = \left\{ \begin{array}{ll} q_-(x) := \frac{1}{2} (x - \omega )^2 - \beta \sqrt{x} + \beta \delta &{} \ \ \hbox {for } x \in B_- \\ q_+(x) := \frac{1}{2} (x - \omega )^2 + \beta \sqrt{x} - \beta \delta &{}\ \ \hbox {for } x \in B_+ \end{array} \right. \end{aligned}$$

It is noted that \(q_-(x)\) is convex, but \(q_+(x)\) may not necessarily be so. We will show that both pieces have a closed-form formula for their respective minimum.

Proposition 2

Consider the optimization problem:

$$\begin{aligned} x^*_- := {{\,\mathrm{argmin}\,}}q_-(x), \qquad \hbox {s.t.} \ \ x \in B_-. \end{aligned}$$
(43)

Define

$$\begin{aligned} x^-_{\omega , \beta } =\left\{ \begin{array}{ll} \left[ ( u + \sqrt{\tau })^{1/3} + ( u - \sqrt{\tau })^{1/3} \right] ^2,&{}\tau \ge 0,\\ &{} \\ 4v \cos ^2\left[ \frac{1}{3}{} \text {arccos} (uv^{-3/2})\right] ,&{}\tau < 0. \end{array} \right. \end{aligned}$$
(44)

Then (43) has a unique solution \(x^*_-\) given by

$$\begin{aligned} x^*_- = \varPi _{B_-}(x^-_{\omega , \beta }) := \min \{\delta ^2,\max \{a,\; x^-_{\omega , \beta } \}\} \ \ \ \text {and}\ \ \ \ x^*_- \ge \min \{ \delta ^2, \; 1, \; \gamma _{\omega , \beta } \}. \end{aligned}$$

Now we characterize the optimal solution of \(q_+(x)\) over \(B_+\).

Proposition 3

Assume that \(\beta < 4 \delta ^3\) and consider the optimization problem:

$$\begin{aligned} x^*_+ := {{\,\mathrm{argmin}\,}}q_+(x), \qquad \hbox {s.t.} \ \ x \in B_+. \end{aligned}$$
(45)

Define

$$\begin{aligned} x^+_{\omega , \beta } := \left\{ \begin{array}{ll} \delta ^2 &{} \quad \hbox {if }\ \tau \ge 0 \\ &{} \\ 4 v \cos ^2\left[ \frac{1}{3}\arccos (-uv^{-3/2})\right] &{} \quad \hbox {if }\ \tau < 0, \end{array} \right. \end{aligned}$$
(46)

Then \(q_+(x)\) is strictly convex over the interval \([\delta ^2, \infty )\) and

$$\begin{aligned} x^*_+ = \varPi _{B_+} (x^+_{\omega , \beta }) := \max \{\delta ^2,\min \{ b,\; x^+_{\omega , \beta } \}\} . \end{aligned}$$

Putting together Propositions 2 and 3 gives rise to the optimal solution of (41). The optimal solution is either \(x^*_-\) or \(x^*_+\), whichever gives a lower functional value of q(x). This is the first result of our major theorem below. We note that both Propositions 2 and 3 make use of the convexity of \(q_-(x)\) and \(q_+(x)\) on the respective intervals \([a, \delta ^2]\) and \([\delta ^2, b]\). In fact, we can establish a stronger result: when the two pieces join together, the resulting function q(x) is still convex on the whole interval [a, b]. This result is very important to our convergence analysis in the next section and is the second result of the theorem below. A key tool for the proof is Lemma 1.

Theorem 1

Let B denote the interval [a, b] with \(0 \le a \le \delta ^2 \le b\). We assume \(0< \beta < 4 \delta ^3\). Then, the following hold.

  1. (i)

    The optimal solution of the problem (41) is given by

    $$\begin{aligned} {\texttt {dcroot}} _B[\omega , \beta ,\delta ] = \left\{ \begin{array}{lll} x^*_-,&{}\mathrm{if}&{} q(x^*_-)\le q(x^*_+),\\ x^*_+,&{}\mathrm{if}&{} q(x^*_-)> q(x^*_+).\\ \end{array} \right. \end{aligned}$$
  2. (ii)

    The function q(x) is strictly convex on [a, b]. Consequently, there exists \(\xi \in \partial q({\texttt {dcroot}} _B[\omega , \beta ,\delta ] )\) such that

    $$\begin{aligned} \xi (x-{\texttt {dcroot}} _B[\omega , \beta , \delta ] )\ge 0 ~~~~\text {for~any}~ x\in B. \end{aligned}$$
  3. (iii)

    Let \(\gamma _{\omega , \beta }\) be defined in (42), then \({\texttt {dcroot}} _B[\omega , \beta ,\delta ] \ge \min \{\delta ^2, b, 1, \gamma _{\omega , \beta }\}\). We view \({\texttt {dcroot}} _B[\omega , \beta ,\delta ]\) as a function of \(\omega \). Suppose \(C>0\) is an arbitrarily given constant. Then there exists a constant \(\kappa >0\) such that

    $$\begin{aligned} {\texttt {dcroot}} _B[\omega , \beta ,\delta ] > \kappa \qquad \forall \ \omega \ \hbox {such that} \ | \omega | \le C. \end{aligned}$$

Comment: The optimal solution \({{\texttt {dcroot}}}_B[\omega , \beta ,\delta ]\) is unique, since q(x) is strictly convex over [a, b]. However, its location could be within the interval \([a, \delta ^2]\) or \([\delta ^2, b]\), depending on the magnitudes of the parameters (\(\omega , \beta \) and \(\delta \)) involved. The dependence is illustrated in Fig. 1. We also note that the function q(x) may not be convex if the condition \(\beta < 4 \delta ^3\) is violated. \(\square \)

Fig. 1 Illustration of the convexity of \(q(x)=0.5(x-\omega )^2+\beta |\sqrt{x}-\delta |\) over the interval [0, 6] with \(\beta = 4\): the global minimum occurs on \(q_-(x)\) (left, \(\omega = 1\), \(\delta =2\)) and on \(q_+(x)\) (right, \(\omega = 5\), \(\delta = \sqrt{2}\))

It now follows from Theorem 1 that the optimal solution \(Z^+_{ij}\) in (39) can be computed by:

$$\begin{aligned} Z^+_{ij} =\left\{ \begin{array}{lll} {\texttt {dcroot}}_{[L_{ij}, U_{ij}]}[ (Z_{K})_{ij},\ W_{ij}/\rho ,\ \delta _{ij} ], &{}&{} W_{ij}>0 \\ &{}\\ \varPi _{[L_{ij}, U_{ij}]}((Z_{K})_{ij}), &{}&{} W_{ij}=0 \end{array} \right. \end{aligned}$$
(47)

Consequently, \(Z^+ = {\texttt {PREEEDM}}_\mathcal{B}(Z_K, W/\rho , \varDelta )\) in (40) is well defined and its elements can be computed by (47).
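A NumPy sketch of the closed-form solution described by Propositions 2-3 and Theorem 1(i), under the standing assumptions \(0 \le a \le \delta ^2 \le b\), \(\delta > 0\) and \(0< \beta < 4\delta ^3\) (illustrative only; the authors' implementation is in MATLAB):

```python
import numpy as np

def dcroot(omega, beta, delta, a, b):
    """Minimizer of q(x) = 0.5*(x - omega)**2 + beta*|sqrt(x) - delta| over [a, b],
    assuming 0 <= a <= delta**2 <= b and 0 < beta < 4*delta**3 (cf. (41)-(47))."""
    u, v = beta / 4.0, omega / 3.0
    tau = u**2 - v**3
    # left piece q_-: unconstrained minimizer (44), then project onto [a, delta^2]
    if tau >= 0:
        xm = (np.cbrt(u + np.sqrt(tau)) + np.cbrt(u - np.sqrt(tau)))**2
    else:                                    # tau < 0 forces v > 0
        xm = 4.0*v*np.cos(np.arccos(u / v**1.5) / 3.0)**2
    xm = min(delta**2, max(a, xm))
    # right piece q_+: formula (46), then project onto [delta^2, b]
    xp = delta**2 if tau >= 0 else 4.0*v*np.cos(np.arccos(-u / v**1.5) / 3.0)**2
    xp = max(delta**2, min(b, xp))
    q = lambda x: 0.5*(x - omega)**2 + beta*abs(np.sqrt(x) - delta)
    return xm if q(xm) <= q(xp) else xp      # Theorem 1(i)
```

Applying dcroot entrywise with \(\omega = (Z_K)_{ij}\), \(\beta = W_{ij}/\rho \), \(\delta = \delta _{ij}\), \(a = L_{ij}\) and \(b = U_{ij}\) gives the first branch of (47); the second branch is a simple projection onto \([L_{ij}, U_{ij}]\).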

5 Algorithm PREEEDM and its convergence

With the preparations above, we are ready to state our algorithm. Let \(D^k \in \mathcal{B}\) be the current iterate. We update it by solving the majorization subproblem of the type (38) with Z replaced by \(D^k\):

$$\begin{aligned} D^{k+1} = \arg \min \left\{ f_\rho ^k(D) := f(D) + \rho g_m(D, D^k) \right\} , \ \hbox {s.t.} \ \ D \in \mathcal{B}, \end{aligned}$$
(48)

which can be computed by

$$\begin{aligned} D^{k+1} = {\texttt {PREEEDM}}_\mathcal{B}(- \varPi _{\mathcal{K}^n_+(r)} (-D^k),\; W/\rho ,\; \varDelta ). \end{aligned}$$
(49)

In more detail, we have

$$\begin{aligned} f_\rho ^k(D)= & {} \Vert W \circ (\sqrt{D} - \varDelta ) \Vert _1 + \frac{\rho }{2} \Vert D + \varPi _{\mathcal{K}^n_+(r)} (-D^k) \Vert ^2 \\= & {} \sum _{i,j} \underbrace{ \left[ \frac{\rho }{2} \Big ( D_{ij} - (Z_K^k)_{ij} \Big )^2 + W_{ij} | \sqrt{D_{ij}} - \delta _{ij} | \right] }_{=: f^k_{ij}(D_{ij})} \end{aligned}$$

where \(Z_K^k := - \varPi _{\mathcal{K}^n_+(r)} (-D^k)\), and the elements of \(D^{k+1}\) are computed as follows:

$$\begin{aligned} D^{k+1}_{ij}= & {} {{\,\mathrm{argmin}\,}}_{L_{ij} \le D_{ij} \le U_{ij}} \left\{ \frac{1}{2} \Big [D_{ij} -(Z_K^k)_{ij} \Big ]^2 + \frac{W_{ij}}{\rho } \Big | \sqrt{D_{ij}} - \delta _{ij}\Big | \right\} \nonumber \\= & {} \left\{ \begin{array}{ll} {\texttt {dcroot}}_{[L_{ij}, U_{ij}]}\Big [(Z_K^k)_{ij} ,\; W_{ij}/\rho ,\; \delta _{ij} \Big ], &{}\ \ \hbox {if} \ W_{ij}>0 \\ &{}\\ \varPi _{[L_{ij}, U_{ij}]}\Big [(Z_K^k)_{ij} \Big ], &{} \ \ \hbox {if} \ W_{ij}=0. \end{array} \right. \end{aligned}$$
(50)

Our algorithm PREEEDM is formally stated as follows.

Algorithm 1 (PREEEDM)
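A minimal sketch of the main loop (48)-(50) of Algorithm 1, reusing the helpers proj_Knr (28) and dcroot (41) from the sketches above; the initial point, stopping rule and default parameters here are simplified stand-ins for the choices detailed in Sect. 6.1, and \(\rho \) is assumed to satisfy \(\rho > \rho _o\) from (51).

```python
import numpy as np

def preeedm(Delta, W, L, U, r, rho, max_iter=1000, tol=1e-4):
    """Sketch of the PREEEDM iteration: majorize g, then solve (50) entrywise."""
    n = Delta.shape[0]
    D = np.clip(Delta**2, L, U)                     # a simple feasible starting point
    for _ in range(max_iter):
        ZK = -proj_Knr(-D, r)                       # Z_K^k = -Pi_{K^n_+(r)}(-D^k)
        D_new = np.empty_like(D)
        for i in range(n):
            for j in range(n):                      # elementwise update (50)
                if W[i, j] > 0:
                    D_new[i, j] = dcroot(ZK[i, j], W[i, j]/rho, Delta[i, j],
                                         L[i, j], U[i, j])
                else:
                    D_new[i, j] = min(U[i, j], max(L[i, j], ZK[i, j]))
        done = np.linalg.norm(D_new - D) <= tol * max(1.0, np.linalg.norm(D))
        D = D_new
        if done:
            break
    return D   # embed via (9)-(10), e.g. cmds_embed(D, r) from the earlier sketch
```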

A major obstacle in analysing the convergence for the penalized EDM model (16) is the non-differentiability of the objective function. We need the following two reasonable assumptions:

Assumption 1

The constrained box \(\mathcal{B}\) is bounded.

Assumption 2

For \(\varDelta \) and U, we require \(W_{ij} =0\) if \(\delta _{ij} =0\), and \(U_{ij}\ge \delta _{ij}^2 \ge L_{ij}\) if \(\delta _{ij} >0\).

Assumption 1 can be easily satisfied (e.g., setting the upper bound to be \(n^2 \max \{\delta _{ij}^2\}\)). Assumption 2 means that if \(\delta _{ij}=0\) (e.g., value missing), the corresponding weight \(W_{ij}\) should be 0. This is a common practice in applications. If \(\delta _{ij}>0\), then we require \(\delta _{ij}^2\) to be between \(L_{ij}\) and \(U_{ij}\). We further define a quantity that bounds our penalty parameter \(\rho \) from below:

$$\begin{aligned} \rho _o := \rho _o (W,\varDelta ):=\max _{(i,j):W_{ij}>0}\frac{W_{ij}}{4\delta ^{3}_{ij}} \end{aligned}$$
(51)

Our first result in this section is about the boundedness of the subdifferential of \(f(\cdot )\) along the generated sequence \(\{ D^k\}\).

Proposition 4

Suppose Assumptions 1 and 2 hold. Let \(\rho > \rho _o \) and \(\{ D^k\}\) be the sequence generated by Algorithm 1. Then the following hold.

  1. (i)

    There exists a constant \(c_1 >0\) such that

    $$\begin{aligned} D^k_{ij} \ge c_1 \qquad \hbox {for all} \ (i,j) \ \hbox {such that}\ W_{ij} >0 \ \hbox {and} \ k=1,2,\ldots . \end{aligned}$$
  2. (ii)

    Let \(\partial f(D)\) denote the subdifferential of \(f(D) = \Vert W \circ (\sqrt{D} - \varDelta ) \Vert _1\). Then there exists a constant \(c_2 >0\) such that

    $$\begin{aligned} \Vert \varGamma \Vert \le c_2 \qquad \forall \ \varGamma \in \partial f(D^k), \ \ k=1,2,\ldots . \end{aligned}$$
  3. (iii)

    The function \(f_\rho ^k(D)\) is convex for all \(k=1, 2, \ldots \). Moreover, there exists \(\varGamma ^{k+1} \in \partial f(D^{k+1})\) such that the first-order optimality condition for (49) is

    $$\begin{aligned} \left\langle \varGamma ^{k+1} + \rho D^{k+1} + \rho \varPi _{\mathcal{K}^n_+(r)} (-D^k), \ D - D^{k+1} \right\rangle \ge 0, \qquad \forall \ D \in \mathcal{B}. \end{aligned}$$
    (52)

Proposition 4(i) ensures that \(D^k_{ij} >0\) whenever \(W_{ij}>0\), for all \(k=1,2,\ldots \). Hence, we can apply Lemma 1 to each function \(\phi _{\delta _{ij}}(\cdot )\) with \(W_{ij}>0\), taking \(x = D^{k+1}_{ij}\) and \(y= D^k_{ij}\). This yields, for any \(\zeta ^{k+1}_{ij} \in \partial \phi _{\delta _{ij}} (D^{k+1}_{ij})\),

$$\begin{aligned} \phi _{\delta _{ij}}(D^{k+1}_{ij}) - \phi _{\delta _{ij}}(D^{k}_{ij}) \le \zeta ^{k+1}_{ij} (D^{k+1}_{ij} - D^{k}_{ij}) + \frac{1}{2} \frac{(D^{k+1}_{ij} - D^{k}_{ij})^2}{4\delta ^3_{ij}} , \end{aligned}$$

Multiplying \(W_{ij}\) on both sides and adding those inequalities over (ij), we get

$$\begin{aligned} f(D^{k+1})-f(D^k)\le \langle \varGamma ^{k+1}, \; D^{k+1}-D^k \rangle +\frac{ \rho _o }{2} \Vert D^{k+1} - D^k \Vert ^2, \end{aligned}$$
(53)

where \(\varGamma ^{k+1}_{ij} :=W_{ij} \zeta ^{k+1}_{ij}\). We note that the inequality (53) holds for any \(\varGamma ^{k+1} \in \partial f(D^{k+1})\).

Theorem 2

Let \(\rho > \rho _o \) and \(\{ D^k\}\) be the sequence generated by Algorithm 1. Suppose Assumptions 1 and 2 hold.

  1. (i)

    We have

    $$\begin{aligned} f_\rho (D^{k+1}) - f_\rho (D^k) \le -\frac{\rho - \rho _o }{2} \Vert D^{k+1} - D^k \Vert ^2~~\text {for~any}~~k=0,1,\ldots ,. \end{aligned}$$

    Consequently, \(\Vert D^{k+1} - D^k \Vert \rightarrow 0\).

  2. (ii)

    Let \(\widehat{D}\) be an accumulation point of \(\{D^k\}\). Then there is \(\widehat{\varGamma }\in \partial f(\widehat{D})\) such that

    $$\begin{aligned} \langle \widehat{\varGamma } + \rho \widehat{D} + \rho \varPi _{\mathcal{K}^n_+(r)}(-\widehat{D}), \; D - \widehat{D} \rangle \ge 0~~~~\text {for~any}~D \in \mathcal{B}. \end{aligned}$$
    (54)

    That is, \(\widehat{D}\) is a critical point of the problem (16). Moreover, for a given \(\epsilon >0\), if \(-D^0 \in \mathcal{K}^n_+(r)\), \(D^0 \in \mathcal{B}\) and

    $$\begin{aligned} \rho \ge \rho _\epsilon := \max \{\rho _o, f(D^0)/\epsilon \}, \end{aligned}$$

    then \(\widehat{D}\) is an \(\epsilon \)-critical point of the original problem (11).

  3. (iii)

    If \(\widehat{D}\) is an isolated accumulation point of the sequence \(\{D^k\}\), then the whole sequence \(\{D^k\}\) converges to \(\widehat{D}\).

Proof

(i) We are going to use the following facts that are stated on \(D^{k+1}\) and \(D^k\). The first fact is the identity:

$$\begin{aligned} \Vert D^{k+1} \Vert ^2 - \Vert D^k\Vert ^2 = 2 \langle D^{k+1} - D^k,\; D^{k+1} \rangle - \Vert D^{k+1} - D^k \Vert ^2. \end{aligned}$$
(55)

The second fact is due to the convexity of h(D) (see Lemma 2(iii)):

$$\begin{aligned} h(-D^{k+1}) - h(-D^k) \ge \langle \varPi _{\mathcal{K}^n_+(r)} (-D^k),\; -D^{k+1} +D^k \rangle . \end{aligned}$$
(56)

The last fact is that there exists \(\varGamma ^{k+1}\in \partial f(D^{k+1})\) such that (52). Those facts yield the following chain of inequalities:

$$\begin{aligned}&f_\rho (D^{k+1}) - f_\rho (D^k) \\&\quad = f(D^{k+1}) - f(D^k) + \rho g(D^{k+1}) - \rho g(D^k) \\&\quad {\mathop {\le }\limits ^{\text {(53)}}} \langle \varGamma ^{k+1}, \; D^{k+1} - D^k \rangle +\frac{ \rho _o }{2} \Vert D^{k+1} - D^k \Vert ^2 +\rho g(D^{k+1}) - \rho g(D^k) \\&\quad {\mathop {=}\limits ^{\text {(30)}}} \langle \varGamma ^{k+1}, \; D^{k+1} - D^k \rangle +\frac{ \rho _o }{2} \Vert D^{k+1} - D^k \Vert ^2\\&\qquad + (\rho /2) ( \Vert D^{k+1} \Vert ^2 - \Vert D^k\Vert ^2 ) - \rho [ h(-D^{k+1}) - h(-D^k) ] \\&\quad {\mathop {=}\limits ^{\text {(55)}}} \langle \varGamma ^{k+1} +\rho D^{k+1}, \; D^{k+1} - D^k \rangle \\&\qquad -\frac{\rho - \rho _o }{2} \Vert D^{k+1} - D^k \Vert ^2 - \rho [ h(-D^{k+1}) - h(-D^k) ] \\&\quad {\mathop {\le }\limits ^{\text {(56)}}} \langle \varGamma ^{k+1} +\rho D^{k+1} + \rho \varPi _{\mathcal{K}^n_+(r)}(-D^k), \; D^{k+1} - D^k \rangle -\frac{\rho - \rho _o }{2} \Vert D^{k+1} - D^k \Vert ^2 \\&\quad {\mathop {\le }\limits ^{\text {(52)}}} -\frac{\rho - \rho _o }{2} \Vert D^{k+1} - D^k \Vert ^2. \end{aligned}$$

This proves that the sequence \(\{ f_\rho (D^k)\}\) is non-increasing and it is also bounded below by 0. Taking the limits on both sides yields \(\Vert D^{k+1} - D^k \Vert \rightarrow 0\).

(ii) Suppose \(\widehat{D}\) is the limit of a subsequence \(\{ D^{k_\ell }\}\), \(\ell =1, \ldots \). Since we have established in (i) that \((D^{k_{\ell }+1} - D^{k_\ell }) \rightarrow 0\), the sequence \(\{ D^{k_\ell +1}\}\) also converges to \(\widehat{D}\). Furthermore, there exists a sequence of subgradients \(\varGamma ^{k_\ell +1} \in \partial f(D^{k_\ell +1})\) such that (52) holds. Proposition 4(ii) ensures that there exists a constant \(c_2>0\) such that \(\Vert \varGamma ^{k_\ell +1} \Vert \le c_2\) for all \(k_\ell \). Hence, there exists a subsequence of \(\{k_\ell \}\) (we still denote the subsequence by \(\{k_\ell \}\) for simplicity) such that \(\varGamma ^{k_\ell +1}\) converges to some \(\widehat{\varGamma } \in \partial f(\widehat{D})\). Now taking the limits on both sides of (52) along \(\{ k_\ell \}\), we reach the desired inequality (54). We now prove \(\widehat{D}\) is an \(\epsilon \)-critical point of (11). Since we already have (54), we only need to show \(g(\widehat{D}) \le \epsilon \). It follows from \(-D^0 \in {\mathcal{K}^n_+(r)}\) and \(D^0 \in \mathcal{B}\) that

$$\begin{aligned} f(D^0)&= f(D^0) + \rho g(D^0) \qquad (\hbox {because} \ g(D^0) = 0)\\&{\mathop {\ge }\limits ^{\text {(18)}}} f(D^1) + \rho g_m(D^1, D^0) \qquad (\hbox {because} \ D^0\in \mathcal{B})\\&{\mathop {\ge }\limits ^{\text {(31)}}} f(D^1) + \rho g(D^1)= f_\rho (D^1)\\&\ge \cdots \ge f_\rho (D^k)= f(D^k) + \rho g(D^k). \qquad (\hbox {because~of (i)}) \end{aligned}$$

Taking the limit on the right-hand side yields

$$\begin{aligned} f(D^0) \ge f(\widehat{D}) + \rho g(\widehat{D}) \ge \rho g(\widehat{D}), \end{aligned}$$

where we used \(f(\widehat{D}) \ge 0\). Therefore, since \(\rho \ge \rho _\epsilon \), we have

$$\begin{aligned} g(\widehat{D}) \le f(D^0) /\rho \le f(D^0) /\rho _\epsilon \le \epsilon . \end{aligned}$$

(iii) We note that we have proved in (i) that \((D^{k+1} - D^k) \rightarrow 0\). The convergence of the whole sequence to \(\widehat{D}\) follows from [26, Proposition 7]. \(\square \)

6 Numerical experiments

In this part, we conduct extensive numerical experiments with our algorithm PREEEDM, implemented in MATLAB (R2014a) and run on a desktop with 8GB of memory and an Intel(R) Core(TM) i5-4570 3.2GHz CPU, against 6 leading solvers on the problems of Sensor Network Localization (SNL) in \(\mathfrak {R}^2\) (\(r=2\)) and Molecular Conformation (MC) in \(\mathfrak {R}^3\) (\(r=3\)). This section is split into the following parts. Our implementation of PREEEDM is described in Sect. 6.1. We give a brief explanation of how the six benchmark methods were selected in Sect. 6.2. Descriptions of how the test data of SNL and MC were collected and generated, together with extensive numerical comparisons, are reported in Sect. 6.3. The MATLAB package is available at DOI: https://doi.org/10.5281/zenodo.3343047.

6.1 Implementation

The PREEEDM Algorithm 1 is easy to implement. We first address the stopping criterion to be used in Step 3 of Algorithm 1. We monitor two quantities. One measures how close the current iterate \(D^k\) is to being Euclidean (i.e., belonging to \(\mathcal{K}^n_+(r)\)). This can be computed by using (28) as follows.

$$\begin{aligned} \texttt {Kprog}_k := \frac{2 g(D^k)}{\Vert JD^kJ \Vert ^2}= & {} \frac{\Vert {\mathrm {PCA}}^+_r(-JD^kJ) + (JD^kJ) \Vert ^2}{\Vert JD^kJ \Vert ^2}\\= & {} 1 - \frac{\sum _{i=1}^r \left[ \lambda _i^2-(\lambda _i - \max \{\lambda _i, 0\} )^2\right] }{\lambda _1^2 + \cdots + \lambda _n^2}\\\le & {} 1, \end{aligned}$$

where \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _n\) are the eigenvalues of \((-JD^kJ)\). The smaller \(\texttt {Kprog}_k\) is, the closer \(D^k\) is to \(\mathcal{K}^n_+(r)\). The benefit of using \(\texttt {Kprog}\) over g(D) is that the former is independent of any scaling of D.
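
To make this quantity concrete, the following sketch (in Python with NumPy, rather than the authors' MATLAB package) computes \(\texttt {Kprog}\) from the eigenvalues of \(-JD^kJ\); the function name and the use of a full eigendecomposition are for illustration only.

```python
import numpy as np

def kprog(D, r):
    """Scale-invariant measure of how close the symmetric matrix D is to K^n_+(r),
    following the formula above (an illustrative sketch, not the package code)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centring matrix J
    A = -J @ D @ J                                 # -J D J
    lam = np.linalg.eigvalsh(A)[::-1]              # eigenvalues in descending order
    top = lam[:r]                                  # the r largest eigenvalues
    num = np.sum(top**2 - (top - np.maximum(top, 0.0))**2)
    return 1.0 - num / np.sum(lam**2)
```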

The other quantity measures the progress in the functional value \(f_\rho (\cdot )\) at the current iterate \(D^k\). In theory (see Theorem 2), we should require \(\rho >\rho _o\), which is defined in (51) and is potentially large. As with most penalty methods [34, Chp. 17], starting with a very large penalty parameter may degrade the performance of the method (e.g., causing ill-conditioning). We therefore adopt a dynamic updating rule for \(\rho \). In particular, we choose \(\rho _0 =\frac{\kappa \max \delta _{ij}}{n^{3/2}}\) and update it as

$$\begin{aligned} \rho _{k+1}=\left\{ \begin{array}{rll} 1.25\rho _{k},&{}&{}\text {if}~ \texttt {Kprog}_k> \texttt {Ktol}, \texttt {Fprog}_k \le 0.2\texttt {Ftol}, \\ 0.75\rho _{k},&{}&{}\text {if}~ \texttt {Fprog}_k > \texttt {Ftol}, \texttt {Kprog}_k \le 0.2\texttt {Ktol}, \\ \rho _{k},&{}&{}\text {otherwise}, \end{array}\right. \end{aligned}$$

where

$$\begin{aligned} \texttt {Fprog}_k := \frac{f_{\rho _{k-1}} (D^{k-1}) - f_{\rho _{k-1}} (D^{k})}{1+\rho _{k-1} + f_{\rho _{k-1}} (D^{k-1}) }, \end{aligned}$$
(57)

and \(\texttt {Ftol}=\ln (\kappa )\times 10^{-4}\) and \(\texttt {Ktol}=10^{-2}\) with \(\kappa \) being the number of non-zero elements of \(\varDelta \). We terminate PREEEDM when

$$\begin{aligned} \texttt {Fprog}_k \le \texttt {Ftol} \ \ \ \ \ \hbox {and} \ \ \ \ \ \texttt {Kprog}_k \le \texttt {Ktol}. \end{aligned}$$
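
The updating rule and the termination test are straightforward to code. The sketch below (illustrative Python with our own function names; the default tolerances are placeholders, whereas the paper uses \(\texttt {Ftol}=\ln (\kappa )\times 10^{-4}\)) mirrors them directly, including the relative progress measure (57).

```python
def fprog(f_prev, f_curr, rho_prev):
    """Relative progress (57): (f_rho(D^{k-1}) - f_rho(D^k)) / (1 + rho_{k-1} + f_rho(D^{k-1}))."""
    return (f_prev - f_curr) / (1.0 + rho_prev + f_prev)

def update_rho(rho, Kprog, Fprog, Ktol=1e-2, Ftol=1e-4):
    """Dynamic penalty update following the rule above (a sketch, not the package code)."""
    if Kprog > Ktol and Fprog <= 0.2 * Ftol:
        return 1.25 * rho   # far from K^n_+(r) but little functional progress: tighten penalty
    if Fprog > Ftol and Kprog <= 0.2 * Ktol:
        return 0.75 * rho   # nearly Euclidean but still making fast progress: relax penalty
    return rho

def terminate(Kprog, Fprog, Ktol=1e-2, Ftol=1e-4):
    """Stopping test of PREEEDM: both relative measures fall below their tolerances."""
    return Fprog <= Ftol and Kprog <= Ktol
```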

Since the computation of each iteration is dominated by \(\varPi _{\mathcal{K}^n_+(r)} (-D)\) in the construction of the majorization function \(g_m(\cdot , \cdot )\) in (31), the computational complexity per iteration is about \(O(rn^2)\) (we used MATLAB's built-in function eigs.m to compute \({\mathrm {PCA}}_r^+(A)\) in (27)). The problem data input \(\varDelta \), L and U will be described below. For the initial point, we follow the popular choice used in [44, 47] and set \(\sqrt{D^0} := \widehat{\varDelta }\), where \(\widehat{\varDelta }\) is the matrix of shortest-path distances computed from \(\varDelta \). If \(\varDelta \) has no missing values, then \(\widehat{\varDelta } = \varDelta \).
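
For illustration, the following Python/NumPy sketch shows the two ingredients just mentioned: the shortest-path initialization \(\sqrt{D^0} = \widehat{\varDelta }\) and the truncated spectral map \({\mathrm {PCA}}_r^+\). A full eigendecomposition is used for brevity (a partial eigensolver such as scipy.sparse.linalg.eigsh would play the role of eigs.m for large n); we assume missing dissimilarities are stored as zeros and the observation graph is connected.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def initial_D(Delta):
    """sqrt(D^0) = shortest-path completion of Delta; zeros are treated as missing
    dissimilarities (scipy's dense-graph convention). Returns the squared matrix D^0."""
    hat_Delta = shortest_path(Delta, directed=False)
    return hat_Delta ** 2

def pca_r_plus(A, r):
    """PCA_r^+(A): keep the r largest eigenvalues of the symmetric matrix A,
    truncated at zero (a sketch of the map in (27))."""
    lam, V = np.linalg.eigh(A)                  # eigenvalues in ascending order
    lam, V = lam[::-1][:r], V[:, ::-1][:, :r]   # the r largest eigenpairs
    lam = np.maximum(lam, 0.0)                  # truncate negative eigenvalues at zero
    return (V * lam) @ V.T
```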

6.2 Benchmark methods

We select six representative state-of-the-art methods for comparison. They are ADMMSNL [37], ARAP [54], EVEDM (short for EepVecEDM) [12], PC [1], PPAS (short for PPA Semismooth) [24] and SFSDP [27]. These methods have been shown to return satisfactory localizations/embeddings in many applications. We compare our method PREEEDM with ADMMSNL, ARAP, EVEDM, PC and SFSDP on Sensor Network Localization (SNL, \(r=2\)) problems and with EVEDM, PC, PPAS and SFSDP on Molecular Conformation (MC, \(r=3\)) problems, since the current implementations of ADMMSNL and ARAP do not support embedding for \(r\ge 3\).

We note that ADMMSNL is motivated by [45] and aims to enhance the package diskRelax of [45] for SNL problems (\(r=2\)). Both methods are based on the stress minimization (5). As mentioned before, SMACOF [13, 14] has been a very popular method for (5). However, we will not compare it with the other methods here since its performance demonstrated in [54, 56] was not very satisfactory (e.g., when compared with ARAP) for either SNL or MC problems. To the best of our knowledge, PC is the only viable method whose code is publicly available for the model (3). We select SFSDP and PPAS because of their high reputation in the field of SDP and quadratic SDP for returning quality localizations and conformations. We note that SFSDP is for the model (4), and the methods PPAS and EVEDM are proposed for the model (6). It is worth mentioning that the MADMM package in [29] is capable of solving the Robust MDS (4) as well as other nonsmooth optimization problems. However, MADMM does not include an implementation of its listed example, Robust MDS, so we were not able to compare it with ours here. We also implemented the subgradient method of Cayton and Dasgupta [8] for their robust Euclidean embedding. Numerical experiments showed that its performance was similar to that of PC on our tested problems: it works well when a large number of the dissimilarities in \(\varDelta \) are available and often performs poorly otherwise. Hence, we omitted it from our reported results.

In our tests, we used all of their default parameters except one or two in order to achieve the best results. In particular, for PC, we terminate it when \(|f(D^{k-1})-f(D^{k})|<10^{-4}\times f(D^{k})\) and set its initial point to be the embedding by cMDS on \(\varDelta \). For SFSDP, which is a high-level MATLAB implementation of the SDP approach initiated in [50], we set pars.SDPsolver\(=\)“sedumi” because it returns the best overall performance, and pars.objSW\(=1\) when \(m>r+1\) and \(=3\) when \(m=0\). We also note that the parameter pars.minDegree controls the degree of the graph and thus the strength of the SDP relaxation. Numerical experiments have shown that the larger it is, the more accurate the solutions generated by SFSDP might be. However, the computational time shoots up dramatically when it increases, even for small n. Our extensive experiments suggest that its default value (\(\mathtt{pars.minDegree} =r+2\)) is a balanced choice between solution quality and computation time for large n. Hence we use its default setting in our tests. For ARAP, in order to speed up its termination, we let \(\texttt {tol} = 0.05\) and \(\texttt {IterNum} =20\) when computing its local neighbour patches. Numerical performance demonstrated that ARAP could yield satisfactory embeddings, but would take a very long time for some examples with large n.

6.3 Numerical comparison

To assess the embedding quality, we adopt the widely used Root Mean Squared Deviation (RMSD), defined by

$$\begin{aligned} \texttt {RMSD} :=\left[ \dfrac{1}{n-m}\sum _{i=m+1}^n \Vert \widehat{\mathbf{x}}_i - \mathbf{x}_i\Vert ^2\right] ^{1/2}, \end{aligned}$$

where the \(\mathbf{x}_i\)’s are the true positions of the sensors or atoms in our test problems and the \(\widehat{\mathbf{x}}_i\)’s are their corresponding estimates. The \(\widehat{\mathbf{x}}_i\)’s were obtained by applying the classical MDS (cMDS) method to the final output distance matrix, followed by aligning them to the existing anchors through the well-known Procrustes procedure (see [54], [6, Chp. 20] or [41, Proposition 4.1] for more details). Furthermore, upon obtaining the \(\widehat{\mathbf{x}}_i\)’s, a heuristic gradient method can be applied to improve their accuracy; this is called the refinement step in [5]. We report the RMSD after this step, denoted rRMSD, to highlight its contribution. As we will see, all tested methods benefit from this step, though to varying degrees.
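
As an illustration, the following Python/NumPy sketch shows one common way to carry out the alignment and the RMSD computation; it performs an orthogonal (rotation/reflection plus translation) Procrustes fit to the anchors and is not taken from the authors' code.

```python
import numpy as np

def procrustes_align(X_est, A_est, A_true):
    """Rigidly align the estimated points X_est so that the estimated anchors A_est
    best match the true anchors A_true (orthogonal transformation plus translation)."""
    mu_e, mu_t = A_est.mean(axis=0), A_true.mean(axis=0)
    U, _, Vt = np.linalg.svd((A_est - mu_e).T @ (A_true - mu_t))
    Q = U @ Vt                                   # optimal orthogonal alignment
    return (X_est - mu_e) @ Q + mu_t

def rmsd(X_hat, X_true, m):
    """RMSD over the non-anchor points (rows m..n-1), matching the formula above."""
    diff = X_hat[m:] - X_true[m:]
    return np.sqrt(np.mean(np.sum(diff**2, axis=1)))
```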

The general performance of each method can be better appreciated by visualizing the key indicators: RMSD, rRMSD, rTime (the time for the refinement step) and the CPU Time (in seconds), which is the total time including rTime. Hereafter, for all examples, we test 20 randomly generated instances for each case \((n,m,R,\texttt {nf})\) in SNL or each case \((n,R,\texttt {nf})\) in MC, and record the average results.

6.3.1 Comparison on SNL

SNL has been widely used to test the viability of many existing methods for stress minimization. In such a problem, we typically have m anchors (e.g., sensors with known locations), and the remaining sensors need to be located. We will test two types of SNL problems. One has a regular topological layout (Examples 1 and 2 below). The other has an irregular layout (Example 3).

Example 1

(Square Network with 4 fixed anchors) This example has been widely tested since its detailed study in [5]. In the square region \([-0.5, 0.5]^2\), 4 anchors \(\mathbf{x}_1 = \mathbf{a}_1, \ldots , \mathbf{x}_4 = \mathbf{a}_4\) (\(m=4\)) are placed at \((\pm 0.2, \pm 0.2)\). The remaining \((n-m)\) sensors (\(\mathbf{x}_{m+1}, \ldots , \mathbf{x}_n\)) are generated uniformly over the square region. The noisy \(\varDelta \) is generated as follows.

$$\begin{aligned} \delta _{ij}:= & {} \Vert \mathbf{x}_i - \mathbf{x}_j \Vert \times | 1 + \epsilon _{ij} \times \texttt {nf} |, \ \forall \ (i, j) \in \mathcal {N}:=\mathcal {N}_x \cup \mathcal {N}_a \\ \mathcal {N}_x:= & {} \left\{ (i,j) \ | \ \Vert \mathbf{x}_i - \mathbf{x}_j \Vert \le R, \ i> j> m \right\} \\ \mathcal {N}_a:= & {} \left\{ (i,j) \ | \ \Vert \mathbf{x}_i - \mathbf{a}_j \Vert \le R, \ i > m, \ 1\le j \le m \right\} , \end{aligned}$$

where R is known as the radio range, the \(\epsilon _{ij}\)'s are independent standard normal random variables, and nf is the noise factor (e.g., \(\texttt {nf} = 0.1\) corresponds to a \(10\%\) noise level). In the literature (e.g., [5]), this type of perturbation in \(\delta _{ij}\) is known as multiplicative, and the rule defining \(\mathcal {N}_x\) and \(\mathcal {N}_a\) is the unit-ball rule (see [3, Sect. 3.1] for more detail). The corresponding weight matrix W and the lower and upper bound matrices L and U are given in the table below. Here, M is a large positive quantity; for example, \(M :=n\max _{ij}\varDelta _{ij}\) is an upper bound on the longest shortest path if the network is viewed as a graph.

$$\begin{aligned} \begin{array}{c|cccc} (i,j) &{} W_{ij} &{} \varDelta _{ij} &{} L_{ij} &{} U_{ij} \\ \hline i=j &{} 0 &{} 0 &{} 0 &{} 0 \\ i,j\le m &{} 0 &{} 0 &{} \Vert \mathbf{a}_i - \mathbf{a}_j\Vert ^2 &{} \Vert \mathbf{a}_i - \mathbf{a}_j\Vert ^2 \\ (i,j)\in \mathcal {N} &{} 1 &{} \delta _{ij} &{} 0 &{} R^2 \\ \hbox {otherwise} &{} 0 &{} 0 &{} R^2 &{} M^2 \end{array} \end{aligned}$$
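
For concreteness, a minimal Python/NumPy sketch of this data generation (multiplicative noise and the unit-ball rule) is given below; it is illustrative only and does not reproduce the authors' MATLAB generator. The bound matrices L and U then follow the table above.

```python
import numpy as np

def generate_example1(n, R=0.3, nf=0.1, seed=0):
    """Sketch of Example 1 (m = 4 fixed anchors): sensors uniform on [-0.5, 0.5]^2,
    multiplicative noise delta_ij = ||x_i - x_j|| * |1 + eps * nf|. Illustrative only."""
    rng = np.random.default_rng(seed)
    m = 4
    anchors = np.array([[0.2, 0.2], [0.2, -0.2], [-0.2, 0.2], [-0.2, -0.2]])
    sensors = rng.uniform(-0.5, 0.5, size=(n - m, 2))
    X = np.vstack([anchors, sensors])                  # rows 0..m-1 are the anchors
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    Delta = np.zeros((n, n))
    for i in range(m, n):                              # i is a sensor
        for j in range(i):                             # j is a sensor or an anchor
            if dist[i, j] <= R:                        # unit-ball rule (N_x and N_a)
                d = dist[i, j] * abs(1.0 + rng.standard_normal() * nf)
                Delta[i, j] = Delta[j, i] = d
    W = (Delta > 0).astype(float)                      # W_ij = 1 on observed pairs
    return X, Delta, W
```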

Example 2

(Square Network with m random anchors) This example, also tested in [5], is similar to Example 1 but with randomly generated anchors. The n points are generated uniformly over the square region \([-0.5, 0.5]^2\). The first m points are then chosen to be anchors and the last \((n-m)\) points to be sensors. The rest of the data generation is the same as in Example 1.

Example 3

(EDM word network) This problem has a non-regular topology and was first used in [3] to challenge existing methods. In this example, n points are randomly generated in a region whose shape resembles the letters “E”, “D” and “M”. The ground truth network is depicted in Fig. 2. We choose the first m points to be the anchors. The rest of the data generation is the same as in Example 1.

Fig. 2 Ground truth EDM network with \(n=500\) nodes

(a) Effect of the radio range R. The radio range R determines how many dissimilarities are missing from \(\varDelta \): the smaller R is, the fewer \(\delta _{ij}\) are available, leading to more challenging problems. Therefore, we first examine the sensitivity of each method to the radio range R. For Example 1, we fix \(n=200, m=4\), nf\(=0.1\), and vary the radio range R over \(\{0.2,0.4,\ldots ,1.4\}\). The average results are shown in Fig. 3. It can be seen that ARAP and PREEEDM were joint winners in terms of both RMSD and rRMSD. However, the time used by ARAP was the longest. When R exceeded 0.6, ADMMSNL, SFSDP and EVEDM produced rRMSD similar to ARAP and PREEEDM, while the time consumed by ADMMSNL was significantly larger than that of SFSDP, EVEDM and PREEEDM. By contrast, PC only worked well when \(R\ge 1\).

Fig. 3 Average results for Example 1 with \(n=200, m=4\), nf\(=0.1\)

Next we test a number of instances with larger size \(n\in \{300, 500,1000,2000\}\). For Example 1, the average results are recorded in Table 1. When \(R=\sqrt{2}\), under which no dissimilarities were missing because Example 1 was generated in a unit region, PC, ARAP and PREEEDM produced the best RMSD (almost of the order of \(10^{-3}\)). With the refinement step, however, all methods led to similar rRMSD, which means that SFSDP and EVEDM benefited a lot from the refinement. For computational speed, PREEEDM outperformed the others, followed by PC, EVEDM and SFSDP; by contrast, ARAP consumed too much time even for \(n=500\). When \(R=0.2\), the picture was significantly different since a large number of dissimilarities in \(\varDelta \) were unavailable. Basically, ADMMSNL, PC and SFSDP failed to localize even with the refinement, owing to undesirable RMSD and rRMSD (both of the order of \(10^{-1}\)). Clearly, ARAP and PREEEDM produced the best RMSD and rRMSD, and EVEDM obtained comparable rRMSD but inaccurate RMSD. In terms of computational speed, EVEDM and PREEEDM were very fast, consuming about 30 s to solve problems with \(n=2000\) nodes. By contrast, ARAP was still the slowest, followed by ADMMSNL and PC.

Table 1 Comparison of six methods for Example 1 with \(m=4, \texttt {nf}=0.1\)
Table 2 Comparisons of six methods for Example 3 with \(m=10, \texttt {nf}=0.1\)

Now we test these methods on the irregular network of Example 3. The average results are recorded in Table 2. We note that this example was generated in the region \([0,1]\times [0,0.5]\), as shown in Fig. 2. This implies that no dissimilarities in \(\varDelta \) were missing when \(R=\sqrt{1.25}\), while a large number of dissimilarities were missing when \(R=0.1\). When \(R=\sqrt{1.25}\), SFSDP and EVEDM clearly failed to localize before the refinement step owing to their large RMSD (of the order of \(10^{-1}\)), whilst the other four methods succeeded. However, all methods achieved a similar rRMSD after the refinement, except for EVEDM in the case \(n=500\). Still, PREEEDM ran the fastest and ARAP came last (5.13 s vs. 2556.3 s when \(n=500\)). Their performances for the case \(R=0.1\) were quite different. PREEEDM generated the most accurate RMSD and rRMSD (of the order of \(10^{-3}\)), whilst the results of the other methods were only of the order of \(10^{-2}\). Obviously, ADMMSNL, PC and EVEDM failed to localize. Compared with the other methods, EVEDM and PREEEDM were joint winners in terms of computational speed, using only about 30 s when \(n=2000\) (a larger-scale network), though we should mention that EVEDM failed to localize.

Fig. 4 Average results for Example 2 with \(n=200, R=0.2\), nf\(=0.1\)

(b) Effect of the number of anchors m. As one would expect, more anchors provide more information and hence make localization easier. In this part, we examine the effect of varying the number of anchors on the 6 methods. For Example 2, we fix \(n=200, R=0.2\), nf\(=0.1\) and choose m from \(\{5,10,\ldots ,40\}\). As illustrated in Fig. 4, ARAP and PREEEDM were again joint winners in terms of RMSD and rRMSD, and the rRMSD produced by the other methods declined rapidly as more anchors were used. Moreover, PREEEDM was the fastest, followed by EVEDM, PC and SFSDP, whilst ADMMSNL and ARAP were quite slow.

Fig. 5 Localization for Example 3 with \(n=500, R=0.1\), nf\(=0.1\)

For Example 3 with fixed \(n=500, R=0.1\), nf\(=0.1\), we test \(m\in \{10,30,50\}\). As depicted in Fig. 5, ARAP and PREEEDM were always capable of capturing the shape of the letters ‘E’, ‘D’ and ‘M’ similar to Fig. 2. By contrast, SFSDP and EVEDM produced a desirable outline of the three letters only when \(m=50\), and the localization quality of both ADMMSNL and PC improved as m increased but still yielded a deformed letter ‘M’.

Table 3 Comparisons of six methods for Example 2 with \(R=0.2, \texttt {nf}=0.1\)

Finally, we test a number of instances of Example 2, choosing \(n\in \{300, 500, 1000, 2000\}\) and \(m \in \{10, 50\}\). The average results are recorded in Table 3. When \(m=10\), ADMMSNL and PC produced undesirable RMSD and rRMSD (both of the order of \(10^{-1}\)). SFSDP benefited greatly from the refinement since its RMSD before refinement was relatively inaccurate. By contrast, the other three methods localized successfully, except for EVEDM in the case \(n=300\). With regard to computational speed, EVEDM and PREEEDM were the fastest, followed by SFSDP, PC, ADMMSNL and ARAP. When \(m=50\), more information was available and the results improved, especially for ADMMSNL and PC; however, PC still relied heavily on the refinement step to obtain a satisfactory localization. The other five methods produced satisfactory localizations with varying degrees of accuracy. It is encouraging to see that PREEEDM produced the most accurate rRMSD in all cases. The comparison of computational speed is similar to the case \(m=10\). We repeated the test for Example 3 and the average results are recorded in Table 4, where we observed performance of the six methods similar to that for Example 2; we omit the details.

Table 4 Comparisons of six methods for Example 3 with \(R=0.1, \texttt {nf}=0.1\)

(c) Effect of the noise factor nf. To see how the performance of each method depends on the noise factor, we first test Example 3 with \(n=200, m=10, R=0.3\) fixed and the noise factor varying over \(\texttt {nf}\in \{0.1,0.2,\ldots ,0.7\}\). As shown in Fig. 6, in terms of RMSD, ARAP obtained the smallest values, whilst EVEDM and PC obtained the worst. The curve of ADMMSNL dropped over \(0.1\le \texttt {nf}\le 0.3\) and then ascended. By contrast, the curve of PREEEDM peaked at \(\texttt {nf}=0.3\), declined afterwards and gradually approached the RMSD of ARAP. However, after the refinement step, ARAP, SFSDP and PREEEDM all achieved a similar rRMSD, while the other three methods produced undesirable ones. EVEDM was indeed the fastest (yet with the worst rRMSD), followed by PC, SFSDP and PREEEDM. Again, ARAP and ADMMSNL were quite slow.

Fig. 6 Average results for Example 3 with \(n=200, m=10, R=0.3\)

Fig. 7 Localization for Example 2 with \(n=200, m=4, R=0.3\)

Next, we test Example 2 with a moderate size (for visualization purposes in Fig. 7), \(n=200, m=4\) and \(R=0.3\), and varying \(\texttt {nf}\in \{0.1,0.3,0.5\}\). The actual embedding produced by each method is shown in Fig. 7, where the four anchors are plotted as green squares and each estimate \(\widehat{\mathbf{x}}_i\) (pink point) is joined to its ground truth location (blue circle). It can be clearly seen that ARAP and PREEEDM were quite robust to the noise factor since their localizations matched the ground truth well. EVEDM failed to localize when \(\texttt {nf}=0.5\). By contrast, SFSDP generated worse results as nf got bigger, and ADMMSNL and PC failed to localize in all cases.

Finally, we test Example 1 with larger sizes \(n\in \{300, 500, 1000, 2000\}\) and fixed \(m=4, R=0.3\). The average results are recorded in Table 5. When \(\texttt {nf}=0.1\), ADMMSNL and PC failed to render accurate embeddings. Compared with ARAP, EVEDM and PREEEDM, SFSDP generated larger RMSD and rRMSD. Again, EVEDM and PREEEDM ran faster than ARAP. When \(\texttt {nf}=0.7\), the results were different. ARAP and PREEEDM were still able to produce high-quality RMSD and rRMSD; however, the former took an extremely long time (16617 s vs. 83 s). By contrast, ADMMSNL and PC again failed to reconstruct the network. Furthermore, EVEDM obtained large RMSD but comparable rRMSD when \(n\le 1000\), and it failed when \(n=2000\).

6.3.2 Comparison on MC

MC has long been an important application of EDM optimization [2, 21, 33]. We will test two types of MC problems, one from an artificial data set and one from real data in the Protein Data Bank (PDB) [4]. For the former, we adopt the data-generation rule from [2, 33]. For the latter, we use the real data of 12 molecules derived from 12 protein structures in the PDB: 1GM2, 304D, 1PBM, 2MSJ, 1AU6, 1LFB, 104D, 1PHT, 1POA, 1AX8, 1RGS, 2CLJ. They provide a good set of test problems in terms of the size n, which ranges from a few hundred to a few thousand (the smallest is \(n=166\) for 1GM2 and the largest is \(n=4189\) for 2CLJ). The distance information was obtained in a realistic way, as done in [24].

Example 4

(Artificial data) As described in [2, 33], the artificial molecule has \(n = s^3\) atoms \((\mathbf{x}_{1}, \ldots , \mathbf{x}_n)\) located in the three-dimensional lattice

$$\begin{aligned} \{(i_1, i_2, i_3) : i_1, i_2, i_3 =0,1,\ldots , s-1\} \end{aligned}$$

for some integer \(s \ge 1\), i.e., \(\mathbf{x}_{i}=(i_1, i_2, i_3)^T\). We define \(\mathcal {N}_{x}\), the index set on which the \(\delta _{ij}\) are available, as:

$$\begin{aligned} \mathcal {N}_{x}:= & {} \{(i,j): |p(\mathbf{x}_{i})-p(\mathbf{x}_{j})|\le R\} \end{aligned}$$
(58)

where \(p(\mathbf{x}_{i}):=1+(1,s,s^2)^T\mathbf{x}_{i}=1+i_1+si_2+s^2i_3\) and R is a given constant (e.g., \(R=s^2\)). The corresponding dissimilarity matrix \(\varDelta \), weight matrix W and the lower and upper bound matrices L and U are given in the table below. Here the generation of \(\delta _{ij}\) is the same as in Example 1.

$$\begin{aligned} \begin{array}{c|cccc} (i,j) &{} W_{ij} &{} \varDelta _{ij} &{} L_{ij} &{} U_{ij} \\ \hline i=j &{} 0 &{} 0 &{} 0 &{} 0 \\ (i,j)\in \mathcal {N}_{x} &{} 1 &{} \delta _{ij} &{} 1 &{} \max _{(i,j)\in \mathcal {N}_{x}} \Vert \mathbf{x}_{i} -\mathbf{x}_{j} \Vert ^2 \\ \hbox {otherwise} &{} 0 &{} 0 &{} 1 &{} 3(s-1)^2 \end{array} \end{aligned}$$
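
A small Python sketch of this artificial data set (the lattice points, the index map \(p(\cdot )\) and rule (58) with the multiplicative noise of Example 1) may help make the construction concrete; it is illustrative and not the authors' generator.

```python
import numpy as np
from itertools import product

def generate_example4(s, R=None, nf=0.1, seed=0):
    """Sketch of the artificial molecule: n = s^3 lattice atoms; the pair (i,j) is
    observed when |p(x_i) - p(x_j)| <= R (default R = s^2)."""
    rng = np.random.default_rng(seed)
    R = s**2 if R is None else R
    X = np.array(list(product(range(s), repeat=3)), dtype=float)   # all lattice points (i1,i2,i3)
    p = 1.0 + X @ np.array([1.0, s, s**2])                         # p(x) = 1 + i1 + s*i2 + s^2*i3
    n = s**3
    Delta = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            if abs(p[i] - p[j]) <= R:                              # rule (58)
                d = np.linalg.norm(X[i] - X[j]) * abs(1.0 + rng.standard_normal() * nf)
                Delta[i, j] = Delta[j, i] = d
    return X, Delta
```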

Example 5

(Real PDB data) Each molecule comprises n atoms \(\{\mathbf{x}_1,\ldots , \mathbf{x}_n\}\) in \(\mathfrak {R}^3\), and its distance information is collected as follows. If the Euclidean distance between two atoms is less than R, that distance is selected; otherwise no distance information about this pair is known. For example, \(R= 6{\AA }~(1{\AA } = 10^{-8}\)cm) is nearly the maximal distance that a nuclear magnetic resonance (NMR) experiment can measure between two atoms. For realistic molecular conformation problems, not all distances below R are known from NMR experiments, so one may obtain only \(c\%\) (e.g., \(c=50\%\)) of all the distances below R. Denote by \(\mathcal {N}_{x}\) the set of indices of those measured distances. Moreover, the distances in \(\mathcal {N}_{x}\) cannot be measured exactly. Instead, only lower bounds \(\ell _{ij}\) and upper bounds \(u_{ij}\) are provided, that is, for \((i,j)\in \mathcal {N}_{x},\)

$$\begin{aligned} \ell _{ij}=\max \left\{ 1, (1-|\epsilon _{ij}|) \Vert \mathbf{x}_i-\mathbf{x}_j\Vert \right\} ,~~~~ u_{ij}=(1+|\varepsilon _{ij}|) \Vert \mathbf{x}_i-\mathbf{x}_j\Vert , \end{aligned}$$

where \(\epsilon _{ij},\varepsilon _{ij} \sim N(0, \texttt {nf}^2 \times \pi /2)\) are independent normal random variables. In our test, we set the noise factor \(\texttt {nf}=0.1\), and the parameters \(W,\varDelta , L, U \in \mathcal{S}^n\) are given in the table below, where \(M>0\) is a large upper bound (e.g., \(M :=n\max _{ij}\varDelta _{ij}\)).

$$\begin{aligned} \begin{array}{c|cccc} (i,j) &{} W_{ij} &{} \varDelta _{ij} &{} L_{ij} &{} U_{ij} \\ \hline i=j &{} 0 &{} 0 &{} 0 &{} 0 \\ (i,j)\in \mathcal {N}_{x} &{} 1 &{} (\ell _{ij}+ u_{ij})/2 &{} \ell ^2_{ij} &{} u^2_{ij} \\ \hbox {otherwise} &{} 0 &{} 0 &{} 0 &{} M^2 \end{array} \end{aligned}$$
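
To illustrate, a minimal Python sketch of this interval data generation is given below, assuming the true atom positions are available as an \(n\times 3\) array X (as read from the PDB file) and c is given as a fraction (0.5 for 50%); the "otherwise" upper bounds \(M^2\) from the table are omitted for brevity, and the function name is ours.

```python
import numpy as np

def example5_bounds(X, R=6.0, c=0.5, nf=0.1, seed=0):
    """Sketch of Example 5: keep a fraction c of the pairs with distance below R
    and replace each kept distance by noisy lower/upper bounds as defined above."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W, Delta, L, U = (np.zeros((n, n)) for _ in range(4))
    for i in range(n):
        for j in range(i):
            dij = np.linalg.norm(X[i] - X[j])
            if dij < R and rng.random() < c:              # pair observed with probability c
                eps = rng.normal(0.0, nf * np.sqrt(np.pi / 2), size=2)
                l = max(1.0, (1 - abs(eps[0])) * dij)     # lower bound ell_ij
                u = (1 + abs(eps[1])) * dij               # upper bound u_ij
                W[i, j] = W[j, i] = 1.0
                Delta[i, j] = Delta[j, i] = (l + u) / 2   # Delta_ij = (ell_ij + u_ij)/2
                L[i, j] = L[j, i] = l**2                  # L_ij = ell_ij^2
                U[i, j] = U[j, i] = u**2                  # U_ij = u_ij^2
    return W, Delta, L, U
```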

As mentioned before, the current implementations of ADMMSNL and ARAP do not support embedding for \(r\ge 3\) and are thus excluded from the following comparison, in which the method PPAS is added instead. The main reason for adding PPAS is that it is particularly well suited to, and credible for, MC problems [24, 25].

Table 5 Comparisons of six methods for Example 1 with \(m=4, R=0.3\)

(d) Test on Example 4. To see the performance of each method on this problem, we first fix \(s=6 \; (n=6^3), \; \texttt {nf}=0.1\) and vary \(R\in \{36,38,\ldots ,48\}\). We note that the percentage of available dissimilarities increased from 32.47% to 39.87% as R increased from 36 to 48, making the conformation problem ‘easier’. The average results are recorded in Fig. 8. Clearly, PREEEDM and PPAS outperformed the other three methods in terms of RMSD and rRMSD. The former generated the best RMSD when \(R\ge 42\), while the latter obtained the best RMSD when \(R\le 42\), but both achieved similar rRMSD. As for computational speed, PREEEDM ran much faster than PPAS. By contrast, the other three methods failed to produce accurate embeddings owing to their poor RMSD and rRMSD. Notice that the refinement did not always improve the final results; for instance, the rRMSD yielded by SFSDP was bigger than its RMSD for each R.

We then test the example, fixing \(s=6\; (n=6^3), \; R=s^2\) and varying \(\texttt {nf}\in \{0.1,0.2,\ldots ,0.5\}\). As illustrated in Fig. 9, in terms of RMSD and rRMSD, PREEEDM and PPAS were clearly the joint winners. In particular, our method rendered the best RMSD when \(\texttt {nf}\ge 0.2\) and also ran much faster than PPAS. The other three methods again failed to obtain desirable RMSD and rRMSD, regardless of the time they used.

Fig. 8 Average results for Example 4 with \(s=6, \texttt {nf}=0.1\)

Fig. 9 Average results for Example 4 with \(s=6, R=s^2\)

Fig. 10 Average results for Example 4 with \(n=s^3, R=s^2, \texttt {nf}=0.1\)

Finally, for larger problems with \(n=s^3\) and \(s\in \{7,8,\ldots ,13\}\), the average results are presented in Fig. 10, where we omitted the results of PPAS for \(s>10\) because it took too much time to terminate. It is worth mentioning that the percentage of available dissimilarities among all elements of \(\varDelta \) decreases from \(26.78\%\) to \(14.83\%\) as s increases from 7 to 13, making the problems more and more challenging. Clearly, PC, SFSDP and EVEDM failed to locate all atoms in \(\mathfrak {R}^3\). PPAS rendered the most accurate RMSD when \(s\le 10\), whilst PREEEDM achieved the most accurate RMSD when \(s >10\) and the most accurate rRMSD in all cases. Equally important, PREEEDM spent less than 50 s on all tested cases, while PPAS took much more time to terminate (e.g., over 2000 s when \(s\ge 10\)).

Fig. 11 Molecular conformation by PC, SFSDP, PPAS, EVEDM and PREEEDM. Left: 1GM2 (\(n=166\)); middle: 1AU6 (\(n=506\)); right: 1LFB (\(n=641\))

(e) Test on Example 5. For the 12 collected real data sets, we fixed \(R = 6, c = 50\%\) and \(\texttt {nf}= 0.1\). The embeddings generated by the five methods for the three molecules 1GM2, 1AU6 and 1LFB are shown in Fig. 11, where the true and estimated positions of the atoms are plotted as blue circles and pink stars, respectively, and each pink star is linked to its corresponding blue circle by a pink line. For these three data sets, PREEEDM and PPAS almost recovered the shape of the original molecules, while the other three methods clearly failed to do so. The complete numerical results for the 12 problems are reported in Table 6. It can be clearly seen that PREEEDM and PPAS performed significantly better than the other methods in terms of RMSD and rRMSD. What is more impressive is that PREEEDM used only a small fraction of the time required by PPAS, which in general took a relatively long time to terminate. For example, PREEEDM used only 22.64 s for 2CLJ, a very large data set with \(n = 4189\). In contrast, we had to omit the result of PPAS for this instance (as well as for some other tested instances; the missing results are indicated by “–” in Table 6) because it took too long to terminate.

Table 6 Comparisons of the five methods for Example 5

6.4 Robustness of PREEEDM

The excellent performance of PREEEDM reported above is in fact due to its robustness to noise. The previous examples all used Gaussian noise. We now demonstrate that PREEEDM works much better than the other methods when the noise comes from a heavy-tailed distribution, for instance a t-distribution with a small number of degrees of freedom. We also take this opportunity to test our own solver SQREDM [56], which also makes use of penalty and majorization-minimization techniques, but for the least-squares problem (5). We will see that PREEEDM outperforms SQREDM for both types of noise (Gaussian and t distributions).

Fig. 12 Robustness of PREEEDM

To shorten the presentation, we restrict our numerical tests to two representative examples: Example 1 with \(n=100, R=0.3\) and Example 4 with \(s=5, R=s^2\). For each example, we generate 20 instances under two types of noise: from the standard normal distribution and from the Student-t distribution with 1 degree of freedom. We vary nf over \(\{0.1,0.2,\ldots ,0.9\}\) for the Gaussian noise and over \(\{0.01,0.02,\ldots ,0.09\}\) for the Student-t noise. The average RMSD values are recorded in Fig. 12. A sketch of the noise generation is given below, followed by our observations.
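
The following Python fragment sketches how the two kinds of multiplicative noise can be drawn; it is an illustrative snippet, not the authors' MATLAB code.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_delta(dist, nf, noise="gaussian"):
    """Apply multiplicative noise delta = dist * |1 + eps * nf|, with eps drawn either
    from the standard normal or from the heavy-tailed Student-t distribution (1 d.o.f.)."""
    if noise == "gaussian":
        eps = rng.standard_normal(dist.shape)
    else:
        eps = rng.standard_t(df=1, size=dist.shape)
    return dist * np.abs(1.0 + eps * nf)
```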

(i) PREEEDM is competitive under Gaussian noise. For Example 1, Fig. 12a shows that ARAP yielded the best RMSD, followed by PREEEDM and SQREDM. For Example 4, Fig. 12c shows that PREEEDM rendered the smallest RMSD in most cases, followed by PPAS and SQREDM (note that the current implementation of ARAP is only for \(r=2\) and hence is not applicable to this example).

In particular, when nf exceeds 0.5, PPAS and PREEEDM closely follow each other. The behaviour of these methods under Gaussian noise is expected, as the least-squares formulation is equivalent to the maximum-likelihood criterion. On the one hand, least squares favours large distances; on the other hand, under Gaussian (thin-tailed) noise the number of large distance errors is relatively small and hence does not cause significant distortion in locating the unknown sensors.

(ii) PREEEDM performs the best under heavy-tailed noise (from the Student \(t_1\) distribution). For Example 1, both PREEEDM and SQREDM behaved much better than the other methods; see Fig. 12b. For Example 4, PREEEDM stood out as the best method when nf exceeds 0.02 and was much better than SQREDM; see Fig. 12d. The test data now contain more large distance errors than under the Gaussian distribution, and the absolute-value criterion alleviates the tendency to favour large distances. Therefore, PREEEDM yielded the best performance in such situations.

We conclude that PREEEDM, based on the model (3), is robust to noise on these two examples.

7 Conclusion

The purpose of this paper was to develop an efficient method for one of the most challenging distance embedding problems in a low-dimensional space, a problem that has been widely studied under the framework of multi-dimensional scaling. The problem employs the \(\ell _1\) norm to quantify the embedding errors. Hence, the resulting model (3) appears to be robust to outliers and is referred to as the robust Euclidean embedding (REE) model.

To the best of the authors' knowledge, the only viable method for REE whose MATLAB code is publicly available is the PlaceCenter (PC) algorithm proposed in [1]. Our extensive numerical results on the SNL and MC test problems convincingly demonstrate that the proposed PREEEDM method outperforms PC in terms of both embedding quality and CPU time. Moreover, PREEEDM is comparable to several state-of-the-art methods for other embedding models in terms of embedding quality, but is far more efficient in terms of CPU time. The advantage becomes even more pronounced as the size of the problem grows.

The novelty of the proposed PREEEDM lies in its creative use of the Euclidean distance matrix and a computationally efficient majorization technique to derive its subproblem, which has a closed-form solution closely related to the positive root of the classical depressed cubic equation. Furthermore, a great deal of effort has been devoted to its convergence analysis, which justifies the observed numerical performance of PREEEDM. We believe that PREEEDM will become a very competitive embedding method in the fields of SNL and MC, and we expect it to find wide use in other visualization problems.