1 Introduction

Maximum Variance Unfolding (MVU) and its variants, such as Colored MVU (CMVU), are among the first methods in nonlinear Dimensionality Reduction (DR) for data visualization and classification (see Weinberger and Saul (2006); Song et al. (2007); Cox and Ferry (1993); Kim and Lee (2014)). MVU provides an attractive framework that preserves the local structure of data based on its pairwise distances and enjoys an excellent mathematical theory. In terms of sample theory, it is convergent and consistent under reasonable assumptions; see Arias-Castro and Pelletier (2013). In terms of optimization theory, it has a close link to duality in Semi-Definite Programming (SDP); see Sun et al. (2006). In terms of modelling, it can incorporate side information of data (e.g., labels) into its framework, as done in CMVU via the Hilbert-Schmidt independence criterion. Hence it has great potential for quality DR. In terms of applications, however, it is less popular than other leading DR methods such as ISOMAP by Tenenbaum et al. (2000), linear discriminant analysis (LDA) (Howland and Park, 2004) and t-SNE by Van der Maaten and Hinton (2008).

The primary purpose of this paper is to improve the numerical applicability of MVU/CMVU. We achieve this by following the modelling principle of MVU and proposing a new model whose main variable is the Euclidean Distance Matrix (EDM). The resulting model can be efficiently solved by EDM optimization and appears to be very promising in competing against t-SNE in visualization. Compared with MVU variants, which often rely on off-the-shelf SDP solvers, our EDM optimization is a stand-alone solver with a closed-form formula at each iteration and is easy to implement. Below we justify our proposal by conducting a critical analysis of MVU and CMVU.

1.1 MVU and CMVU

Suppose there are n items that are collected from a high-dimensional space. A certain type of dissimilarity (e.g., distance) may be computed for some pairs of items. Let \(\delta _{ij}\) denote such a dissimilarity measurement between item i and item j, and let \({\mathcal {N}}\) be the collection of such pairs. The purpose is to embed those items as n points \(\{\textbf{x}_i\}_{i=1}^n\) in a low-dimensional space \({\mathbb {R}}^r\) such that the Euclidean distance between \(\textbf{x}_i\) and \(\textbf{x}_j\) approximates \(\delta _{ij}\):

$$\begin{aligned} \Vert \textbf{x}_i - \textbf{x}_j \Vert \approx \delta _{ij} \ \ \forall \ \ (i, j) \in {\mathcal {N}}. \end{aligned}$$
(1)

A key principle in MVU is that it favours the embedding with high variance, making the quantity

$$\begin{aligned} \sigma _X^2:= \frac{1}{2n} \sum _{i, j=1}^n \Vert \textbf{x}_i - \textbf{x}_j \Vert ^2 \end{aligned}$$

as large as possible. Here "\(:=\)" means "define". Under the assumption that the embedding points are centred, i.e., \(\textbf{x}_1 + \cdots + \textbf{x}_n = 0\), the variance becomes \(\sigma _X^2 = \sum _{i=1}^n \Vert \textbf{x}_i \Vert ^2\). Therefore, MVU aims to achieve (i) preserving the local distances in (1), and (ii) maximizing the variance \(\sigma _X^2\):

$$\begin{aligned} \max _{\textbf{x}_i} \ \sum _{i=1}^n \Vert \textbf{x}_i \Vert ^2 - \nu \underbrace{\sum _{(i,j) \in {\mathcal {N}}} \Big ( \Vert \textbf{x}_i - \textbf{x}_j \Vert ^2 - \delta _{ij}^2 \Big )^2}_{=: \sigma _{SS}(X)}, \end{aligned}$$
(2)

where \(\nu >0\) is a balance parameter between the two aims. The loss function \(\sigma _{SS}(X)\) is known as the Squared-Stress in Multi-Dimensional Scaling (MDS), see Chapter 11 of Borg and Groenen (2005).

The benefit of employing \(\sigma _{SS}(X)\) is that the squared distance \(\Vert \textbf{x}_i - \textbf{x}_j \Vert ^2\) has a linear representation in terms of a kernel matrix:

$$\begin{aligned} \Vert \textbf{x}_i - \textbf{x}_j \Vert ^2 = \Vert \textbf{x}_i \Vert ^2 + \Vert \textbf{x}_j \Vert ^2 - 2 \langle \textbf{x}_i, \textbf{x}_j \rangle = K_{ii} + K_{jj} - 2 K_{ij}, \end{aligned}$$
(3)

where the kernel matrix K is defined by \(K_{ij} = \langle \textbf{x}_i, \textbf{x}_j \rangle\) with \(\langle \cdot , \cdot \rangle\) being the standard dot product in \({\mathbb {R}}^r\). Consequently, (2) can be represented as an SDP (after dropping the hidden rank constraint \(\text{ rank }(K) = r\)):

$$\begin{aligned} \text{(MVU) }\quad \begin{array}{rl} \displaystyle \max _{K} &{} \text{ Tr }(K) - \nu \sum _{(i,j) \in {\mathcal {N}}} \Big ( K_{ii} + K_{jj} - 2K_{ij} - \delta _{ij}^2 \Big )^2\\ \text{ s.t. } &{} K \textbf{1}_n = 0, \ K \succeq 0. \end{array} \end{aligned}$$
(4)

where \(\text{ Tr }(K)\) is the trace of K, \(\textbf{1}_n\) is the (column) vector of all ones, \(K \succeq 0\) means that K is positive semidefinite (i.e., a kernel matrix), and the constraint \(K \textbf{1}_n =0\) is the centring condition.
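As a quick sanity check of the identity (3) and of the constraints in (4), the following Python/NumPy snippet (our own illustration with randomly generated centred points; here the rows of X are the points \(\textbf{x}_i\)) verifies them numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))          # 5 points in R^3 (rows are x_i)
X -= X.mean(axis=0)                      # centre the points: x_1 + ... + x_n = 0
K = X @ X.T                              # kernel matrix, K_ij = <x_i, x_j>

# Identity (3): ||x_i - x_j||^2 = K_ii + K_jj - 2 K_ij
i, j = 1, 3
assert np.isclose(np.linalg.norm(X[i] - X[j]) ** 2, K[i, i] + K[j, j] - 2 * K[i, j])

# Constraints in (4): K is centred (K 1_n = 0) and Tr(K) equals the variance term
assert np.allclose(K @ np.ones(5), 0.0)
assert np.isclose(np.trace(K), np.sum(np.linalg.norm(X, axis=1) ** 2))
```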

Examining MVU from the perspective of maximizing the Hilbert-Schmidt independence criterion (HSIC), Song et al. (2007) cast it as maximizing the alignment between the kernel matrix K and the covariance matrix of completely independent items, which is the identity matrix I, i.e., \(\text{ Tr }(K) = \text{ Tr }(KI)\). They further generalized this view by replacing I with any positive semidefinite kernel matrix L (Lin et al., 2010), leading to the Colored MVU [Song et al. (2007) Lemma 1]:

$$\begin{aligned} \text{(CMVU) } \quad \begin{array}{rl} \displaystyle \max _{K} &{} Tr(JKJL) - \nu \sum _{(i,j) \in {\mathcal {N}}} \Big ( K_{ii} + K_{jj} - 2K_{ij} - \delta _{ij}^2 \Big )^2\\ \text{ s.t. } &{} K \succeq 0. \end{array} \end{aligned}$$
(5)

where the matrix \(J:= I - (1/n) \textbf{1}_n \textbf{1}_n^T\) is the centering matrix. It is known that kernels can represent some prior information of data, see Song et al. (2012). We are ready to make some critical comments that motivate our model.

  1. (i)

    On the squared-stress loss function. An important concept in embedding theory is usability, proposed by De Leeuw (1988). A set of embedding points \(\{\textbf{x}_i\}\) is said to be usable if

    $$\begin{aligned} \Vert \textbf{x}_i - \textbf{x}_j \Vert> 0 \quad \text{ if } \ \ \delta _{ij} > 0. \end{aligned}$$

    This prevents neighbouring points from collapsing onto a single point (the degenerate embedding). CMVU does not have the usability property because of the squared-stress \(\sigma _{SS}(X)\) being used. This means that CMVU may produce embeddings with the "crowding phenomenon" observed for other embedding methods; see Qi and Yuan (2014).

  2. (ii)

    On the neighbourhood size \(|{\mathcal {N}}|\). Both MVU and CMVU rely on off-the-shelf SDP solvers, whose computational complexity depends not only on the number of embedding points n but also on the neighbourhood size \(|{\mathcal {N}}|\). A typical size of about 1% of n is often recommended; higher values may significantly slow down standard SDP solvers when n is of the order of thousands. Compared with ISOMAP and t-SNE, where all pairwise distances are used (i.e., \(|{\mathcal {N}}|= n(n-1)/2\)), MVU and CMVU are not able to take advantage of the many distances that may be available among the data points, even when n is only of the order of a few hundred.

  3. (iii)

    On the rank constraint \(\text{ rank }(K) \le r\). The rank constraint is the reason why the SDP reformulations in (4) and (5) are relaxations. The graph Laplacian regularization of Weinberger et al. (2007) alleviates this issue by approximating the kernel matrix through the bottom eigenvectors of the graph Laplacian matrix (Yan et al., 2006). This approximation is also used in CMVU. As we will see in our numerical experiments, the quality of the embedding can be severely compromised.

1.2 Supervised MVU via EDM optimization

To overcome the weakness of MVU/CMVU, we propose a new variant that follows the maximum variance principle and directly handles the computational issues discussed above. The key technical contributions are as follows.

  1. (a)

    Using the stress loss function. To tackle the usability issue of the squared-stress loss function, we replace it with the stress loss function:

    $$\begin{aligned} \sigma _r (X):= \sum _{(i,j) \in {\mathcal {N}}} \Big ( \Vert \textbf{x}_i - \textbf{x}_j \Vert - \delta _{ij} \Big )^2, \end{aligned}$$

    where the Euclidean distance \(\Vert \textbf{x}_i - \textbf{x}_j \Vert\) is used rather than its squared form. It follows from (3) that the CMVU model (5), with the stress loss in place of the squared stress, takes the following vector form:

    $$\begin{aligned} \max _{X} \ -\sum _{i,j=1}^n S_{ij} \Vert \textbf{x}_i - \textbf{x}_j \Vert ^2 - \nu \sigma _r(X), \qquad \text{ where } \qquad S:= JLJ/2. \end{aligned}$$
    (6)

    Its SDP formulation becomes

    $$\begin{aligned} \max _{K} \ \text{ Tr }(JKJL) - \nu \sum _{(i,j) \in {\mathcal {N}}} \Big ( \sqrt{K_{ii} + K_{jj} - 2K_{ij}} - \delta _{ij} \Big )^2, \quad \text{ s.t. } \ \ K \succeq 0. \end{aligned}$$

    The square-root operation would make this SDP extremely difficult to solve. The major reason for using the stress loss function is that the resulting model (6) enjoys the usability property, as proved in Proposition 1 below.

  2. (b)

    Reformulation as EDM optimization. To tackle the neighbourhood size issue, we use the Euclidean Distance Matrix (EDM) to reformulate (6). Let D be the \(n \times n\) EDM defined by \(D_{ij}:= \Vert \textbf{x}_i - \textbf{x}_j \Vert ^2\), \(i, j=1, \ldots , n\). By the theory of classical MDS [Cox and Cox (1991) Chp. 2], it holds that

    $$\begin{aligned} - \frac{1}{2} JDJ = X^T X = K, \ K_{ii} + K_{jj} - 2 K_{ij} = D_{ij} \ \text{ and } \ \text{ rank }(JDJ) = r. \end{aligned}$$
    (7)

    Therefore, the EDM formulation of (6) is

    $$\begin{aligned} \max _{D} \ - \text{ Tr }(DS) - \nu \sum _{(i,j) \in {\mathcal {N}}} \Big ( \sqrt{D_{ij}} - \delta _{ij} \Big )^2, \quad \text{ s.t. } \ \ D \ \text{ is } \text{ EDM }. \end{aligned}$$
    (8)

    It is important to note that the proximal operator of the objective function in (8) can be computed in closed form, which keeps the computational cost low even when the neighbourhood \({\mathcal {N}}\) is large.

  3. (c)

    Exact penalization of the rank constraint. With the rank constraint \(\text{ rank }(JDJ) \le r\), the EDM problem (8) is equivalent to our proposed MVU model (6). Penalty methods for dealing with rank constraints on various types of matrices have been studied recently; see Li and Qi (2011); Miao et al. (2016); Ding and Qi (2017); Zhou et al. (2018); Li et al. (2018); Sagan and Mitchell (2021); Le Thi et al. (2015). In this paper, we propose an exact penalty for \(\text{ rank }(JDJ) \le r\). The exact penalty is a distance function whose majorization can be cheaply constructed, leading to an efficient implementation of the overall EDM algorithm. It is noted that the squared distance was used as a penalty in Keys et al. (2019), but it is an inexact penalty.

  4. (d)

    Adding box constraints. The algorithmic framework developed here allows us to handle simple constraints that enforce lower and upper distance bounds between any pair of embedding points \(\textbf{x}_i\) and \(\textbf{x}_j\) without incurring any extra computational cost. This feature is very convenient in modelling. For example, we may know in advance that some points have fixed distances or that their distances should stay below a certain threshold (e.g., \(\Vert \textbf{x}_i - \textbf{x}_j \Vert \le b_{ij}\)). However, adding many such constraints to CMVU (5) may significantly slow down the SDP solver involved.

To summarize, the complete model we study in this paper is stated below in terms of minimization (rather than maximization):

$$\begin{aligned} \text{(SMVU) } \ \ \left\{ \begin{array}{lll} \min _D &{} f(D):= \text{ Tr }(DS) &{}+\ \ \nu \sum _{i, j=1}^n W_{ij} \Big ( \sqrt{D_{ij}} - \delta _{ij} \Big )^2 \\ \text{ s.t. } &{} D \ \ \text{ is } \text{ EDM } &{} \text{(EDM } \text{ constraint) } \\ &{} \text{ rank }(JDJ) \le r &{} \text{(rank } \text{ constraint) } \\ {} &{} A \le D \le B, &{} \text{(box } \text{ bounds) } \end{array} \right. \end{aligned}$$
(9)

where \(S:= JLJ/2\), and \(W_{ij} \ge 0\) are weights reflecting the importance of the corresponding data (e.g., \(W_{ij} =1\) for \((i, j) \in {\mathcal {N}}\) and \(=0\) otherwise). We also include the simple lower bound \(A_{ij}\) and upper bound \(B_{ij}\) for each \(D_{ij}\). If no such information is available, we may simply set \(A_{ij}=0\) and \(B_{ij} = \infty\). After getting the optimal D from (9), we use the decomposition in (7) to get the embedding points \(\textbf{x}_i\). Since we allow the prior information L to be used, we refer to our model (9) as the Supervised MVU with EDM (SMVU).
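This last step is classical MDS. A minimal Python/NumPy sketch of it (the function name and the rows-as-points convention are ours; the paper stores the points as columns of X) could look as follows:

```python
import numpy as np

def embed_from_edm(D, r):
    """Recover an r-dimensional embedding from a squared Euclidean distance
    matrix D via the classical MDS decomposition in (7)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centring matrix J = I - (1/n) 11^T
    K = -0.5 * J @ D @ J                      # Gram/kernel matrix, K = -JDJ/2
    vals, vecs = np.linalg.eigh(K)
    idx = np.argsort(vals)[::-1][:r]          # r leading eigenpairs of K
    lam = np.maximum(vals[idx], 0.0)          # guard against round-off negatives
    return vecs[:, idx] * np.sqrt(lam)        # row i is the embedding point x_i
```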

The rest of the paper consolidates the claims made in (a)-(d) above and develops a fast solution method for SMVU (9), which, at first glance, appears to be a formidable problem to solve. The paper is organized as follows. Section 2 includes the usability result of SMVU. The exact penalty approach is developed in Sect. 3. Extensive numerical tests are reported in Sect. 4. We conclude the paper in Sect. 5. Some of the proofs are contained in the Appendix.

2 Usability of SMVU model

This section shows that the SMVU model enjoys the usability property, which the CMVU model may fail to enjoy. Ignoring the box constraint in (9), the SMVU model has an equivalent vector form:

$$\begin{aligned} \min _X \Phi (X):= \underbrace{\sum _{i,j=1}^n S_{ij} \Vert \textbf{x}_i - \textbf{x}_j \Vert ^2}_{=: F(X)} \ + \ \nu \underbrace{\sum _{i,j=1}^n W_{ij} \Big ( \Vert \textbf{x}_i - \textbf{x}_j \Vert - \delta _{ij} \Big )^2}_{:= G(X)}, \end{aligned}$$
(10)

with \(X:= [\textbf{x}_1, \ldots , \textbf{x}_n]\) and \(\textbf{x}_i \in {\mathbb {R}}^r\). We have the following result.

Proposition 1

Suppose the weight matrix W satisfies the following assumption:

$$\begin{aligned} W_{ij}> 0 \quad \text{ if } \ \ \delta _{ij} > 0. \end{aligned}$$

Let \(X^*\) be a local minimum of (10). Then \(X^*\) is usable.

Proof

We first note that although \(\Phi (X)\) may not be differentiable at some points, it is always directionally differentiable. That is, \(\Phi '(X^{*};\, Z)\) exists for any \(Z \in {\mathbb {R}}^{r \times n}\). Since \(X^{*}\) is a local minimum, we must have

$$\begin{aligned} \Phi '(X^{*};\, Z) \ge 0 \qquad \text{ for } \text{ all } \ \ Z \in {\mathbb {R}}^{r \times n}. \end{aligned}$$
(11)

Define

$$\begin{aligned} \begin{array}{l} p_{ij}(X^{*}) := \left\{ \begin{array}{ll} 1/\Vert \textbf{x}_i^* - \textbf{x}_j^* \Vert &{} \quad \text{ if } \ \Vert \textbf{x}_i^* - \textbf{x}_j^* \Vert> 0 \\ 0 &{} \quad \text{ otherwise }, \end{array} \right. \\ q_{ij}(X^*):= \left\{ \begin{array}{ll} 0 &{} \quad \text{ if } \ \Vert \textbf{x}_i^* - \textbf{x}_j^* \Vert > 0 \\ 1 &{} \quad \text{ otherwise }. \end{array} \right. \end{array} \end{aligned}$$

We now calculate the directional derivatives \(F'(X^{*};\, Z)\) and \(G'(X^{*};\, Z)\) as follows:

$$\begin{aligned} F'(X^{*};\, Z) = 2 \sum _{i,j =1}^n S_{ij} \langle \textbf{x}^*_i - \textbf{x}^*_j, \ \textbf{z}_i - \textbf{z}_j \rangle . \end{aligned}$$

and (see also [De Leeuw (1984) Eq. 7])

$$\begin{aligned} G'(X^{*};\, Z) &= {} 2 \sum _{i,j=1}^n W_{ij} \langle \textbf{x}^*_i - \textbf{x}^*_j, \ \textbf{z}_i - \textbf{z}_j \rangle \\{} &\quad {} - 2 \sum _{i,j=1}^n W_{ij} \delta _{ij} p_{ij} \langle \textbf{x}^*_i - \textbf{x}^*_j, \ \textbf{z}_i - \textbf{z}_j \rangle \\{} &\quad - 2 \sum _{i,j=1}^n W_{ij} \delta _{ij} q_{ij} \Vert \textbf{z}_i - \textbf{z}_j \Vert . \end{aligned}$$

It then follows from the necessary condition (11) that for any \(Z \in {\mathbb {R}}^{r \times n}\)

$$\begin{aligned} 0 &\le {} \Phi '(X^{*};\, Z) + \Phi '(X^{*};\, -Z) \\ &= F'(X^{*};\, Z) + F'(X^{*};\, -Z) + \nu G'(X^{*};\, Z) + \nu G'(X^{*};\, -Z) \\ &= {} - 4 \nu \sum _{i,j=1}^n W_{ij} \delta _{ij} q_{ij} \Vert \textbf{z}_i - \textbf{z}_j \Vert , \end{aligned}$$

which implies (because \(W_{ij} \delta _{ij} q_{ij} \ge 0\) for all ij)

$$\begin{aligned} W_{ij} \delta _{ij} q_{ij} = 0 \quad \forall \ i, j=1, \ldots , n. \end{aligned}$$

By our assumption on the weight matrix W and the definition of \(q_{ij}\), we have

$$\begin{aligned} \delta _{ij}>0 \ \Longrightarrow \ W_{ij}> 0 \ \Longrightarrow \ W_{ij}\delta _{ij}> 0 \ \Longrightarrow \ q_{ij} = 0 \ \Longrightarrow \ \Vert \textbf{x}^*_i - \textbf{x}^*_j \Vert > 0. \end{aligned}$$

This is just the usability property of \(X^*\). \(\square\)

Example 1

Suppose we have four items that belong to the same class and that all pairwise dissimilarities are given by

$$\begin{aligned} \delta _{12} = 2, \ \delta _{13} = \delta _{14} = \delta _{23} = \delta _{24} = 1, \ \delta _{34} = 0.1. \end{aligned}$$

This means that \(JLJ = 0\) when L is taken to be the label matrix \(L_{ij} = 1\) for all i, j. This further implies that the CMVU model reduces to minimizing the squared-stress (S-Stress) criterion \(\sigma _S^2(X)\). The purpose is to find four one-dimensional embedding points \(X=[ x_1, \; \ldots , \; x_4]\) with \(x_i \in {\mathbb {R}}\). Let \(X^*\) be given by \(x^*_1 = -1\), \(x^*_2 =1\) and \(x^*_3=x^*_4 =0\). It can be verified that the gradient and the Hessian of \(\sigma _S^2(X)\) at \(X^*\) are

$$\begin{aligned} \nabla \sigma ^2_S(X^*) = 0 \qquad \text{ and } \qquad \nabla ^2 \sigma ^2_S (X^*) = \left( \begin{array}{llll} 48 &{}\quad -\,32 &{}\quad -\,8 &{}\quad -\,8 \\ -\,32 &{}\quad 48 &{}\quad -\,8 &{}\quad -\,8 \\ -\,8 &{}\quad -\,8 &{}\quad 15 &{}\quad 1 \\ -\,8 &{}\quad -\,8 &{}\quad 1 &{}\quad 15 \end{array} \right) \succeq 0, \end{aligned}$$

Hence, \(X^*\) is a local minimum of \(\sigma _S^2(X)\), but it has \(|x_3^* - x_4^* |= 0\) even though \(\delta _{34} = 0.1 > 0\). Consequently, the usability property does not hold for the S-Stress criterion, which implies that CMVU may not have the usability property.
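The claims in this example can also be checked numerically. The following Python/NumPy sketch (our own illustration) forms the gradient and Hessian of the S-Stress entrywise from their analytic formulas at \(X^*\) and confirms stationarity, positive semidefiniteness of the Hessian and the collapse \(x_3^* = x_4^*\):

```python
import numpy as np

delta = np.array([[0.0, 2.0, 1.0, 1.0],
                  [2.0, 0.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0, 0.1],
                  [1.0, 1.0, 0.1, 0.0]])
x = np.array([-1.0, 1.0, 0.0, 0.0])          # candidate local minimum X*

n = len(x)
grad = np.zeros(n)
hess = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        d = x[i] - x[j]
        grad[i] += 4.0 * (d ** 2 - delta[i, j] ** 2) * d      # d sigma / d x_i
        h = 12.0 * d ** 2 - 4.0 * delta[i, j] ** 2            # |d^2 sigma / d x_i d x_j|
        hess[i, j] = -h
        hess[i, i] += h

print(np.allclose(grad, 0.0))                     # True: X* is a stationary point
print(np.linalg.eigvalsh(hess).min() >= -1e-8)    # True: the Hessian is PSD
print(abs(x[2] - x[3]))                           # 0.0, although delta_34 = 0.1 > 0
```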

3 EDM optimization: exact penalty approach

In this part, we develop an efficient computational algorithm for the SMVU model (9). With the exact penalty approach, we will be able to design a majorization–minimization scheme. Each step of this scheme will have a closed-form solution, making it possible to tackle large data sets. Our first question is which constraints are to be penalized.

3.1 Understanding the constraints

There are three types of constraints in (9). We analyze them one by one to see where the computational difficulty lies.

  1. (a)

    EDM constraint An important characterization of an \(n \times n\) EDM D is due to Schoenberg (1938):

    $$\begin{aligned} D \ \text{ is } \text{ EDM } \qquad \Longleftrightarrow \qquad (-D) \in \mathcal {K}^n_+, \quad \text{ diag }(D) = 0, \end{aligned}$$

    where \(\mathcal {K}^n_+\) is the conditionally positive semidefinite cone:

    $$\begin{aligned} \mathcal {K}^n_+= & {} \left\{ A \ \Big |\ A \ \text{ is } \ n \times n \ \ \text{ symmetric } \text{ matrix }, \ \ \textbf{x}^T A \textbf{x}\ge 0,\ \ \forall \textbf{x}, \ \sum \nolimits _{i = 1}^{n} x_i = 0 \right\} \\= & {} \left\{ A \ \Big |\ JAJ \succeq 0 \right\} . \end{aligned}$$

    In other words, the EDM constraint is a spectral conic constraint intersected with the linear space \(\text{ diag }(D) =0\). We note that \(\mathcal {K}^n_+\) is a spectral cone because its description involves eigenvalues. It plays the same role as the constraint \(K \succeq 0\) in CMVU (5). Hence, the EDM constraint is a convex constraint.

  2. (b)

    Rank constraint It is the most difficult part to deal with. There are a few ways to manage this constraint. For example, Qi and Yuan (2014) represent this constraint by the equation:

    $$\begin{aligned} p(D):= \text{ Tr }(-JDJ) - \sum _{i=1}^r \lambda _i(-JDJ) = 0, \end{aligned}$$

    where \(\lambda _i\) is the ith largest eigenvalue of \((-JDJ)\). Since \((-JDJ) \succeq 0\), we have \(p(D) \ge 0\) for every EDM D. The function p(D) is added to the objective as a penalty. The drawback of this approach is that we still need to keep the EDM constraint, whose spectral conic part is hard to handle when n is large. Moreover, the penalty is inexact. A new idea introduced in Zhou et al. (2018) is to collect all of the difficult constraints together through regrouping.

  3. (c)

    Regrouping the constraints We collect the rank constraint and the spectral conic constraint together to define a new object called the conditionally positive semidefinite cone with rank-r cut

    $$\begin{aligned} \mathcal {K}^n_+(r):= \left\{ D \in \mathcal {S}^n \ \Big |\ D \in \mathcal {K}^n_+, \ \text{ rank }(JDJ) \le r \right\} . \end{aligned}$$

    The linear constraint \(\text{ diag }(D) =0\) is absorbed into the box constraint by letting \(A_{ii} = B_{ii} = 0\) for \(i=1, \ldots , n\). Through this regrouping, the constraints in SMVU (9) are represented as \(-D \in \mathcal {K}^n_+(r)\) and \(A \le D \le B.\) The problem itself becomes

    $$\begin{aligned} \min \ f(D), \quad \text{ s.t. } \ \ -D \in \mathcal {K}^n_+(r) \qquad \text{ and } \qquad A \le D \le B. \end{aligned}$$
    (12)
  4. (d)

    Dealing with \(\mathcal {K}^n_+(r)\). For a given matrix D, we compute the Euclidean distance from \((-D)\) to \(\mathcal {K}^n_+(r)\):

    $$\begin{aligned} g(D):= \text{ dist } ( -D, \ \mathcal {K}^n_+(r)) = \Vert - D - \Pi _{\mathcal {K}^n_+(r)} (-D) \Vert = \Vert D + \Pi _{\mathcal {K}^n_+(r)} (-D) \Vert , \end{aligned}$$
    (13)

    where \(\Pi _{\mathcal {K}^n_+(r)} (-D)\) is a projection of \((-D)\) onto \(\mathcal {K}^n_+(r)\):

    $$\begin{aligned} \Pi _{\mathcal {K}^n_+(r)} (-D) \in \arg \min _{Z} \left\{ \Vert -D - Z\Vert ^2 \ \Big |\ \ Z \in \mathcal {K}^n_+(r) \right\} . \end{aligned}$$

    We postpone the computation of this projection to Sect. 3.3. Through the distance function, we know that

    $$\begin{aligned} -D \in \mathcal {K}^n_+(r) \qquad \text{ if } \text{ and } \text{ only } \text{ if } \qquad g(D) = 0. \end{aligned}$$

3.2 Exact penalty approach

Based on the development above, the SMVU model (9) can be reformulated as

$$\begin{aligned} \min _D \ f(D) \quad \text{ s.t } \quad g(D) = 0 \quad \text{ and } \quad A \le D \le B. \end{aligned}$$
(14)

We assume that there exists a lower bound \(c>0\) such that \(A_{ij} \ge c\) for all \(i \not = j\). This means that we do not want any pair \(\textbf{x}_i\) and \(\textbf{x}_j\) to be embedded at the same point. We do not need to know the value of c; it is a purely theoretical assumption ensuring that there exists \(\kappa >0\) such that the objective function f(D) satisfies the Lipschitz condition:

$$\begin{aligned} |f(D) - f(\overline{D}) |\le \kappa \Vert D - \overline{D}\Vert , \end{aligned}$$

where \(\kappa >0\) is called the Lipschitz constant of \(f(\cdot )\). Under this Lipschitz condition, Prop. 2.4.3 of Clarke (1990) implies that any local/global minimum of (14) is also a local/global minimum of the following problem:

$$\begin{aligned} \min _D f_\rho (D):= \ f(D) + \rho g(D) \quad \text{ s.t } \quad A \le D \le B, \end{aligned}$$
(15)

where \(\rho \ge \kappa\) is a penalty parameter. In other words, problem (15) is an exact penalization of the original problem because the distance function g(D) is used. Previously, the squared distance function \(g^2(D)\) has been used in other settings, resulting in an inexact penalty problem whose penalty parameter \(\rho\) has to be driven to infinity for the penalized problem to approximate the original one well at local minima; see Zhou et al. (2018) for some examples. The remainder of this section is devoted to solving the penalty problem (15).

It is useful to note that the original objective function f(D) has a separable structure:

$$\begin{aligned} f(D) = \sum _{i,j=1}^n \; \underbrace{(S_{ij}+\nu W_{ij}) D_{ij} - 2\nu \delta _{ij} W_{ij} \sqrt{D_{ij}} + \nu W_{ij} \delta _{ij}^2}_{=: f_{ij}(D_{ij}) }. \end{aligned}$$
(16)

This motivates us to approximate g(D) by a separable function as well, which leads to \(n(n-1)/2\) one-dimensional optimization problems, each of which has a closed-form solution. The approximation is obtained through majorization.

  1. (a)

    Majorization of g(D). The idea behind the majorization technique is simple. Suppose we have a function \(\phi (\textbf{x}): {\mathbb {R}}^p \mapsto {\mathbb {R}}\), which is hard to minimize. Let \(\textbf{x}^k\) be the current guess of a minimum of \(\phi (\textbf{x})\). We would like to construct a simpler majorization function at \(\textbf{x}^k\), \(\phi ^{(m)}(\cdot ; \textbf{x}^k): {\mathbb {R}}^p \mapsto {\mathbb {R}}\), satisfying the majorization property:

    $$\begin{aligned} \phi ^{(m)} (\textbf{x}; \textbf{x}^k) \ge \phi (\textbf{x}), \ \ \forall \ \textbf{x}\in {\mathbb {R}}^p \qquad \text{ and } \qquad \phi ^{(m)}(\textbf{x}^k; \textbf{x}^k) = \phi (\textbf{x}^k). \end{aligned}$$
    (17)

    We then minimize the majorization function:

    $$\begin{aligned} \textbf{x}^{k+1} \in \arg \min _{\textbf{x}} \; \phi ^{(m)} (\textbf{x}; \textbf{x}^k). \end{aligned}$$
    (18)

    Therefore, we have

    $$\begin{aligned} \phi (\textbf{x}^{k+1}) {\mathop {\le }\limits ^{(17)}} \phi ^{(m)} (\textbf{x}^{k+1}; \textbf{x}^k) {\mathop {\le }\limits ^{(18)}} \phi ^{(m)} (\textbf{x}^{k}; \textbf{x}^k) {\mathop {=}\limits ^{(17)}} \phi (\textbf{x}^k). \end{aligned}$$

    In this way, a non-increasing sequence \(\{ \phi (\textbf{x}^k)\}\) is generated, and it must converge if it is bounded from below. This procedure of majorization-minimization has become very popular in machine learning algorithms; see Sun et al. (2016) for an extensive literature review on the topic. We now define a majorization function for g(D) at a given point \(Z \in \mathcal {S}^n\):

    $$\begin{aligned} g_1^{(m)}(D; Z):= \left\{ \begin{array}{ll} \Vert D - Z \Vert _1 &{}\quad \text{ if } \ g(Z) = 0 \\ \frac{g^2(D)}{2g(Z)} + \frac{1}{2} g(Z) &{}\quad \text{ if } \ g(Z) \not = 0. \end{array} \right. \end{aligned}$$

    It is easy to verify that this function satisfies the majorization property in (17). For the case \(g(Z) =0\), we have \((-Z) \in \mathcal {K}^n_+(r)\) and

    $$\begin{aligned} g(D) = \text{ dist }(-D, \mathcal {K}^n_+(r) ) \le \Vert - D - (-Z)\Vert \le \Vert D - Z\Vert _1 = g_1^{(m)}(D; Z). \end{aligned}$$

    For the case \(g(Z) \not =0\), we simply apply the inequality \(a^2+b^2 \ge 2ab\) for any \(a, b \in {\mathbb {R}}\) to get \(g(D) \le g_1^{(m)}(D; Z)\). Finally, it is obvious that \(g_1^{(m)}(Z; Z) = g(Z)\) for both cases. At this point, we cannot say that \(g_1^{(m)}(D; Z)\) is a simple majorization because \(g^2(D)\) is still involved. Fortunately, it can be further majorized by a simple convex quadratic function.

  2. (b)

    Majorization of \(g^2(D)\). We define a new function

    $$\begin{aligned} h(D): = \Vert D\Vert ^2 - g^2(D). \end{aligned}$$
    (19)

    The following has been proved by Qi and Yuan (2014) Prop. 3.4:

    $$\begin{aligned} h(D) \ \text{ is } \text{ convex } \quad \text{ and } \quad -2 \Pi _{\mathcal {K}^n_+(r)} (-D) \in \partial h(D). \end{aligned}$$
    (20)

    By the subgradient inequality for convex functions, we have

    $$\begin{aligned} h(D) \ge h(Z) + \langle -2 \Pi _{\mathcal {K}^n_+(r)} (-Z), \ D-Z \rangle , \quad \forall D, Z \in \mathcal {S}^n. \end{aligned}$$

    Consequently, we have

    $$\begin{aligned} g^2(D) &= \Vert D\Vert ^2 - h(D) \\ &\le \underbrace{\Vert D\Vert ^2 - h(Z) + \langle 2 \Pi _{\mathcal {K}^n_+(r)} (-Z), \ D-Z \rangle }_{\text{ majorization } \text{ of } g^2(D) \,\hbox{at}\, Z}, \end{aligned}$$

    which, together with \(g_1^{(m)}(D;\, Z)\) above, leads to our final majorization function for g(D):

    $$g^{(m)}(D; Z) = \left\{ \begin{array}{ll} \Vert D - Z \Vert _1 &\quad \text{ if } \ g(Z) = 0,\\ \frac{1}{2g(Z)} \Big ( \Vert D\Vert ^2 - h(Z) + \langle 2 \Pi _{\mathcal {K}^n_+(r)} (-Z), \ D-Z \rangle \Big ) + \frac{1}{2} g(Z) &\quad \text{ if } \ g(Z) \not = 0. \end{array} \right.$$
    (21)

    We now define a majorization for the penalty problem (15):

    $$\begin{aligned} f^{(m)}_\rho (D; Z) = f(D) + \rho g^{(m)}(D; Z) \end{aligned}$$

    and the majorization–minimization procedure gives rise to the following update on the current iterate \(D^k\):

    $$\begin{aligned} D^{k+1} = \arg \min _D \ f^{(m)}_\rho (D; D^k), \qquad \text{ s.t. } \quad A \le D \le B. \end{aligned}$$
    (22)

    It is not hard to see that the majorization problem has a separable structure and is equivalent to \(n(n-1)/2\) one-dimensional problems. We derive closed-form solutions for those problems below.

3.3 Solution of the majorization problem (22)

In this part, we derive the closed-form formula for the solution of (22). Depending on how the majorization is defined, we have two cases to consider.

(a) Case 1: \(g(D^k) = 0\). For this case, we have

$$\begin{aligned} f^{(m)}_\rho (D; D^k) = \sum _{i,j=1}^n \Big ( S_{ij} + \nu W_{ij} \Big ) D_{ij} - 2\nu \delta _{ij} W_{ij} \sqrt{D_{ij}} + \rho |D_{ij} - D^k_{ij} |+ \mathcal {R}_k, \end{aligned}$$

where \(\mathcal {R}_k\) is independent of D. Hence, computing \(D^{k+1}\) is equivalent to solving the following \(n(n-1)/2\) one-dimensional problems, one for each pair (i, j) with \(i < j\):

$$\begin{aligned} D^{k+1}_{ij}= & {} \arg \min \ \Big ( S_{ij} + \nu W_{ij} \Big ) D_{ij} - 2\nu \delta _{ij} W_{ij} \sqrt{D_{ij}} + \rho |D_{ij} - D^k_{ij} |\nonumber \\{} & {} \text{ s.t. } \quad \qquad A_{ij} \le D_{ij} \le B_{ij}. \end{aligned}$$
(23)

Each of those problems is equivalent to finding an optimal solution of the following one-dimensional problem:

$$\begin{aligned} \min \ \psi (x):= \alpha x - \beta \sqrt{x} + \gamma |x - \eta |, \qquad \text{ s.t. } \quad a \le x \le b, \end{aligned}$$
(24)

where \(\alpha >0\), \(\beta >0\), \(\gamma >0\) and \(\eta \ge 0\), \(a\ge 0\), \(b \ge 0\) are given constants with \(\eta \in [a, b]\). This is a convex problem, and it is easy to find its optimal solution.

Proposition 2

Define the following two quantities:

$$\begin{aligned} T_1(\alpha , \beta , \gamma , \eta , a, b)&:= {} \left\{ \begin{array}{ll} \eta &{} \ \text{ if } \ \alpha -\gamma \le 0 \\ \Pi _{[a,\; \eta ]} \Big ( \beta ^2/(4 (\alpha -\gamma )^2) \Big ) &{} \ \text{ otherwise }, \end{array} \right. \\ T_2(\alpha , \beta , \gamma , \eta , a, b) &:= {} \Pi _{[\eta ,\; b]} \Big ( \beta ^2/(4 (\alpha +\gamma )^2) \Big ). \end{aligned}$$

Then the optimal solution of (24) is given by

$$\begin{aligned} \mathcal {T}(\alpha , \beta , \gamma , \eta , a, b) = \arg \min \Big \{ \psi ( T_1(\alpha , \beta , \gamma , \eta , a, b)), \ \psi ( T_2(\alpha , \beta , \gamma , \eta , a, b)) \Big \}. \end{aligned}$$

Consequently, the optimal \(D^{k+1}\) in (23) is given by

$$\begin{aligned} D^{k+1}_{ij} = \mathcal {T}\left( S_{ij} + \nu W_{ij}, \; 2\nu \delta _{ij} W_{ij}, \; \rho ,\; D^k_{ij}, \; A_{ij}, \; B_{ij} \right) , \qquad i<j = 2, \ldots , n. \end{aligned}$$
(25)
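A possible Python transcription of Proposition 2 (the function name is ours) is given below; per (25), it would be applied entrywise with \(\alpha = S_{ij} + \nu W_{ij}\), \(\beta = 2\nu \delta _{ij} W_{ij}\), \(\gamma = \rho\), \(\eta = D^k_{ij}\) and the box bounds \(a = A_{ij}\), \(b = B_{ij}\):

```python
import numpy as np

def prop2_solve(alpha, beta, gamma, eta, a, b):
    """Closed-form minimizer of psi(x) = alpha*x - beta*sqrt(x) + gamma*|x - eta|
    over [a, b], assuming alpha, gamma > 0, beta >= 0 and a <= eta <= b."""
    def psi(x):
        return alpha * x - beta * np.sqrt(x) + gamma * abs(x - eta)

    # Candidate on [a, eta]: the kink eta when the slope alpha - gamma is nonpositive,
    # otherwise the stationary point of (alpha - gamma)x - beta*sqrt(x), projected.
    if alpha - gamma <= 0:
        t1 = eta
    else:
        t1 = np.clip(beta ** 2 / (4.0 * (alpha - gamma) ** 2), a, eta)
    # Candidate on [eta, b]: stationary point of (alpha + gamma)x - beta*sqrt(x), projected.
    t2 = np.clip(beta ** 2 / (4.0 * (alpha + gamma) ** 2), eta, b)
    return t1 if psi(t1) <= psi(t2) else t2
```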

(b) Case 2: \(g(D^k) \not =0\). In this case, we have

$$\begin{aligned} f^{(m)}(D; D^k) = \sum _{i,j=1}^n \frac{\rho }{2g_k} D_{ij}^2 + \left( S_{ij} + \nu W_{ij} - \frac{\rho \widehat{D}^k_{ij}}{g_k}\right) D_{ij} - 2 \nu \delta _{ij} W_{ij} \sqrt{D_{ij}} + \mathcal {R}_k, \end{aligned}$$

where \(\mathcal {R}_k\) is a term independent of D, and

$$\begin{aligned} \widehat{D}^k:= - \Pi _{\mathcal {K}^n_+(r)} (-D^k), \quad g_k:= g(D^k) = \Vert D^k - \widehat{D}^k \Vert , \quad \rho _k:= g_k/\rho . \end{aligned}$$
(26)

Hence, computing \(D^{k+1}\) is equivalent to solving the following \(n(n-1)/2\) one-dimensional problems, one for each pair (i, j) with \(i < j\):

$$\begin{aligned} D^{k+1}_{ij}= & {} \arg \min _{D_{ij}} \ \frac{\rho }{2g_k} D_{ij}^2 + \left( S_{ij} + \nu W_{ij} - \frac{\rho \widehat{D}^k_{ij}}{g_k} \right) D_{ij} - 2\nu \delta _{ij} W_{ij} \sqrt{D_{ij}} \nonumber \\= & {} \arg \min _{D_{ij}} \ \frac{1}{2} D_{ij}^2 + \left( \frac{g_k}{\rho }(S_{ij} + \nu W_{ij} ) - \widehat{D}^k_{ij} \right) D_{ij} - \left( \frac{2\nu g_k \delta _{ij} W_{ij}}{\rho }\right) \sqrt{D_{ij}} \nonumber \\= & {} \arg \min _{D_{ij}}\ \frac{1}{2} \left[ D_{ij} - \left( \widehat{D}^k_{ij} - \rho _k(S_{ij} + \nu W_{ij} ) \right) \right] ^2 - 2\left( \nu \rho _k \delta _{ij} W_{ij} \right) \sqrt{D_{ij}} \nonumber \\{} & {} \text{ s.t. } \quad \qquad A_{ij} \le D_{ij} \le B_{ij}. \end{aligned}$$
(27)

Each of those problems is equivalent to finding an optimal solution of the following one-dimensional problem:

$$\begin{aligned} \min \ \varphi (x):= \frac{1}{2} (x - \omega )^2 - 2\lambda \sqrt{x}, \qquad \text{ s.t. } \quad a \le x \le b, \end{aligned}$$
(28)

where \(\omega \in {\mathbb {R}}\), \(\lambda \ge 0\), \(a\ge 0\), \(b \ge 0\) are given constants. It is a convex problem and its optimal solution has a closed form.

Proposition 3

[Zhou et al. (2020) Prop. 2] Define the following quantities:

$$\begin{aligned} p:= \frac{\omega }{3}, \quad q:= \frac{\lambda }{2}, \quad \tau := q^2 - p^3, \quad \theta := \arccos \left( q|p|^{-3/2} \right) , \end{aligned}$$

and

$$\begin{aligned} T_3 (\omega , \lambda , a, b):= \left\{ \begin{array}{ll} \left[ \left( q + \sqrt{\tau } \right) ^{1/3} + \left( q - \sqrt{\tau } \right) ^{1/3} \right] ^2 &{}\quad \text{ if } \ \tau \ge 0 \\ 4p \cos ^2(\theta /3) &{}\quad \text{ if } \ \tau < 0. \end{array} \right. \end{aligned}$$

Then the optimal solution of (28) is given by

$$\begin{aligned} \mathcal {S}(\omega , \; \lambda , \; a, \; b):= \min \Big \{ b, \; \max \{a,\ T_3 (\omega , \lambda , a, b) \} \Big \}. \end{aligned}$$

Consequently, the update in (27) is given by

$$\begin{aligned} D^{k+1}_{ij} = \mathcal {S}(\widehat{D}^k_{ij} - \rho _k(S_{ij} + \nu W_{ij} ), \; \nu \rho _k \delta _{ij} W_{ij},\; A_{ij}, \; B_{ij} ), \qquad i<j = 2, \ldots , n. \end{aligned}$$
(29)
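A possible Python transcription of Proposition 3 (the function name is ours; the argument of arccos is clipped to guard against round-off) reads as follows; per (29), it would be applied entrywise with \(\omega = \widehat{D}^k_{ij} - \rho _k(S_{ij} + \nu W_{ij})\), \(\lambda = \nu \rho _k \delta _{ij} W_{ij}\) and the box bounds \(a = A_{ij}\), \(b = B_{ij}\):

```python
import numpy as np

def prop3_solve(omega, lam, a, b):
    """Closed-form minimizer of phi(x) = 0.5*(x - omega)^2 - 2*lam*sqrt(x)
    over [a, b], assuming lam >= 0 and 0 <= a <= b."""
    p = omega / 3.0
    q = lam / 2.0
    tau = q ** 2 - p ** 3            # discriminant of t^3 - omega*t - lam = 0, t = sqrt(x)
    if tau >= 0.0:                   # one real root: Cardano's formula
        x = (np.cbrt(q + np.sqrt(tau)) + np.cbrt(q - np.sqrt(tau))) ** 2
    else:                            # three real roots (here p > 0): take the positive one
        theta = np.arccos(min(1.0, q * abs(p) ** (-1.5)))
        x = 4.0 * p * np.cos(theta / 3.0) ** 2
    return min(b, max(a, x))         # project onto the box [a, b]
```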

(c) Computing the projection \(\Pi _{\mathcal {K}^n_+(r)} (-D^k)\). In both formulas (25) and (29), we need this projection in order to calculate \(g_k\) and \(\widehat{D}^k\). It turns out to be the most time-consuming part of our algorithm: based on our numerical experience, about 80% of the CPU time is spent, on average, on computing this projection, which requires an eigenvalue decomposition. We briefly describe it below; the details can be found in [Zhou et al. (2018) Section 2].

Suppose \(Z \in \mathcal {S}^n\) has the following spectral decomposition:

$$\begin{aligned} Z = \lambda _1 \textbf{p}_1 \textbf{p}_1^T + \lambda _2 \textbf{p}_2 \textbf{p}_2^T + \cdots + \lambda _n \textbf{p}_n \textbf{p}_n^T, \end{aligned}$$
(30)

where \(\lambda _1 \ge \lambda _2 \ge \ldots \ge \lambda _n\) are the eigenvalues of Z in non-increasing order, and \(\textbf{p}_i\), \(i=1, \ldots , n\), are the corresponding orthonormal eigenvectors. We define a PCA-style matrix truncated at r:

$$\begin{aligned} \text{ PCA}_r^+ (Z):= \sum _{i=1}^r \max \{0, \lambda _i\} \textbf{p}_i \textbf{p}_i^T. \end{aligned}$$
(31)

Then we have

$$\begin{aligned} \Pi _{\mathcal {K}^n_+(r)}(Z) = \text{ PCA}_r^+(JZJ) + (Z -JZJ). \end{aligned}$$
(32)

We note that we do not need to compute a full eigenvalue decomposition of JZJ. It is sufficient to compute the r leading eigenvalues and the corresponding eigenvectors of JZJ. When r is small (e.g., \(r=2\) or 3 for visualization), the computation is much faster than a full eigenvalue-eigenvector decomposition.
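A Python/NumPy sketch of (30)-(32) and of the distance function g(D) in (13) is given below (the function names are ours; for simplicity a full eigenvalue decomposition is computed, whereas, as noted above, only the r leading eigenpairs of JZJ are actually needed):

```python
import numpy as np

def pca_r_plus(Z, r):
    """PCA-style truncation (31): keep the r largest eigenvalues of Z,
    thresholded at zero, together with their eigenvectors."""
    vals, vecs = np.linalg.eigh(Z)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:r]         # indices of the r largest eigenvalues
    lam = np.maximum(vals[idx], 0.0)
    P = vecs[:, idx]
    return (P * lam) @ P.T

def proj_K_plus_r(Z, r):
    """Projection onto K^n_+(r) as in (32)."""
    n = Z.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    JZJ = J @ Z @ J
    return pca_r_plus(JZJ, r) + (Z - JZJ)

def g_dist(D, r):
    """Distance function g(D) = ||D + Pi_{K^n_+(r)}(-D)|| in (13)."""
    return np.linalg.norm(D + proj_K_plus_r(-D, r), 'fro')
```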

3.4 SMVU algorithm and its convergence

Now we are ready to describe our SMVU method in Algorithm 1.

Algorithm 1 SMVU Method

Convergence analysis of Algorithm 1 is given below for the two cases \(g(D^k) = 0\) and \(g(D^k) > 0\).

(a) Case 1: \(g(D^k) = 0\).

$$\begin{aligned} f_{\rho }(D^{k+1}) - f_{\rho }(D^{k}) = f(D^{k+1})+\rho g(D^{k+1}) - f(D^{k}) \end{aligned}$$

Since \(g^{(m)}(D^{k+1}; D^{k}) = \Vert D^{k+1} - D^{k} \Vert _1\) in this case, the update (22) gives

$$\begin{aligned} f(D^{k+1}) + \rho \Vert D^{k+1} - D^{k} \Vert _1 \le f(D^{k}), \end{aligned}$$

and

$$\begin{aligned} f(D^{k+1}) - f(D^{k}) \le -\rho \Vert D^{k+1} - D^{k} \Vert _1 \le -\rho \Vert D^{k+1} - D^{k} \Vert . \end{aligned}$$
(33)

Under the Lipschitz condition

$$\begin{aligned} |f(D^{k+1}) - f(D^{k}) |\le \kappa \Vert D^{k+1} - D^{k} \Vert , \end{aligned}$$

and \(\kappa < \rho\), we have \(D^{k+1} = D^{k}\).

(b) Case 2: \(g(D^k) > 0\). The optimality condition of (22) implies that

$$\begin{aligned} \langle \nabla f^{(m)}_\rho (D^{k+1}; D^k), D^{k} - D^{k+1}\rangle = \left\langle \nabla f(D^{k+1}) + \rho \frac{D^{k+1} + \Pi _{\mathcal {K}^n_+(r)} (-D^{k}) }{g(D^{k})}, D^{k}-D^{k+1}\right\rangle \ge 0, \end{aligned}$$
(34)

With the convexity of f(D),

$$\begin{aligned} f(D^{k+1}) - f(D^{k}) \le \langle \nabla f(D^{k+1}), D^{k+1}-D^{k}\rangle , \end{aligned}$$
(35)

and the following equations and inequality mentioned in Qi and Yuan (2014) and Zhou et al. (2018)

$$\begin{aligned}{} & {} \Vert D +\Pi _{\mathcal {K}^n_+(r)} (-D) \Vert ^2 = \Vert D\Vert ^2 - \Vert \Pi _{\mathcal {K}^n_+(r)} (-D) \Vert ^2, \end{aligned}$$
(36)
$$\begin{aligned}{} & {} \Vert D^{k+1}\Vert ^2 - \Vert D^{k}\Vert ^2 = 2\left\langle D^{k+1}, D^{k+1}-D^{k}\right\rangle - \Vert D^{k+1} - D^{k} \Vert ^2, \end{aligned}$$
(37)
$$\begin{aligned}{} & {} \Vert \Pi _{\mathcal {K}^n_+(r)} (-D^{k+1}) \Vert ^2 - \Vert \Pi _{\mathcal {K}^n_+(r)} (-D^{k}) \Vert ^2 \ge -2\left\langle \Pi _{\mathcal {K}^n_+(r)} (-D^{k}), D^{k+1}-D^{k}\right\rangle , \end{aligned}$$
(38)

a chain of inequalities can be deduced as:

$$\begin{aligned} \begin{array}{ll} &{}f(D^{k+1}) - f(D^{k}) + \rho g(D^{k+1}) - \rho g(D^{k})\\ &{}\quad \le \left\langle \nabla f(D^{k+1}), D^{k+1}-D^{k} \right\rangle + \rho \left( \frac{g^2(D^{k+1})}{2g(D^{k})} - \frac{1}{2} g(D^{k})\right) \\ &{}\quad \overset{(36)}{=} \left\langle \nabla f(D^{k+1}), D^{k+1}-D^{k}\right\rangle + \frac{\rho }{2g(D^{k})} \left( \Vert D^{k+1}\Vert ^2 - \Vert D^{k}\Vert ^2 \right) \\ &{}\quad \quad - \frac{\rho }{2g(D^{k})} \left( \Vert \Pi _{\mathcal {K}^n_+(r)} (-D^{k+1}) \Vert ^2 - \Vert \Pi _{\mathcal {K}^n_+(r)} (-D^{k}) \Vert ^2 \right) \\ &{}\quad \overset{(37)}{=} \left\langle \nabla f(D^{k+1}) + \frac{\rho }{g(D^{k})} D^{k+1}, D^{k+1}-D^{k}\right\rangle - \frac{\rho \Vert D^{k+1} - D^{k} \Vert ^2}{2g(D^{k})} \\ &{}\quad \quad - \frac{\rho }{2g(D^{k})} \left( \Vert \Pi _{\mathcal {K}^n_+(r)} (-D^{k+1}) \Vert ^2 - \Vert \Pi _{\mathcal {K}^n_+(r)} (-D^{k}) \Vert ^2 \right) \\ &{}\quad \overset{(38)}{ \le } \left\langle \nabla f(D^{k+1}) + \frac{\rho }{g(D^{k}) } D^{k+1} + \frac{\rho }{g(D^{k}) }\Pi _{\mathcal {K}^n_+(r)} (-D^{k}), D^{k+1}-D^{k} \right\rangle - \frac{\rho \Vert D^{k+1} - D^{k} \Vert ^2}{2g(D^{k})}\\ &{}\quad \overset{(34)}{ \le } - \frac{\rho \Vert D^{k+1} - D^{k} \Vert ^2}{2g(D^{k})} \end{array} \end{aligned}$$
(39)

It can be seen from (39) that \(\{f_{\rho }(D^{k})\}\) is non-increasing, and it is bounded from below by its definition. Therefore, as \(k \rightarrow \infty\), \(\rho \Vert D^{k+1} - D^{k} \Vert ^2/(2g(D^{k})) \rightarrow 0\). Since the elements of \(D^{k}\) are bounded, \(g(D^{k}) < +\infty\), and hence \(\Vert D^{k+1} - D^{k} \Vert ^2 \rightarrow 0\).

As for the computational complexity of the SMVU algorithm, the cost of evaluating the majorization function \(g^{(m)}\) in (21) is dominated by the projection \(\Pi _{\mathcal {K}^n_+(r)} (-D)\); therefore the overall complexity of each step is about \(\mathcal {O}(rn^{2} )\) (Zhou et al., 2018).
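To make the overall procedure concrete, the following sketch assembles one possible main loop of Algorithm 1 from the helper functions sketched earlier (proj_K_plus_r, g_dist, prop3_solve and embed_from_edm). It is our own simplified illustration: only the \(g(D^k) > 0\) branch (Case 2) is implemented (the \(g(D^k) = 0\) branch would use prop2_solve in the same entrywise fashion), the penalty parameter \(\rho\) is a fixed input rather than being updated as described in Sect. 4.3, and the stopping test is a simple tolerance check rather than the criteria (40) and (41).

```python
import numpy as np

def smvu_solve(Delta, L, r, nu=1.1, rho=10.0, W=None, A=None, B=None,
               max_iter=500, tol=1e-5):
    """Simplified majorization-minimization loop for the SMVU model (9)."""
    n = Delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    S = J @ L @ J / 2.0                               # S := JLJ/2 as in (9)
    W = np.ones((n, n)) if W is None else W
    A = np.zeros((n, n)) if A is None else A.copy()
    B = np.full((n, n), np.inf) if B is None else B.copy()
    np.fill_diagonal(A, 0.0)
    np.fill_diagonal(B, 0.0)                          # enforce diag(D) = 0 via the box
    D = Delta ** 2                                    # start from the squared dissimilarities

    for _ in range(max_iter):
        gk = g_dist(D, r)
        if gk < tol:                                  # (numerically) feasible: stop
            break
        Dhat = -proj_K_plus_r(-D, r)                  # \widehat{D}^k in (26)
        rho_k = gk / rho
        D_new = np.empty_like(D)
        for i in range(n):                            # entrywise closed-form update (29)
            for j in range(n):
                omega = Dhat[i, j] - rho_k * (S[i, j] + nu * W[i, j])
                lam = nu * rho_k * Delta[i, j] * W[i, j]
                D_new[i, j] = prop3_solve(omega, lam, A[i, j], B[i, j])
        if np.linalg.norm(D_new - D) <= tol * (1.0 + np.linalg.norm(D)):
            D = D_new
            break
        D = D_new
    return embed_from_edm(D, r), D                    # embedding via (7) and the final EDM
```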

4 Numerical experiments

The main purpose of this part is to demonstrate the effectiveness of the SMVU-EDM model in data visualisation and classification. A desktop with 16 GB of memory and an Intel(R) Core(TM) i7-10750 2.59 GHz CPU, together with the IRIDIS5 High-Performance Computing Facility (with associated support services) at the University of Southampton, are used, and the codes are developed in MATLAB (R2020a). This section includes datasets and benchmark methods, the parameter setting and tuning scheme, and the numerical comparison.

4.1 Data sets and benchmark methods

In this section, the effectiveness of SMVU-EDM is demonstrated on diverse types of data, including RGB images, grayscale images, binary images and categorical sequential data. The results are compared with four benchmark methods: Colored MVU (CMVU), linear discriminant analysis (LDA) and WeightedIso (Vlachos et al., 2002) (supervised methods), and t-SNE (Van der Maaten and Hinton, 2008) (an unsupervised method).

  1. (a)

    CIFAR-10 Dataset (Krizhevsky and Hinton, 2009). The CIFAR-10 dataset is a set of \(32\times 32\) RGB colour images in 10 classes, including aeroplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. We will compute 2-dimensional visualizations of one of the batches in this dataset, in which there are a total of 10,000 images, and each class contains the same number of samples.

  2. (b)

    Splice-junction Gene Sequences Dataset (Asuncion and Newman, 2007). The splice-junction gene (SJG) sequences dataset consists of 3190 DNA sequences that have been classified into three types: sequences that are the superfluous parts remaining after splicing (EI), sequences that are spliced out from the original sequence (IE), and sequences on which no splicing takes place (Neither).

  3. (c)

    EMNIST Dataset (Cohen et al., 2017). The EMNIST dataset consists of \(28\times 28\) handwritten characters, including both uppercase and lowercase letters. This research uses the "Letters" split, where the uppercase letters and the corresponding lowercase letters are merged into the same classes. Thus there are 26 classes in this split of the EMNIST dataset. In the following, we provide 2-dimensional visualizations of 20,800 samples, with each class containing the same number of samples.

  4. (d)

    Medical MNIST Datasets (Yang et al., 2021). These datasets contain pre-processed medical images of size \(28\times 28\). The images in Yang et al. (2021) are standardized. This pre-processing of the image data lets us assume that Euclidean distances between images are comparable. The data used include an RGB dermatoscope image dataset ("DermaMNIST"), grayscale abdominal CT image data ("OrganAMNIST", "OrganCMNIST"), and an RGB blood cell microscope dataset ("BloodMNIST").

  5. (e)

    Anticancer Peptides Sequences Dataset (Grisoni et al., 2019). The anticancer peptides sequences (ACPs) dataset stores one-letter amino-acid sequences labelled by their anticancer activity on breast and lung cancer cell lines. The sub-dataset targeting breast cancer contains 949 amino-acid sequence observations, and the one for lung cancer consists of 901 observations. These sequences have various lengths and are classified into four classes: "active", "moderately active", "experimental inactive", and "virtual inactive".

Besides the comparison with CMVU, t-SNE (Van der Maaten and Hinton, 2008), linear discriminant analysis (LDA) and weighted ISOMAP (WeightedIso) (Vlachos et al., 2002) are included in the following visualization and numerical comparison. For eigen-decomposition methods such as SMVU-EDM, CMVU, WeightedIso and LDA, we adopt the scheme of Song et al. (2007): the output dimension r is set to 10 and the first two or three dimensions are visualized. The EMNIST and CIFAR-10 datasets contain a huge amount of data, which is challenging for most existing dimensionality reduction methods. From the medical MNIST collection, we only provide the 2D visualizations for the "BloodMNIST" dataset; the numerical results for the remaining datasets are listed in Table 1. The criteria for selecting benchmark methods are their capability of handling big data sets and of yielding quality visualizations. The codes of t-SNE, LDA and WeightedIso are provided by the Matlab toolbox for dimensionality reduction (Van der Maaten et al., 2007). The initial setting of t-SNE is the MATLAB default (e.g., the exaggeration is set to 4). The code of CMVU is provided by Song et al. (2007).

4.2 Quality measurement of dimensionality reduction

There are two desirable features for a successful dimensionality reduction method. One is that it yields a visualization with a pronounced cluster structure, and the other is that the inherent geometric characteristics are preserved to a certain degree. Correlation (Gracia et al., 2014) is commonly employed for dimensionality reduction quality measurement. In our case, it is computed between the vectorized version of the original distance matrix \(\Delta\) and the D obtained by a model. Since the dimensionality reduction result is produced via supervision, in this paper between-class structures and within-class structures are measured separately. For pairs of points (i, j), we write \(i\sim j\) if the two points belong to the same class and denote the set of such pairs by \(\mathbb {P}_{ \text {inClass}} = \{(i,j)|i\sim j , 1\le i,j \le n \}\); the pairs belonging to different classes form \(\mathbb {P}_{ \text {beClass}} = \{(i,j)|i \not \sim j , 1\le i,j \le n \}\). Accordingly, the sets of original distances \(\{ \delta _{ij}|1\le i,j\le n\}\) and embedded distances \(\{ D_{ij}|1\le i,j\le n\}\) can be divided into in-class distance sets \(\delta _{\text {inClass}} = \{\delta _{ij} |(i,j)\in \mathbb {P}_{ \text {inClass}} \}\), \(D_{\text {inClass}} = \{D_{ij} |(i,j)\in \mathbb {P}_{ \text {inClass}} \}\), and between-class distance sets \(\delta _{\text {beClass}} = \{\delta _{ij} |(i,j)\in \mathbb {P}_{ \text {beClass}} \}\), \(D_{\text {beClass}} = \{D_{ij} |(i,j)\in \mathbb {P}_{ \text {beClass}} \}\). We take the Pearson correlation coefficient between \(\delta _{\text {inClass}}\) and \(D_{\text {inClass}}\), \(Corr_\text {inClass} = Corr(\delta _{\text {inClass}}, D_{\text {inClass}})\), and the one between \(\delta _{\text {beClass}}\) and \(D_{\text {beClass}}\), \(Corr_\text {beClass} = Corr(\delta _{\text {beClass}}, D_{\text {beClass}})\), as quality measurements of how well the dimensionality reduction methods preserve geometric characteristics (Espadoto et al., 2019; Paul and Chalup, 2017; Kalousis et al., 2004). These correlations may be regarded as a global measurement of how well the distances are preserved in order.

Locally, we apply the local rank correlation coefficient (LRCorr) to measure how well a dimensionality reduction method reconstructs the local distance rank structure. Let \(\mathbf {\delta }^{i} := \{\delta _{i1}, \delta _{i2}, \ldots , \delta _{in}\}\) be the distance sequence derived from the ith point, and let \(D^{i} := \{D_{i1}, D_{i2}, \ldots , D_{in}\}\) be the corresponding distance sequence in the dimensionality reduction result. Furthermore, with \(\delta ^{i}_{\text {inClass}} := \{\delta _{ij}|\delta _{ij} \in \delta ^{i}, i\sim j\}\) and \(D^{i}_{\text {inClass}} := \{D_{ij}|D_{ij} \in D^{i}, i\sim j\}\), the local in-class distance rank correlation coefficient of the ith point can be written as \(LRCorr^{i}_{\text {inClass}} = LRCorr(\delta ^{i}_{\text {inClass}},D^{i}_{\text {inClass}})\). Similarly, for the between-class situation, with \(\delta ^{i}_{\text {beClass}} := \{\delta _{ij}|\delta _{ij} \in \delta ^{i}, i\not \sim j\}\) and \(D^{i}_{\text {beClass}} := \{D_{ij}|D_{ij} \in D^{i}, i\not \sim j\}\), we have the local between-class distance rank correlation coefficient of the ith point as \(LRCorr^{i}_{\text {beClass}} = LRCorr(\delta ^{i}_{\text {beClass}},D^{i}_{\text {beClass}})\). We then take the average of all the \(LRCorr^{i}_{\text {inClass}}\) and that of all the \(LRCorr^{i}_{\text {beClass}}\) as respective measurements of the quality of local in-class and between-class distance rank structure preservation.

Finally, the Silhouette coefficient is used to measure whether the supervised dimensionality reduction result has a well-separated structure. The Silhouette coefficient computed at one point is based on the mean intra-cluster distance and the mean distance to the nearest neighbouring cluster. Denoting by s(i) the Silhouette coefficient of the ith point, the average of the s(i) is the overall Silhouette coefficient we adopt to measure the dimensionality reduction result.

4.3 On implementation of SMVU-EDM

Fig. 1 Dimensionality reduction quality comparison for different distance types. This comparison is based on the CIFAR dataset. (o is order of the power of Euclidean distance and p is the order of Minkowski distance. In this experiment, we set \(\nu = 1.1\) in SMVU-EDM, \(\nu = 1\) in CMVU, \(Perp = 30\) in t-SNE and \(\lambda = 0.1\) in WeightedIso) (Color figure online)

Fig. 2 Dimensionality reduction quality comparison for key tuning parameter of different methods (The dissimilarity type applied in this experiment is the Euclidean distance. For clear comparison, we project the values of the tuning parameters of different methods into uniform projected value (PV) which ranges from 0.1 to 10: for SMVU-EDM, \(PV = \nu - 1\); for CMVU, \(PV = \nu\); for t-SNE, \(PV = Perp/10\); for WeightedIso, \(PV = \lambda \times 10\)) (Color figure online)

This part describes how we have implemented SMVU-EDM, including its stopping criteria and the choices of the dissimilarity measurement, the balance parameter \(\nu\) and the penalty parameter \(\rho\).

  1. (a)

    Stopping criterion Following Zhou et al. (2018) for the SMVU-EDM algorithm, we adopt a double convergence criterion that terminates the algorithm if

    $$\begin{aligned} \eta _{f}(D^{k-1},D^{k}) = \frac{f_{\rho }(D^{k-1})-f_{\rho }(D^{k})}{1+f_{\rho }(D^{k-1})} \le \epsilon _1 \end{aligned}$$
    (40)

    and

    $$\begin{aligned} \eta _{\kappa }(D^{k}) = \frac{2g(D^{k})}{\Vert JD^{k}J\Vert ^{2}} \le \epsilon _2, \end{aligned}$$
    (41)

    for given \(\epsilon _{1} >0\) and \(\epsilon _{2}>0\). In our implementation, we used \(\epsilon _{1}=\sqrt{n}\times 10^{-3}\), \(\epsilon _{2}=10^{-3}\), where n is the size of the training set.

  2. (b)

    Choice of dissimilarity measurement. SMVU-EDM, CMVU, t-SNE and WeightedIso all take pairwise dissimilarities between observations as input. There is always a question as to which dissimilarities are the most appropriate to use. There is no universal answer, as it depends on the properties of the algorithms concerned and the characteristics of the data to be analyzed (Wang et al., 2005; Peng et al., 2019; Ting et al., 2019). To explore how different types of dissimilarity affect the output of the different methods, we conduct tests on a randomly picked subset of the Cifar dataset of size 1000 and on the anticancer peptides sequences (breast cancer) dataset, representing the numerical and the categorical sequential data sets, respectively. For numerical data sets like the Cifar dataset, different kinds of distance, including the Minkowski distance (\(\delta _{ij}^{\text {Minkowski}-p} = \Vert \textbf{x}_{i} - \textbf{x}_{j}\Vert _{p}\)) and powers of the Euclidean distance (\(\delta _{ij}^{o} = \Vert \textbf{x}_{i} - \textbf{x}_{j}\Vert _{2}^{o}\)), are applicable. The types of distances used in this experiment are the Minkowski distance (Yu et al., 2008) of order 3 (p = 3), the city-block distance (p = 1), the Chebyshev distance (p = \(\infty\)) and powers of the Euclidean distance of order o = 0.15, 0.25, 0.5, 1, 2. For categorical sequential data sets, the matlab function seqpdist is used with its default setting (the Jukes-Cantor distance, Jukes and Cantor (1969)), and powers of the sequential distance of different orders (o = 0.15, 0.25, 0.5, 1, 2) are compared.

    We randomly sample the Cifar data set, run each algorithm ten times, and average the results. In this experiment, we set \(\nu = 1.1\) in SMVU-EDM, \(\nu = 1\) in CMVU, \(Perp = 30\) in t-SNE and \(\lambda = 0.1\) in WeightedIso. The in-class correlation, between-class correlation and Silhouette coefficient are taken as the dimensionality reduction quality measurements. As found in Fig. 1, with the Euclidean distance (o=1), the squared Euclidean distance (o=2), the city-block distance (p=1) and the Minkowski distance of order 3 (p = 3), SMVU-EDM has relatively higher in-class and between-class correlations. With a smaller order of the Euclidean distance, the in-class and between-class correlations decrease. Nevertheless, when the Silhouette coefficient rises above 0.9, which happens when the order o is equal to or less than 0.25, the in-class and between-class correlations are still acceptable. As for t-SNE, when the order o decreases, the in-class correlation, the between-class correlation and the Silhouette coefficient remain steady. A Minkowski distance of higher order leads to a slight increase in the Silhouette coefficient for t-SNE at the price of decreasing correlations. The order of the dissimilarity also affects the performance of WeightedIso: when the Silhouette coefficient is greater than 0.9, the decrease in the in-class correlation is significant, and so is that in the between-class correlation.

  3. (c)

    Choice of balancing parameter \(\nu\). The performance of the key tuning parameter of each method is compared in Fig. 2, including the balancing parameter \(\nu\) of SMVU-EDM, the perplexity Perp of t-SNE, the balancing parameter \(\nu\) of CMVU and the weighting parameter \(\lambda\) of WeightedIso. Different choices of the balancing parameter \(\nu\) for SMVU-EDM emphasize the HSIC part at different levels and lead to different dimensionality reduction results. As shown in Fig. 2, when \(\nu\) increases, the between-class correlation stays steady, while the in-class correlation increases by around \(10\%\). Overall, with a relatively small \(\nu\) in the range between 1.1 and 2, the SMVU-EDM model produces a well-separated result with relatively superior between-class structure preservation and acceptable in-class structure preservation. With respect to WeightedIso, when the weighting parameter \(\lambda\) is between 0.01 and 0.2, the result is well separated, but both the in-class and the between-class correlations are significantly reduced. It can also be noticed that when the perplexity is greater than 10, the performance of t-SNE becomes steady. As for CMVU, Fig. 2 reveals that tuning the balancing parameter does not lead to a significantly separated result.

    Based on the above analysis, we set up the experiment scheme as follows. Since tuning the balancing parameter \(\nu\) of SMVU-EDM, the order of the dissimilarity and the weighting parameter \(\lambda\) of WeightedIso helps us obtain visualizations with a high Silhouette coefficient, we tune these parameters so as to compare embeddings from SMVU-EDM and WeightedIso with similar Silhouette scores.

    Since the result is steady when the perplexity is large enough, we set the perplexity Perp of t-SNE to 30. As for CMVU, the balancing parameter is set to 1, and the number of nearest neighbours in CMVU is \(1\%\) of the total size of the dataset.

  4. (d)

    Choice of penalty parameter \(\rho\). According to Prop. 2.4.3 of Clarke (1990), when the penalty parameter \(\rho\) is greater than the Lipschitz constant of f(D), the algorithm will converge to a point on the EDM cone. With the derivative \(\frac{\partial f(D)}{\partial D_{ij}}|_{D_{ij}^{k} }= \nu \times W_{ij}\times \left( 1-\frac{\delta _{ij}}{\sqrt{ D_{ij}^{k}}} \right) + S_{ij}\), the scheme for choosing \(\rho\) is as follows (a sketch is given after this list):

    1. 1.

      For the initial point \(D^{0}\), we compute its projection \(D_{P}^{0}\) onto the EDM cone. The initial \(\rho\) is set to \(\Vert \nabla f(D_{P}^{0})\Vert\).

    2. 2.

      Since \(\Vert \nabla f(D^{k}) \Vert\) theoretically decreases as the iterates converge, at every step we check whether \(\Vert \nabla f(D^{k}) \Vert\) is greater than the current value of \(\rho\); if not, the current value of \(\rho\) is kept. If \(\Vert \nabla f(D^{k}) \Vert\) is greater than the current \(\rho\), then \(\rho\) is updated to \(\Vert \nabla f(D^{k}) \Vert\).

    The elements of the weight matrix W in this experiment are set to be equal (i.e., \(W_{ij}=1\) for all i, j).
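The gradient formula and the \(\rho\)-updating rule above can be sketched as follows (our own illustration; the small floor is a safeguard we add against division by zero at \(D_{ij} = 0\), e.g., on the diagonal):

```python
import numpy as np

def grad_f(D, Delta, S, W, nu, floor=1e-12):
    """Entrywise gradient of f(D) in (16):
    df/dD_ij = nu * W_ij * (1 - delta_ij / sqrt(D_ij)) + S_ij."""
    sqrtD = np.sqrt(np.maximum(D, floor))
    return nu * W * (1.0 - Delta / sqrtD) + S

def update_rho(rho, D, Delta, S, W, nu):
    # Keep the current rho if it already dominates ||grad f(D^k)||;
    # otherwise increase it to ||grad f(D^k)||.
    return max(rho, np.linalg.norm(grad_f(D, Delta, S, W, nu)))
```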

4.4 Visualization comparison

Fig. 3 Cifar data visualization of SMVU-EDM (Color figure online)

Fig. 4 Cifar data visualization

Fig. 5 Extracting "ship", "horse", "deer" and "cat" from different visualizations

Fig. 6 Splice-junction gene sequences dataset visualization comparison (Color figure online)

Fig. 7 EMNIST data visualization of SMVU-EDM (Color figure online)

Fig. 8 EMNIST from different visualizations (Color figure online)

Fig. 9 Blood MNIST dataset visualization comparison

  1. (a)

    Cifar Data. The visualization results on the Cifar data, as shown in Figs. 3 and 4, reveal the strong performance of SMVU-EDM compared with the rest of the methods. As shown in the previous section and in Figs. 1 and 2, the power of the Euclidean distance with order 0.25 and the setting \(\nu = 1.1\) lead SMVU-EDM to produce a quality result on the sampled Cifar dataset. We adopt this setting on the whole Cifar dataset. In Fig. 3, trucks (brownish-red) and automobiles (Han blue) are clustered together, and this group is separated from the other clusters. This reveals the fact that images of trucks and automobiles are relatively more similar to each other than to other categories. Images of aeroplanes (dark purple) and ships (persimmon), which represent machines, also have clear boundaries against the cluster of images of animals, including horses (deep saffron), frogs (sunglow), dogs (lime), deer (bright green), and cats (blue-green). In the cluster of animal images, the group of dogs overlaps with those of birds and deer, but the cluster of cats is well separated from the others. Even though there are overlaps, the clear boundaries of the clusters can reveal the relations between different categories. Meanwhile, the distributions of different categories can be identified from the results of CMVU and t-SNE. For example, trucks (brownish-red) and automobiles (Han blue) are clustered together. However, the boundaries of the categories are unclear. As for WeightedIso, referring to the experiment mentioned in the previous section, we set the weighting parameter \(\lambda = 0.2\) and use the Euclidean distances. From the visualization of WeightedIso in Fig. 4, it can be noticed that the cluster of trucks and automobiles is clear and that they are separated from the rest of the objects, but the remaining categories are mixed.

    In Fig. 5, we extract the points of "ship", "horse", "deer", and "cat" from the three visualizations. For SMVU-EDM, the clusters of horses and deer partially overlap, but the boundaries around the clusters of cats and ships are clear. Meanwhile, cats have the greatest distance to the group of horses. Overall, the relations between clusters, such as overlaps, closeness with clear boundaries and large distances between clusters, reveal the (dis)similarity between the different groups. In contrast, in the other visualizations, the points of these four classes are scattered over the whole area.

  2. (b)

    Splice-junction Gene Sequences Dataset. Since pairwise Jukes-Cantor distances are taken as input, only the distance-based methods, namely SMVU-EDM, CMVU, WeightedIso and t-SNE, are applied in this experiment. As presented in Fig. 6, t-SNE fails to extract the inherent distribution of the splice-junction gene sequences, whereas SMVU-EDM, WeightedIso and CMVU can cluster the three classes. In this experiment, we set \(\nu = 1.3\) in SMVU-EDM and \(\lambda = 0.6\) in WeightedIso; this setting lets SMVU-EDM and WeightedIso produce results with similar Silhouette coefficients. The three categories have clear boundaries in the results of SMVU-EDM and WeightedIso, but in that of CMVU the clusters of "EI" and "Neither" are merged. It is interesting to note that an outlier appears in the result of WeightedIso. It is a point of the "EI" category (coloured green), and we mark it, together with the corresponding points in all other visualizations, as a bold dot. Compared with WeightedIso, the outlier in the result of SMVU-EDM appears at a much more reasonable distance from the main cluster of "EI". This suggests that SMVU-EDM is more robust to outliers than WeightedIso.

  3. (c)

    EMNIST Data. In Figs. 7 and 8, we show the visualizations of the EMNIST data. We apply SMVU-EDM to the Euclidean distances raised to the power 0.15 with \(\nu = 1.1\). The boundaries of the clusters are clear in the result of SMVU-EDM. At the macroscopic level, the varying closeness between the classes gives information on their relative similarity. In the lower-left part, "o" and "c" are close to each other and relatively far from the upper-right part, which consists of the letters "v", "x", "y", "h", "f" and "t"; this indicates a significant difference between these two groups. For WeightedIso, we set \(\lambda = 0.3\). As shown in Fig. 8, the boundaries of the clusters are also clear; however, outliers frequently appear around the clusters, and consequently the cluster boundaries are expanded. Some of the clusters in the result of t-SNE are distinctive, while others have unclear boundaries; in particular, some points are assigned to the wrong clusters and some categories are torn into separate parts. In the result of CMVU, points belonging to different categories are only vaguely clustered, and it is hard to identify the boundaries and the relationships between categories. As for the result of LDA, the points are concentrated in the centre.

  4. (d)

    Blood MNIST Dataset. Visualizations of the Blood MNIST data are shown in Fig. 9. We set \(\nu = 1.1\) and the order of the dissimilarity \(o=0.5\) for SMVU-EDM, and \(\lambda = 0.3\) and \(o=1\) for WeightedIso. This setting lets the dimensionality reduction results of SMVU-EDM and WeightedIso have similar Silhouette coefficients, 0.9334 and 0.9175, respectively. The visualizations produced by SMVU-EDM, t-SNE, CMVU and WeightedIso all show clear clusters of platelets (dark brown). Clusters of neutrophil images are also clear in the visualizations of SMVU-EDM, t-SNE and WeightedIso. Furthermore, SMVU-EDM and WeightedIso map the neutrophil images into clusters significantly separated from the others. As for the rest of the categories, the clusters are ribbon-shaped in the visualizations of both SMVU-EDM and t-SNE. The cluster of lymphocyte images overlaps that of erythroblast, and so do those of basophil and monocyte. On the other hand, the clusters in the visualization of WeightedIso are shaped like diamonds, but most are isolated.
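As referenced in item (a) above, the following is a minimal sketch of the dissimilarity preprocessing used throughout this comparison: the pairwise Euclidean distances are raised to a fractional power (e.g., 0.25 for Cifar, 0.15 for EMNIST, 0.5 for Blood MNIST) before being fed to SMVU-EDM. The function name and data layout are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def powered_dissimilarities(X, order):
    # X: (n, d) data matrix with one item per row.
    # Returns the (n, n) matrix of pairwise Euclidean distances raised to the
    # given fractional power, used as the input dissimilarities delta_ij.
    return squareform(pdist(X, metric="euclidean")) ** order
```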

4.5 Numerical comparison

This part complements the visualization results by reporting detailed numerical results of the tested methods on all 9 datasets. In particular, we report the correlation coefficients for both inClass and beClass, the Silhouette coefficient, the local rank coefficients and the run time (in seconds). We explain the details below.

Table 1 Numerical comparison

In Table 1, the numerical results for all 9 datasets are recorded. We note that mLRCorr denotes the mean of the in-class ("inClass") or between-class ("beClass") local rank correlations over all points, respectively. Correspondingly, the box-plots in Fig. 10 present the distributions of the local rank correlations of the points.
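As a rough illustration of how such quality measures can be computed, the sketch below evaluates the Silhouette coefficient of an embedding and, for each point, a Spearman rank correlation between the original and embedded distances to its nearest same-class (or different-class) points. This is only one plausible reading of a local rank correlation; the exact definition used for Table 1 may differ.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import silhouette_score

def mean_local_rank_corr(D_high, D_low, labels, k=10, in_class=True):
    # D_high, D_low: (n, n) pairwise distance matrices in the original and
    # embedded spaces; labels: (n,) class labels.
    # For each point, take its k nearest original-space neighbours from the same
    # class (in_class=True) or from other classes, then compute the Spearman rank
    # correlation between the original and embedded distances to those neighbours.
    labels = np.asarray(labels)
    n = D_high.shape[0]
    corrs = []
    for i in range(n):
        mask = (labels == labels[i]) if in_class else (labels != labels[i])
        mask[i] = False
        idx = np.flatnonzero(mask)
        if idx.size < 2:
            continue
        nn = idx[np.argsort(D_high[i, idx])[:k]]
        rho, _ = spearmanr(D_high[i, nn], D_low[i, nn])
        corrs.append(rho)
    return float(np.nanmean(corrs))

# Silhouette coefficient of a 2-D embedding Y with class labels:
#   sil = silhouette_score(Y, labels)
```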

Looking at the overall performance, SMVU-EDM outperforms the other methods in simultaneously producing well-separated results and preserving the quality of DR. On the large datasets, such as "Blood MNIST", "Derma MNIST" and "OrganA MNIST", the results of SMVU-EDM have the highest Silhouette coefficient, in-class correlation and between-class correlation. As for the time consumption, CMVU is the fastest, but the run time of SMVU-EDM is close to or less than those of t-SNE and LDA.

As for WeightedIso, we tune the weighting parameter \(\lambda\) so that the Silhouette coefficients of its results are at the same level as those of SMVU-EDM. We underline the Silhouette coefficients of SMVU-EDM and WeightedIso together to indicate this tuning.

In some cases, WeightedIso produces results with quality close to that of SMVU-EDM, such as the results for the SJG sequences and Blood MNIST. In other cases, SMVU-EDM slightly outperforms WeightedIso, such as on Cifar, EMNIST and OrganC MNIST. However, WeightedIso takes much longer to terminate than the other methods. In particular, on the "OrganA MNIST" dataset, WeightedIso cannot produce a result within the 60 h time cap (indicated by "−" in the table).

Fig. 10 Local rank correlation distribution comparison (bullet points in the boxes of the box-plots represent the median, and the plus symbols "+" represent the outliers)

In Fig. 10, the distributions of the local rank correlations are presented as box-plots. Compared with the other methods, SMVU-EDM preserves the local distance rank structure better. Most of the points in Cifar, Blood MNIST, Derma MNIST, OrganA MNIST and OrganC MNIST have high between-class local rank correlations (the minima of the boxes are greater than 0.7). The in-class local rank correlations of SMVU-EDM are also higher than those of the other methods in some cases, such as Blood MNIST, OrganA MNIST and OrganC MNIST.

5 Conclusion

This paper proposed a new variant of the well-known MVU method for dimensionality reduction. It began with the observation that the objective function in MVU is the squared stress function from the field of multidimensional scaling. We used an example to demonstrate that the squared stress function does not enjoy the usability property, which prevents points at small distances from being crushed to even smaller ones and hence avoids the often observed crowding phenomenon. We replaced the squared stress function by the stress function and proved that the usability property holds for the new model. The new model also allows us to include label information, resulting in a supervised MVU. The model can be efficiently solved by Euclidean Distance Matrix optimization. The algorithm, derived from the majorization-minimization method, converges within acceptable time. We demonstrated the quality of both the model solution and the algorithm against a few leading dimensionality reduction methods on various datasets. We conclude that SMVU-EDM is an effective method for supervised dimensionality reduction and data visualization.

We would also like to point out that, being a spectral method, SMVU-EDM has a computational bottleneck in calculating the r leading eigenvalue-eigenvector pairs of a large matrix. There exist techniques that are potentially useful in reducing this computational complexity. One is to approximate the true eigenspace by the eigenvalue-eigenvector decomposition of a smaller matrix, which could be done via Laplacian regularization as suggested in Weinberger et al. (2007); this technique would then be applied to the Euclidean distance matrix instead of a kernel matrix. We will investigate this possibility in future research.
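To make this bottleneck concrete, the following is a minimal sketch of computing only the r leading eigenvalue-eigenvector pairs with an iterative sparse solver rather than a full eigendecomposition; the symmetric matrix B is a placeholder for the matrix whose leading eigenspace is required.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def leading_eigenpairs(B, r):
    # Lanczos-type iterative solver: computes only the r algebraically largest
    # eigenvalues of the symmetric matrix B and their eigenvectors, avoiding a
    # full O(n^3) decomposition. Requires r < n.
    vals, vecs = eigsh(B, k=r, which="LA")
    order = np.argsort(vals)[::-1]  # reorder from largest to smallest
    return vals[order], vecs[:, order]
```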