Abstract
Maximum Variance Unfolding (MVU) is among the first methods in nonlinear dimensionality reduction for data visualization and classification. It aims to preserve local data structure and in the meantime push the variance among data as big as possible. However, MVU in general remains a computationally challenging problem and this may explain why it is less popular than other leading methods such as Isomap and t-SNE. In this paper, based on a key observation that the structure-preserving term in MVU is actually the squared stress in Multi-Dimensional Scaling (MDS), we replace the term with the stress function from MDS, resulting in a model that is usable. The property of the usability guarantees the “crowding phenomenon” will not happen in the dimension reduced results. The new model also allows us to combine label information and hence we call it the supervised MVU (SMVU). We then develop a fast algorithm that is based on Euclidean distance matrix optimization. By making use of the majorization-mininmization technique, the algorithm at each iteration solves a number of one-dimensional optimization problems, each having a closed-form solution. This strategy significantly speeds up the computation. We demonstrate the advantage of SMVU on some standard data sets against a few leading algorithms including Isomap and t-SNE.
1 Introduction
Maximum Variance Unfolding (MVU) and its variants, such as Colored MVU (CMVU), are among the first methods in nonlinear Dimensionality Reduction (DR) for data visualization and classification (see Weinberger and Saul (2006); Song et al. (2007); Cox and Ferry (1993); Kim and Lee (2014)). MVU provides a nice framework that preserves the local structure of data based on its pairwise distances and enjoys excellent mathematical theory. In terms of sample theory, it is convergent and consistent under reasonable assumptions, see Arias-Castro and Pelletier (2013). In terms of optimization theory, it has a close link to duality in Semi-Definite Programming (SDP), see Sun et al. (2006). In terms of modelling, it can incorporate side information of data (e.g., labels) into its framework, as done in CMVU by using the Hilbert-Schmidt independence criterion. It hence has great potential for quality DR. In terms of its applications, however, it is less popular than other leading DR methods such as ISOMAP by Tenenbaum et al. (2000), linear discriminant analysis (LDA) (Howland and Park, 2004) and t-SNE by Van der Maaten and Hinton (2008).
The primary purpose of this paper is to improve the numerical applicability of MVU/CMVU. We achieve this by following the modelling principle of MVU and proposing a new model whose main variable is the Euclidean Distance Matrix (EDM). The resulting model can be efficiently solved by EDM optimization and appears to be very promising in competing against t-SNE in visualization. Compared with MVU variants that often rely on off-the-shelf SDP solvers, the EDM optimization is a stand-alone solver with a closed-form formula at each iteration and is easy to implement. Below we justify our proposal by conducting a critical analysis of MVU and CMVU.
1.1 MVU and CMVU
Suppose there are n items collected from a high-dimensional space. A certain type of dissimilarity (e.g., distance) may be computed for some pairs of items. Let \(\delta _{ij}\) denote such a dissimilarity measurement for item i and item j. We let \({\mathcal {N}}\) be the collection of such pairs. The purpose is to embed those items as n points \(\{\textbf{x}_i\}_{i=1}^n\) in a low-dimensional space \({\mathbb {R}}^r\) such that the Euclidean distance between \(\textbf{x}_i\) and \(\textbf{x}_j\) approximates \(\delta _{ij}\):
A key principle in MVU is that it favours the embedding with high variance, making the quantity
as big as possible. Here “\(:=\)” means “is defined as”. Under the assumption that the embedding points are centred, i.e., \(\textbf{x}_1 + \cdots + \textbf{x}_n = 0\), the variance becomes \(\sigma _X^2 = \sum _{i=1}^n \Vert \textbf{x}_i \Vert ^2\). Therefore, MVU aims to achieve (i) preserving the local distances in (1), and (ii) maximizing the variance \(\sigma _X^2\):
where \(\nu >0\) is a balance parameter between the two aims. The loss function \(\sigma _{SS}(X)\) is known as the Squared-Stress in Multi-Dimensional Scaling (MDS), see Chapter 11 of Borg and Groenen (2005).
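The centring identity used above can be checked numerically. The following NumPy sketch assumes the displayed variance takes the usual pairwise form \(\sigma _X^2 = \frac{1}{2n}\sum _{i,j} \Vert \textbf{x}_i - \textbf{x}_j \Vert ^2\) (an assumption, since the display is not reproduced here); once the points are centred, it reduces to \(\sum _i \Vert \textbf{x}_i \Vert ^2\):

```python
import numpy as np

# Numeric check: for centred points, the pairwise form of the variance
# (1/2n) * sum_{ij} ||x_i - x_j||^2 equals sum_i ||x_i||^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))          # r = 2, n = 8 embedding points as columns
X = X - X.mean(axis=1, keepdims=True)    # centre: x_1 + ... + x_n = 0

n = X.shape[1]
sq_dists = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # all ||x_i - x_j||^2
pairwise_form = sq_dists.sum() / (2 * n)
norm_form = (X ** 2).sum()               # sum_i ||x_i||^2

assert np.isclose(pairwise_form, norm_form)
```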
The benefit of employing \(\sigma _{SS}(X)\) is that the squared distance \(\Vert \textbf{x}_i - \textbf{x}_j \Vert ^2\) has a linear representation in terms of a kernel matrix:
where the Kernel matrix K is defined by \(K_{ij} = \langle \textbf{x}_i, \textbf{x}_j \rangle\) with \(\langle \cdot , \cdot \rangle\) being the standard dot product in \({\mathbb {R}}^r\). Consequently, (2) can be represented as a SDP (dropping the hidden rank constraint \(\text{ rank }(K) = r\)):
where \(\text{ Tr }(K)\) is the trace of K, \(\textbf{1}_n\) is the (column) vector of all ones, \(K \succeq 0\) means that K is positive semidefinite (i.e., being a kernel), and the constraint \(K \textbf{1}_n =0\) is the centralization condition.
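As a quick sanity check of the linear representation in (3) and of the constraints in the SDP above, both can be verified numerically (a minimal NumPy sketch with a small random embedding):

```python
import numpy as np

# With K = X^T X (kernel/Gram matrix), every squared distance is the linear
# expression K_ii + K_jj - 2 K_ij, and centring the points gives K 1_n = 0.
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 6))                      # r = 3, n = 6, points as columns
X = X - X.mean(axis=1, keepdims=True)                # centred embedding

K = X.T @ X                                          # K_ij = <x_i, x_j>
d = np.diag(K)
sq_from_K = d[:, None] + d[None, :] - 2 * K          # K_ii + K_jj - 2 K_ij
sq_direct = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)

assert np.allclose(sq_from_K, sq_direct)             # the linear representation holds
assert np.allclose(K @ np.ones(6), 0)                # centralization: K 1_n = 0
```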
Examining MVU from the perspective of maximizing the Hilbert-Schmidt independence criterion (HSIC), Song et al. (2007) cast it as maximizing the alignment between the similarity matrix K and the covariance matrix of completely independent items, which is the identity matrix I, i.e., \(\text{ Tr }(K) = \text{ Tr }(KI)\). They further generalized this view by replacing I with any positive semidefinite kernel matrix L (Lin et al., 2010), leading to the Colored MVU [Song et al. (2007) Lemma 1]:
where the matrix \(J:= I - (1/n) \textbf{1}_n \textbf{1}_n^T\) is the centering matrix. It is known that kernels can represent some prior information of data, see Song et al. (2012). We are ready to make some critical comments that motivate our model.
-
(i)
On the squared-stress loss function. An important concept in embedding theory is the usability proposed by De Leeuw (1988). A set of embedding points \(\{\textbf{x}_i\}\) is said to be usable if
$$\begin{aligned} \Vert \textbf{x}_i - \textbf{x}_j \Vert > 0 \quad \text{ if } \ \ \delta _{ij} > 0. \end{aligned}$$This prevents neighbouring points from collapsing into a single point (the degenerate embedding). CMVU does not have the usability property due to the squared-stress \(\sigma _{SS}(X)\) being used. This means that CMVU may lead to embeddings with the "crowding phenomenon" observed for other embedding methods, see Qi and Yuan (2014).
-
(ii)
On the neighbourhood size \(|{\mathcal {N}}|\). Both MVU and CMVU rely on off-the-shelf SDP solvers, whose computational complexity depends not only on the number of embedding points n but also on the size of \({\mathcal {N}}\). A typical size of 1% of n is often recommended; higher values may significantly slow down any standard SDP solver when n is in the order of thousands. Compared with ISOMAP and t-SNE, where all pairwise distances (i.e., \(|{\mathcal {N}}|= n(n-1)/2\)) are used, MVU and CMVU are not able to take advantage of the many distances that may be available among the data points, even when n is only in the order of a few hundred.
-
(iii)
On the rank constraint \(\text{ rank }(K) \le r\). The rank constraint is the reason why the SDP reformulations in (4) and (5) are relaxations. The graphical Laplacian regularization by Weinberger et al. (2007) alleviates this issue by approximating the constraint through the bottom eigenvectors of the Laplacian matrix of the graph (Yan et al., 2006). This approximation is also used in CMVU. As we will see in our numerical experiments, however, the quality of the embedding is severely compromised by it.
1.2 Supervised MVU via EDM optimization
To overcome the weakness of MVU/CMVU, we propose a new variant that follows the maximum variance principle and directly handles the computational issues discussed above. The key technical contributions are as follows.
-
(a)
Using the stress loss function. To tackle the usability issue of the squared-stress loss function, we replace it with the stress loss function:
$$\begin{aligned} \sigma _r (X):= \sum _{(i,j) \in {\mathcal {N}}} \Big ( \Vert \textbf{x}_i - \textbf{x}_j \Vert - \delta _{ij} \Big )^2, \end{aligned}$$where the Euclidean distance \(\Vert \textbf{x}_i - \textbf{x}_j \Vert\) is used rather than its squared form. It follows from (3) that the CMVU model (5) takes the following form:
$$\begin{aligned} \max _{X} \ \sum _{i,j=1}^n S_{ij} \Vert \textbf{x}_i - \textbf{x}_j \Vert ^2 - \nu \sigma _r(X), \qquad \text{ and } \qquad S:= JLJ/2. \end{aligned}$$(6)Its SDP formulation becomes
$$\begin{aligned} \max _{K} \ \text{ Tr }(JKJL) - \nu \sum _{(i,j) \in {\mathcal {N}}} \Big ( \sqrt{K_{ii} + K_{jj} - 2K_{ij}} - \delta _{ij} \Big )^2, \quad \text{ s.t. } \ \ K \succeq 0. \end{aligned}$$The square-root operation will make this SDP problem extremely difficult to solve. The major reason for using the stress loss function is that the resulting model (6) enjoys the usability property, proved in Proposition 1 below.
-
(b)
Reformulation as EDM optimization. To tackle the neighbourhood size issue, we use the Euclidean Distance Matrix (EDM) to reformulate (6). Let D be the \(n \times n\) EDM defined by \(D_{ij}:= \Vert \textbf{x}_i - \textbf{x}_j \Vert ^2\), \(i, j=1, \ldots , n\). By the theory of the classical MDS of [Cox and Cox (1991) Chp. 2], it holds
$$\begin{aligned} - \frac{1}{2} JDJ = X^T X = K, \ K_{ii} + K_{jj} - 2 K_{ij} = D_{ij} \ \text{ and } \ \text{ rank }(JDJ) = r. \end{aligned}$$(7)Therefore, the EDM formulation of (6) is
$$\begin{aligned} \max _{D} \ - \text{ Tr }(DS) - \nu \sum _{(i,j) \in {\mathcal {N}}} \Big ( \sqrt{D_{ij}} - \delta _{ij} \Big )^2, \quad \text{ s.t. } \ \ D \ \text{ is } \text{ EDM }. \end{aligned}$$(8)It is important to note that the proximal operator of the objective function in (8) can be computed in a closed-form formula, which significantly reduces the computational complexity due to the large neighbourhood size of \({\mathcal {N}}\).
-
(c)
Exact penalization of the rank constraint. With the rank constraint \(\text{ rank }(JDJ) \le r\), the EDM problem (8) is equivalent to our proposed MVU model (6). Penalty methods in dealing with rank constraints of various types of matrices have been recently studied, see Li and Qi (2011); Miao et al. (2016); Ding and Qi (2017); Zhou et al. (2018); Li et al. (2018); Sagan and Mitchell (2021); Le Thi et al. (2015). In this paper, we will propose an exact penalty for \(\text{ rank }(JDJ) \le r\). The exact penalty is a distance function, whose majorization can be cheaply constructed, leading to an efficient implementation of the overall EDM algorithm. It is noted that the squared distance was used as a penalty in Keys et al. (2019), but it is an inexact penalty.
-
(d)
Adding box constraints. The algorithmic framework developed allows us to handle simple constraints that enforce lower and upper distance bounds between any pair of embedding points \(\textbf{x}_i\) and \(\textbf{x}_j\) without incurring any extra computational cost. This feature is very convenient in modelling. For example, we may know in advance that some points have fixed distances or that their distances should stay below a certain threshold (e.g., \(\Vert \textbf{x}_i - \textbf{x}_j \Vert \le b_{ij}\)). However, adding many such constraints to CMVU (5) may significantly slow down the SDP solver involved.
To summarize, the complete model we study in this paper is stated below in terms of the minimization (rather than maximization):
where \(S:= JLJ/2\) and \(W_{ij} \ge 0\) are weights reflecting the importance of the corresponding data (e.g., \(W_{ij} =1\) for \((i, j) \in {\mathcal {N}}\) and \(W_{ij}=0\) otherwise). We also include a simple lower bound \(A_{ij}\) and upper bound \(B_{ij}\) for each \(D_{ij}\). If no such information is available, we may simply set \(A_{ij}=0\) and \(B_{ij} = \infty\). After obtaining the optimal D from (9), we use the decomposition in (7) to get the embedding points \(\textbf{x}_i\). Since we allow the prior information L to be used, we refer to our model (9) as the Supervised MVU with EDM (SMVU).
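The classical-MDS identities in (7), through which the embedding is finally recovered from the optimal D, can be verified numerically (a minimal NumPy sketch with a small random embedding):

```python
import numpy as np

# For D_ij = ||x_i - x_j||^2 and the centring matrix J = I - (1/n) 11^T,
# -JDJ/2 recovers the Gram matrix of the centred points, and rank(JDJ)
# equals the embedding dimension r.
rng = np.random.default_rng(2)
r, n = 2, 7
X = rng.standard_normal((r, n))
X = X - X.mean(axis=1, keepdims=True)

D = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # squared-distance EDM
J = np.eye(n) - np.ones((n, n)) / n

K = -0.5 * J @ D @ J
assert np.allclose(K, X.T @ X)                           # -JDJ/2 = X^T X = K
assert np.linalg.matrix_rank(J @ D @ J) == r             # rank(JDJ) = r
```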
The rest of the paper consolidates the claims made in (a)-(d) above and develops a fast solution method for SMVU (9), which, at first glance, seems to be a formidable problem to solve. The paper is organized as follows. Section 2 establishes the usability of SMVU. The exact penalty approach is developed in Sect. 3. Extensive numerical tests are reported in Sect. 4. We conclude the paper in Sect. 5. Some of the proofs are contained in the Appendix.
2 Usability of SMVU model
This section shows that the SMVU model enjoys the usability property, which the CMVU model may fail to enjoy. Ignoring the box constraint in (9), the SMVU model has an equivalent vector form:
with \(X:= [\textbf{x}_1, \ldots , \textbf{x}_n]\) and \(\textbf{x}_i \in {\mathbb {R}}^r\). We have the following result.
Proposition 1
Suppose the weight matrix W satisfies the following assumption:
Let \(X^*\) be a local minimum of (10). Then \(X^*\) is usable.
Proof
We first note that although \(\Phi (X)\) may not be differentiable at some points, it is always directionally differentiable. That is, \(\Phi '(X^{*};\, Z)\) exists for any \(Z \in {\mathbb {R}}^{r \times n}\). Since \(X^{*}\) is a local minimum, we must have
Define
We now calculate the directional derivatives \(F'(X^{*};\, Z)\) and \(G'(X^{*};\, Z)\) by
and (see also [De Leeuw (1984) Eq. 7])
It then follows from the necessary condition (11) that for any \(Z \in {\mathbb {R}}^{r \times n}\)
which implies (because \(W_{ij} \delta _{ij} q_{ij} \ge 0\) for all i, j)
By our assumption on the weight matrix W and the definition of \(q_{ij}\), we have
This is just the usability property of \(X^*\). \(\square\)
Example 1
Suppose we have four items that belong to the same class and the pairwise dissimilarities are all given
This means that \(JLJ = 0\) when L is taken to be the label matrix with \(L_{ij} = 1\) for all i, j. This further implies that the CMVU model reduces to minimizing the squared stress criterion (S-Stress) \(\sigma _S^2(X)\). The purpose is to find four one-dimensional embedding points \(X=[ x_1, \; \ldots , \; x_4]\) with \(x_i \in {\mathbb {R}}\). Let \(X^*\) be given by \(x^*_1 = -1\), \(x^*_2 =1\) and \(x^*_3=x^*_4 =0\). It can be verified that the gradient and the Hessian of \(\sigma _S^2(X)\) at \(X^*\) are
Hence, \(X^*\) is a local minimum of \(\sigma _S^2(X)\), but it has \(|x_3^* - x_4^* |= 0\) even though \(\delta _{34} = 0.1 > 0\). Consequently, the usability property does not hold for the S-Stress criterion, which implies that CMVU may not have the usability property.
3 EDM optimization: exact penalty approach
In this part, we develop an efficient computational algorithm for the SMVU model (9). With the exact penalty approach, we will be able to design a majorization–minimization scheme. Each step of this scheme will have a closed-form solution, making it possible to tackle large data sets. Our first question is which constraints are to be penalized.
3.1 Understanding the constraints
There are three types of constraints in (9). We analyze them one by one to see where the computational difficulty lies.
-
(a)
EDM constraint An important characterization of an EDM D of size \(n \times n\) is due to Schoenberg (1938):
$$\begin{aligned} D \ \text{ is } \text{ EDM } \qquad \Longleftrightarrow \qquad (-D) \in \mathcal {K}^n_+, \quad \text{ diag }(D) = 0, \end{aligned}$$where \(\mathcal {K}^n_+\) is the conditionally positive semidefinite cone:
$$\begin{aligned} \mathcal {K}^n_+= & {} \left\{ A \ \Big |\ A \ \text{ is } \ n \times n \ \ \text{ symmetric } \text{ matrix }, \ \ \textbf{x}^T A \textbf{x}\ge 0,\ \ \forall \textbf{x}, \ \sum \nolimits _{i = 1}^{n} x_i = 0 \right\} \\= & {} \left\{ A \ \Big |\ JAJ \succeq 0 \right\} . \end{aligned}$$In other words, the EDM constraint is actually a spectral conic constraint intersecting with the linear space \(\text{ diag }(D) =0\). We note that \(\mathcal {K}^n_+\) is a spectral cone because it involves eigenvalues. This constraint plays the role that \(K \succeq 0\) plays in CMVU (5). Hence, the EDM constraint is convex.
-
(b)
Rank constraint It is the most difficult part to deal with. There are a few ways to manage this constraint. For example, Qi and Yuan (2014) represent this constraint by the equation:
$$\begin{aligned} p(D):= \text{ Tr }(-JDJ) - \sum _{i=1}^r \lambda _i(-JDJ) = 0, \end{aligned}$$where \(\lambda _i\) is the ith largest eigenvalue of \((-JDJ)\). Since \((-JDJ) \succeq 0\), \(p(D) \ge 0\) for every EDM D. This function is added as a penalty to the objective. The drawback of this approach is that we still need to keep the EDM constraint, whose spectral conic part is hard to handle when n is large. Moreover, the penalty is inexact. A new idea introduced in Zhou et al. (2018) is to collect all difficult constraints together through regrouping.
-
(c)
Regrouping the constraints We collect the rank constraint and the spectral conic constraint together to define a new object called the conditional positive semidefinite cone with rank-r cut
$$\begin{aligned} \mathcal {K}^n_+(r):= \left\{ D \in \mathcal {S}^n \ \Big |\ D \in \mathcal {K}^n_+, \ \text{ rank }(JDJ) \le r \right\} . \end{aligned}$$The linear constraint \(\text{ diag }(D) =0\) is absorbed into the box constraint by letting \(A_{ii} = B_{ii} = 0\) for \(i=1, \ldots , n\). Through this regrouping, the constraints in SMVU (9) are represented as \(-D \in \mathcal {K}^n_+(r)\) and \(A \le D \le B.\) The problem itself becomes
$$\begin{aligned} \min \ f(D), \quad \text{ s.t. } \ \ -D \in \mathcal {K}^n_+(r) \qquad \text{ and } \qquad A \le D \le B. \end{aligned}$$(12) -
(d)
Dealing with \(\mathcal {K}^n_+(r)\). For a given matrix D, we compute the Euclidean distance from \((-D)\) to \(\mathcal {K}^n_+(r)\):
$$\begin{aligned} g(D):= \text{ dist } ( -D, \ \mathcal {K}^n_+(r)) = \Vert - D - \Pi _{\mathcal {K}^n_+(r)} (-D) \Vert = \Vert D + \Pi _{\mathcal {K}^n_+(r)} (-D) \Vert , \end{aligned}$$(13)where \(\Pi _{\mathcal {K}^n_+(r)} (-D)\) is a projection of \((-D)\) to \(\mathcal {K}^n_+(r)\):
$$\begin{aligned} \Pi _{\mathcal {K}^n_+(r)} (-D) \in \arg \min _{Z} \left\{ \Vert -D - Z\Vert ^2 \ \Big |\ \ Z \in \mathcal {K}^n_+(r) \right\} . \end{aligned}$$We postpone the computation of the projection to a later section. Through the distance function, we know that
$$\begin{aligned} -D \in \mathcal {K}^n_+(r) \qquad \text{ if } \text{ and } \text{ only } \text{ if } \qquad g(D) = 0. \end{aligned}$$
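Schoenberg's characterization in (a) above is easy to illustrate numerically; the following NumPy sketch checks both directions on small examples:

```python
import numpy as np

# For a genuine EDM D (squared pairwise distances), -D lies in the
# conditionally PSD cone, i.e. J(-D)J is positive semidefinite, and diag(D) = 0.
rng = np.random.default_rng(3)
n = 8
X = rng.standard_normal((3, n))                          # points in R^3
D = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
J = np.eye(n) - np.ones((n, n)) / n

eigvals = np.linalg.eigvalsh(J @ (-D) @ J)
assert np.all(eigvals >= -1e-10)                         # J(-D)J is PSD
assert np.allclose(np.diag(D), 0)

# A symmetric hollow matrix violating the triangle inequality (distances
# 1, 1, 3 as squared values 1, 1, 9) fails the spectral condition, so it
# is not an EDM:
D_bad = np.array([[0., 1., 9.], [1., 0., 1.], [9., 1., 0.]])
J3 = np.eye(3) - np.ones((3, 3)) / 3
assert np.linalg.eigvalsh(J3 @ (-D_bad) @ J3).min() < -1e-10
```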
3.2 Exact penalty approach
Based on the development above, the SMVU model (9) can be reformulated as
We assume that there exists a lower bound \(c>0\) such that \(A_{ij} \ge c\) for all \(i \not = j\). This means that we do not want any pair of \(\textbf{x}_i\) and \(\textbf{x}_j\) to be embedded in the same point. We do not need to know the value c, and this is purely a theoretical assumption so that there exists \(\kappa >0\) such that the objective function f(D) satisfies the Lipschitz condition:
where \(\kappa >0\) is called the Lipschitz constant of \(f(\cdot )\). Under this Lipschitz condition, Prop. 2.4.3 of Clarke (1990) implies that any local/global minimum of (14) is also local/global minimum of the following problem:
where \(\rho \ge \kappa\) is a penalty parameter. In other words, the problem (15) is an exact penalty for the original problem due to the distance function g(D) being used. Previously, the squared distance function \(g^2(D)\) has been used in other settings, resulting in an inexact penalty problem, where the penalty parameter \(\rho\) must be driven to infinity in order to approximate its original problem well at local minima, see Zhou et al. (2018) for some examples. The remaining part of this section is about solving the penalty problem (15).
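The difference between the exact distance penalty and the inexact squared-distance penalty is visible already on a one-dimensional toy problem (a sketch only, not the SMVU problem itself): minimize \(f(x)=x\) over \(C=[1,\infty )\), where f is 1-Lipschitz, so any \(\rho > 1\) should recover the true minimizer \(x^*=1\):

```python
import numpy as np

# Grid search over [0, 2]: f + rho*dist(x, C) recovers x* = 1 exactly, while
# f + rho*dist(x, C)^2 stops short at 1 - 1/(2*rho) for every finite rho.
xs = np.linspace(0.0, 2.0, 200001)
dist = np.maximum(1.0 - xs, 0.0)          # dist(x, C) for C = [1, inf)

rho = 2.0
exact = xs + rho * dist                   # exact penalty
inexact = xs + rho * dist ** 2            # inexact (squared) penalty

x_exact = xs[np.argmin(exact)]
x_inexact = xs[np.argmin(inexact)]

assert abs(x_exact - 1.0) < 1e-4          # exact penalty hits the constraint
assert abs(x_inexact - 0.75) < 1e-4       # minimizer is 1 - 1/(2*rho) = 0.75
```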
It is useful to note that the original objective function f(D) has a separable structure:
This motivates us to approximate g(D) also by a separable function, leading to \(n(n-1)/2\) one-dimensional optimization problems. Moreover, each of those one-dimensional problems has a closed-form solution. The approximation is through majorization.
-
(a)
Majorization of g(D). The idea behind the majorization technique is simple. Suppose we have a function \(\phi (\textbf{x}): {\mathbb {R}}^p \mapsto {\mathbb {R}}\) which is hard to minimize. Let \(\textbf{x}^k\) be the current guess of a minimum of \(\phi (\textbf{x})\). We construct a simpler majorization function at \(\textbf{x}^k\), \(\phi ^{(m)}(\cdot ; \textbf{x}^k): {\mathbb {R}}^p \mapsto {\mathbb {R}}\), satisfying the majorization property:
$$\begin{aligned} \phi ^{(m)} (\textbf{x}; \textbf{x}^k) \ge \phi (\textbf{x}), \ \ \forall \ \textbf{x}\in {\mathbb {R}}^p \qquad \text{ and } \qquad \phi ^{(m)}(\textbf{x}^k; \textbf{x}^k) = \phi (\textbf{x}^k). \end{aligned}$$(17)We then minimize the majorization function:
$$\begin{aligned} \textbf{x}^{k+1} \in \arg \min _{\textbf{x}} \; \phi ^{(m)} (\textbf{x}; \textbf{x}^k). \end{aligned}$$(18)Therefore, we have
$$\begin{aligned} \phi (\textbf{x}^{k+1}) {\mathop {\le }\limits ^{(17)}} \phi ^{(m)} (\textbf{x}^{k+1}; \textbf{x}^k) {\mathop {\le }\limits ^{(18)}} \phi ^{(m)} (\textbf{x}^{k}; \textbf{x}^k) {\mathop {=}\limits ^{(17)}} \phi (\textbf{x}^k). \end{aligned}$$This way, a decreasing sequence \(\{ \phi (\textbf{x}^k)\}\) is generated, and it must converge if bounded from below. This procedure of majorization-minimization has become very popular in machine learning algorithms, see Sun et al. (2016) for an extensive literature review on the topic. We now define a majorization function for g(D) at a given point \(Z \in \mathcal {S}^n\):
$$\begin{aligned} g_1^{(m)}(D; Z):= \left\{ \begin{array}{ll} \Vert D - Z \Vert _1 &{}\quad \text{ if } \ g(Z) = 0 \\ \frac{g^2(D)}{2g(Z)} + \frac{1}{2} g(Z) &{}\quad \text{ if } \ g(Z) \not = 0. \end{array} \right. \end{aligned}$$It is easy to verify that this function satisfies the majorization property in (17). For the case \(g(Z) =0\), we have \((-Z) \in \mathcal {K}^n_+(r)\) and
$$\begin{aligned} g(D) = \text{ dist }(-D, \mathcal {K}^n_+(r) ) \le \Vert - D - (-Z)\Vert \le \Vert D - Z\Vert _1 = g_1^{(m)}(D; Z). \end{aligned}$$For the case \(g(Z) \not =0\), we simply apply the inequality \(a^2+b^2 \ge 2ab\) for any \(a, b \in {\mathbb {R}}\) to get \(g(D) \le g_1^{(m)}(D; Z)\). Finally, it is obvious that \(g_1^{(m)}(Z; Z) = g(Z)\) for both cases. At this point, we cannot say that \(g_1^{(m)}(D; Z)\) is a simple majorization because \(g^2(D)\) is still involved. Fortunately, it can be further majorized by a simple convex quadratic function.
-
(b)
Majorization of \(g^2(D)\). We define a new function
$$\begin{aligned} h(D): = \Vert D\Vert ^2 - g^2(D). \end{aligned}$$(19)The following has been proved by Qi and Yuan (2014) Prop. 3.4:
$$\begin{aligned} h(D) \ \text{ is } \text{ convex } \quad \text{ and } \quad -2 \Pi _{\mathcal {K}^n_+(r)} (-D) \in \partial h(D). \end{aligned}$$(20)By the subgradient inequality for convex functions, we have
$$\begin{aligned} h(D) \ge h(Z) + \langle -2 \Pi _{\mathcal {K}^n_+(r)} (-Z), \ D-Z \rangle , \quad \forall D, Z \in \mathcal {S}^n. \end{aligned}$$Consequently, we have
$$\begin{aligned} g^2(D) &= \Vert D\Vert ^2 - h(D) \\ &\le \underbrace{\Vert D\Vert ^2 - h(Z) + \langle 2 \Pi _{\mathcal {K}^n_+(r)} (-Z), \ D-Z \rangle }_{\text{ majorization } \text{ of } g^2(D) \,\hbox{at}\, Z}, \end{aligned}$$which, together with \(g_1^{(m)}(D;\, Z)\) above, leads to our final majorization function for g(D):
$$g^{(m)}(D; Z) = \left\{ \begin{array}{ll} \Vert D - Z \Vert _1 &\quad \text{ if } \ g(Z) = 0,\\ \frac{1}{2g(Z)} \Big ( \Vert D\Vert ^2 - h(Z) + \langle 2 \Pi _{\mathcal {K}^n_+(r)} (-Z), \ D-Z \rangle \Big ) + \frac{1}{2} g(Z) &\quad \text{ if } \ g(Z) \not = 0. \end{array} \right.$$(21)We now define a majorization for the penalty problem (15):
$$\begin{aligned} f^{(m)}_\rho (D; Z) = f(D) + \rho g^{(m)}(D; Z) \end{aligned}$$and the majorization–minimization procedure gives rise to the following update on the current iterate \(D^k\):
$$\begin{aligned} D^{k+1} = \arg \min _D \ f^{(m)}_\rho (D; D^k), \qquad \text{ s.t. } \quad A \le D \le B. \end{aligned}$$(22)It is not hard to see that the majorization problem has a separable structure and is equivalent to \(n(n-1)/2\) one-dimensional problems. We derive closed-form solutions for those problems below.
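The majorization–minimization mechanism in (17)-(18), including the \(a^2+b^2 \ge 2ab\) trick used above to majorize g(D), can be illustrated on a scalar toy problem (a hedged sketch; the function \(\phi\) below is illustrative and not part of the SMVU model):

```python
import numpy as np

# For z != 0, a^2 + b^2 >= 2ab gives |x| <= x^2/(2|z|) + |z|/2 with equality
# at x = z. Using this to minimize phi(x) = |x - 3| + 0.1 x^2 (a toy stand-in
# for f + rho*g) produces a monotonically decreasing sequence phi(x^k).
def phi(x):
    return abs(x - 3.0) + 0.1 * x * x

x = 10.0                                    # initial guess with x - 3 != 0
values = [phi(x)]
for _ in range(50):
    w = abs(x - 3.0)                        # current |x^k - 3|
    if w == 0.0:
        break
    # minimize the majorization (x-3)^2/(2w) + w/2 + 0.1 x^2 (convex quadratic):
    # stationarity (x-3)/w + 0.2 x = 0 gives the closed-form update below
    x = 3.0 / (1.0 + 0.2 * w)
    values.append(phi(x))

# the MM chain guarantees monotone descent
assert all(v1 >= v2 - 1e-12 for v1, v2 in zip(values, values[1:]))
```

The iterates converge to the true minimizer \(x=3\), and each step costs only a closed-form update, which is the same reason the separable subproblems in (22) are cheap.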
3.3 Solution of the majorization problem (22)
In this part, we derive the closed-form formula for the solution of (22). Depending on how the majorization is defined, we have two cases to consider.
(a) Case 1: \(g(D^k) = 0\). For this case, we have
where \(\mathcal {R}_k\) is independent of D. Hence, computing \(D^{k+1}\) is equivalent to solving the following \(n(n-1)/2\) one-dimensional problems for \(i <j\), \(j=2, \ldots , n\)
Each of those problems is equivalent to finding an optimal solution of the following one-dimensional problem:
where \(\alpha >0\), \(\beta >0\), \(\gamma >0\) and \(\eta \ge 0\), \(a\ge 0\), \(b \ge 0\) are given constants with \(\eta \in [a, b]\). This is a convex problem, and it is easy to find its optimal solution.
Proposition 2
Define the following two quantities:
Then the optimal solution of (24) is given by
Consequently, the optimal \(D^{k+1}\) in (23) is given by
(b) Case 2: \(g(D^k) \not =0\). In this case, we have
where \(\mathcal {R}_k\) is a term independent of D, and
Hence, computing \(D^{k+1}\) is equivalent to solving the following \(n(n-1)/2\) one-dimensional problems for \(i <j\), \(j=2, \ldots , n\)
Each of those problems is equivalent to finding an optimal solution of the following one-dimensional problem:
where \(\omega \in {\mathbb {R}}\), \(\lambda \ge 0\), \(a\ge 0\), \(b \ge 0\) are given constants. It is a convex problem and its optimal solution admits a closed form.
Proposition 3
[Zhou et al. (2020) Prop. 2] Define the following quantities:
and
Then the optimal solution of (28) is given by
Consequently, the update in (27) is given by
(c) Computing the projection \(\Pi _{\mathcal {K}^n_+(r)} (-D^k)\). In both formulae (25) and (29), we need to compute this projection in order to calculate \(g_k\) and \(\widehat{D}^k\). It turns out to be the most time-consuming part of our algorithm: based on our numerical experience, about 80% of the CPU time is spent, on average, on computing this projection, which requires an eigenvalue decomposition. We briefly describe it below; the details can be found in [Zhou et al. (2018) Section 2].
Suppose \(Z \in \mathcal {S}^n\) has the following spectral decomposition:
where \(\lambda _1 \ge \lambda _2 \ge \ldots \ge \lambda _n\) are the eigenvalues of \(JZJ\) in non-increasing order, and \(\textbf{p}_i\), \(i=1, \ldots , n\) are the corresponding orthonormal eigenvectors. We define a PCA-style matrix truncated at r:
Then we have
We note that we do not need to compute a full eigenvalue decomposition of JZJ. It is sufficient to compute the r leading eigenvalues and the corresponding eigenvectors of JZJ. When r is small (e.g., \(r=2\) or 3 for visualization), the computation is much faster than a full eigenvalue-eigenvector decomposition.
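The PCA-style truncation described above can be sketched as follows (a NumPy sketch; for clarity it uses a full eigendecomposition, whereas in practice only the r leading eigenpairs are computed by a Lanczos-type solver, and the complete projection formula is the one given in Zhou et al. (2018) Section 2):

```python
import numpy as np

# PCA-style truncation at r: keep the r leading nonnegative eigenpairs of JZJ.
def pca_truncate(Z, r):
    n = Z.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    lam, P = np.linalg.eigh(J @ Z @ J)            # eigh returns ascending order
    lam, P = lam[::-1], P[:, ::-1]                # reorder: lambda_1 >= ... >= lambda_n
    keep = np.maximum(lam[:r], 0.0)               # r leading eigenvalues, clipped at 0
    return (P[:, :r] * keep) @ P[:, :r].T         # sum_i keep_i * p_i p_i^T

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 6))
Z = A + A.T                                       # a random symmetric matrix
T = pca_truncate(Z, r=2)
assert np.linalg.matrix_rank(T) <= 2              # respects the rank-r cut
assert np.all(np.linalg.eigvalsh(T) >= -1e-10)    # PSD part only
```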
3.4 SMVU algorithm and its convergence
Now we are ready to describe our SMVU method in Algorithm 1.
The convergence analysis of Algorithm 1 in the two cases \(g(D^k) = 0\) and \(g(D^k) > 0\) is as follows.
(a) Case 1: \(g(D^k) = 0\).
With \(g^{(m)}(D^{k+1};\, D^{k}) = \Vert D^{k+1} - D^{k} \Vert _1\) and (22), we have
and
Under the Lipschitz condition
and \(\kappa < \rho\), we have \(D^{k+1} = D^{k}\).
(b) Case 2: \(g(D^k) > 0\). The optimality condition of (22) implies that
With the convexity of f(D),
and the following equations and inequality mentioned in Qi and Yuan (2014) and Zhou et al. (2018)
a chain of inequalities can be deduced as:
It can be seen from (39) that \(\{f_{\rho }(D^{k})\}\) is non-increasing, and it is bounded below by its definition. Therefore, as \(k \rightarrow \infty\), \(\rho \frac{\Vert D^{k+1} - D^{k} \Vert ^2}{g(D^{k})} \rightarrow 0\). Since the elements of \(D^{k}\) are bounded, \(g(D^{k}) < +\infty\), and hence \(\Vert D^{k+1} - D^{k} \Vert ^2 \rightarrow 0\).
As for the computational complexity of the SMVU algorithm, the cost of the majorization function \(g^{(m)}\) in (21) is dominated by the projection \(\Pi _{\mathcal {K}^n_+(r)} (-D)\); therefore the overall complexity of each iteration is about \(\mathcal {O}(rn^{2} )\) (Zhou et al., 2018).
4 Numerical experiments
The main purpose of this part is to demonstrate the effectiveness of the SMVU-EDM model in data visualization and classification. A desktop with 16 GB memory and an Intel(R) Core(TM) i7-10750 2.59 GHz CPU, as well as the IRIDIS5 High-Performance Computing Facility with associated support services at the University of Southampton, are used, and the codes are developed in MATLAB (R2020a). This section covers the datasets and benchmark methods, the parameter setting and tuning scheme, and the numerical comparison.
4.1 Data sets and benchmark methods
In this section, the effectiveness of SMVU-EDM is demonstrated on diverse types of data, including RGB images, grayscale images, binary images and categorical sequential data. The results are compared with four benchmark methods: coloured MVU (CMVU), linear discriminant analysis (LDA) and WeightedIso (Vlachos et al., 2002) (supervised methods), and t-SNE (Van der Maaten and Hinton, 2008) (an unsupervised method).
-
(a)
CIFAR-10 Dataset (Krizhevsky and Hinton, 2009). The CIFAR-10 dataset is a set of \(32\times 32\) RGB colour images in 10 classes, including aeroplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. We will compute 2-dimensional visualizations of one of the batches in this dataset, in which there are a total of 10,000 images, and each class contains the same number of samples.
-
(b)
Splice-junction Gene Sequences Dataset (Asuncion and Newman, 2007). The splice-junction gene (SJG) sequences dataset consists of 3190 DNA sequences that have been classified into three types: sequences containing an exon-intron boundary (EI), sequences containing an intron-exon boundary (IE), and sequences containing neither type of splice junction.
-
(c)
EMNIST Dataset (Cohen et al., 2017). The EMNIST dataset consists of \(28\times 28\) handwritten characters including both uppercase and lowercase letters. This research uses the "Letters" split, where the uppercase letters and the corresponding lowercase letters are merged into the same classes; thus there are 26 classes in this split of the EMNIST dataset. In the following part, we provide 2-dimensional visualizations of 20,800 samples with each class containing the same number of samples.
-
(d)
Medical MNIST Datasets (Yang et al., 2021). These datasets contain pre-processed medical images of size \(28\times 28\). The images in Yang et al. (2021) are standardized, and this pre-processing lets us assume that Euclidean distances between images are comparable. The data used include the RGB dermatoscope image dataset ("DermaMNIST"), grayscale abdominal CT image datasets ("OrganAMNIST", "OrganCMNIST"), and the RGB blood cell microscope image dataset ("BloodMNIST").
-
(e)
Anticancer Peptides Sequences Dataset (Grisoni et al., 2019). The anticancer peptides sequences (ACPs) dataset stores one-letter amino-acid sequences labelled by their anticancer activity on breast and lung cancer cell lines. The sub-dataset targeting breast cancer contains 949 amino-acid sequence observations, and the one for lung cancer consists of 901 observations. These sequences have various lengths and are classified into four classes: "active", "moderately active", "experimental inactive", and "virtual inactive".
Besides the comparison with CMVU, t-SNE (Van der Maaten and Hinton, 2008), linear discriminant analysis (LDA) and weighted ISOMAP (WeightedIso) (Vlachos et al., 2002) are included in the following visualization and numerical comparison. For eigen-decomposition methods such as SMVU-EDM, CMVU, WeightedIso and LDA, we adopt the scheme mentioned in Song et al. (2007): the output dimension r is set to 10 and the first two or three dimensions are visualized. The EMNIST and CIFAR-10 datasets include a huge number of samples, which is challenging for most existing dimensionality reduction methods. From the medical MNIST collection, we only provide the 2D visualizations for the "BloodMNIST" dataset; the numerical results for the remaining datasets are listed in Table 1. The criteria for selecting benchmark methods are their capability of handling the big data sets and yielding quality visualizations. The codes of t-SNE, LDA and WeightedIso are provided by the Matlab toolbox for dimensionality reduction (Van der Maaten et al., 2007). The initial setting in t-SNE is the default one in MATLAB, e.g., the exaggeration is set to 4. The code of CMVU is provided by Song et al. (2007).
4.2 Quality measurement of dimensionality reduction
There are two desirable features for a successful dimensionality reduction method. One is that it yields a visualization with a pronounced cluster structure, and the other is that the inherent geometric characteristics are preserved to a certain degree. Correlation (Gracia et al., 2014) is commonly employed to measure dimensionality reduction quality. In our case, it is computed between the vectorized version of the original distance matrix \(\Delta\) and the matrix D obtained by a model. Since the dimensionality reduction result is produced via supervision, in this paper between-class structures and within-class structures are measured separately. For pairs of points (i, j), we write \(i\sim j\) if i and j belong to the same class, and denote the set of same-class pairs by \(\mathbb {P}_{ \text {inClass}} = \{(i,j)\,|\,i\sim j , 1\le i,j \le n \}\) and the set of different-class pairs by \(\mathbb {P}_{ \text {beClass}} = \{(i,j)\,|\,i \not \sim j , 1\le i,j \le n \}\). Accordingly, the sets of distances \(\{ \delta _{ij}\,|\,1\le i,j\le n\}\) and \(\{ D_{ij}\,|\,1\le i,j\le n\}\) can be divided into in-class distance sets \(\delta _{\text {inClass}} = \{\delta _{ij} \,|\,(i,j)\in \mathbb {P}_{ \text {inClass}} \}\), \(D_{\text {inClass}} = \{D_{ij} \,|\,(i,j)\in \mathbb {P}_{ \text {inClass}} \}\), and between-class distance sets \(\delta _{\text {beClass}} = \{\delta _{ij} \,|\,(i,j)\in \mathbb {P}_{ \text {beClass}} \}\), \(D_{\text {beClass}} = \{D_{ij} \,|\,(i,j)\in \mathbb {P}_{ \text {beClass}} \}\).
We take the Pearson correlation coefficient between \(\delta _{\text {inClass}}\) and \(D_{\text {inClass}}\), \(Corr_\text {inClass} = Corr(\delta _{\text {inClass}}, D_{\text {inClass}})\), and the one between \(\delta _{\text {beClass}}\) and \(D_{\text {beClass}}\), \(Corr_\text {beClass} = Corr(\delta _{\text {beClass}}, D_{\text {beClass}})\), as quality measurements of how well the dimensionality reduction methods preserve geometric characteristics (Espadoto et al., 2019; Paul and Chalup, 2017; Kalousis et al., 2004). These correlations may be regarded as global measurements of how well the distance ordering is preserved.
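As a concrete illustration, \(Corr_\text {inClass}\) and \(Corr_\text {beClass}\) can be computed in a few lines of NumPy; the sketch below (function and variable names are ours, not from the paper) assumes `Delta` and `D` are symmetric \(n\times n\) distance matrices and `labels` holds the class of each point:

```python
import numpy as np

def class_correlations(Delta, D, labels):
    """Pearson correlations between original and embedded distances,
    split into same-class (inClass) and different-class (beClass) pairs."""
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)  # each unordered pair once
    same = labels[i] == labels[j]
    corr_in = np.corrcoef(Delta[i, j][same], D[i, j][same])[0, 1]
    corr_be = np.corrcoef(Delta[i, j][~same], D[i, j][~same])[0, 1]
    return corr_in, corr_be
```

Using the upper triangle avoids counting each symmetric pair twice, which would not change the correlations but doubles the work.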
Locally, we apply the local rank correlation coefficient (LRCorr) to measure how well a dimensionality reduction method reconstructs the local distance rank structure. Let \(\delta ^{i} := \{\delta _{i1}, \delta _{i2}, \ldots , \delta _{in}\}\) be the sequence of distances from the ith point in the original data, and \(D^{i} := \{D_{i1}, D_{i2}, \ldots , D_{in}\}\) be the sequence of distances from the ith point in the dimensionality reduction result. Furthermore, let \(\delta ^{i}_{\text {inClass}} := \{\delta _{ij}\,|\,\delta _{ij} \in \delta ^{i}, i\sim j\}\) and \(D^{i}_{\text {inClass}} := \{D_{ij}\,|\,D_{ij} \in D^{i}, i\sim j\}\); the local in-class distance rank correlation coefficient of the ith point can then be written as \(LRCorr^{i}_{\text {inClass}} = LRCorr(\delta ^{i}_{\text {inClass}},D^{i}_{\text {inClass}})\). Similarly, for the between-class situation, with \(\delta ^{i}_{\text {beClass}} := \{\delta _{ij}\,|\,\delta _{ij} \in \delta ^{i}, i\not \sim j\}\) and \(D^{i}_{\text {beClass}} := \{D_{ij}\,|\,D_{ij} \in D^{i}, i\not \sim j\}\), we have the local between-class distance rank correlation coefficient of the ith point as \(LRCorr^{i}_{\text {beClass}} = LRCorr(\delta ^{i}_{\text {beClass}},D^{i}_{\text {beClass}})\). We then take the average of all the \(LRCorr^{i}_{\text {inClass}}\) and that of all the \(LRCorr^{i}_{\text {beClass}}\) as respective measurements of the quality of local in-class and between-class distance rank structure preservation.
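The text does not pin down which rank correlation is used; assuming Spearman's rank correlation, the averaged LRCorr could be sketched as follows (all names are ours):

```python
import numpy as np
from scipy.stats import spearmanr

def mean_local_rank_corr(Delta, D, labels, in_class=True):
    """Average over all points i of the rank correlation between the
    original and embedded distances from i to its same-class
    (in_class=True) or different-class (in_class=False) neighbours."""
    labels = np.asarray(labels)
    scores = []
    for i in range(len(labels)):
        mask = (labels == labels[i]) if in_class else (labels != labels[i])
        mask[i] = False  # never compare a point with itself
        if mask.sum() < 2:  # a rank correlation needs >= 2 neighbours
            continue
        rho, _ = spearmanr(Delta[i, mask], D[i, mask])
        scores.append(rho)
    return float(np.mean(scores))
```

Since rank correlations are invariant under monotone transforms, any embedding that preserves the ordering of distances from each point scores 1 exactly.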
Finally, the Silhouette coefficient is used to measure whether the supervised dimensionality reduction result has a well-separated structure. The Silhouette coefficient of a point is based on its mean intra-cluster distance and its mean distance to the nearest other cluster. Denoting by s(i) the Silhouette coefficient of the ith point, the average of s(i) over all points is the overall Silhouette coefficient we adopt to assess the dimensionality reduction result.
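For completeness, the overall Silhouette coefficient can be computed from a precomputed distance matrix as below; this is a minimal sketch (names are ours) that assumes every class has at least two members:

```python
import numpy as np

def silhouette(D, labels):
    """Mean Silhouette coefficient over all points, computed from a
    precomputed distance matrix D and class labels."""
    labels = np.asarray(labels)
    n = len(labels)
    s = []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()                 # mean intra-cluster distance
        b = min(D[i, labels == c].mean()     # mean distance to the nearest
                for c in np.unique(labels)   # other cluster
                if c != labels[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))
```

A value near 1 indicates tight, well-separated clusters; values near 0 indicate overlapping clusters.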
4.3 On implementation of SMVU-EDM
Fig. 1 Dimensionality reduction quality comparison for different distance types. This comparison is based on the CIFAR dataset (o is the order of the power of the Euclidean distance and p is the order of the Minkowski distance; in this experiment, we set \(\nu = 1.1\) in SMVU-EDM, \(\nu = 1\) in CMVU, \(Perp = 30\) in t-SNE and \(\lambda = 0.1\) in WeightedIso) (Color figure online)
Fig. 2 Dimensionality reduction quality comparison for the key tuning parameter of each method (the dissimilarity used in this experiment is the Euclidean distance; for a clear comparison, we project the tuning parameters of the different methods onto a uniform projected value (PV) ranging from 0.1 to 10: for SMVU-EDM, \(PV = \nu - 1\); for CMVU, \(PV = \nu\); for t-SNE, \(PV = Perp/10\); for WeightedIso, \(PV = \lambda \times 10\)) (Color figure online)
This part describes how we implemented SMVU-EDM, including its stopping criterion, the choice of dissimilarity, the balancing parameter \(\nu\) and the penalty parameter \(\rho\).
-
(a)
Stopping criterion. Following Zhou et al. (2018), for the SMVU-EDM algorithm we adopt a double convergence criterion that terminates the algorithm if
$$\begin{aligned} \eta _{f}(D^{k-1},D^{k}) = \frac{f_{\rho }(D^{k-1})-f_{\rho }(D^{k})}{1+f_{\rho }(D^{k-1})} \le \epsilon _1 \end{aligned}$$ (40)
and
$$\begin{aligned} \eta _{\kappa }(D^{k}) = \frac{2g(D^{k})}{\Vert JD^{k}J\Vert ^{2}} \le \epsilon _2, \end{aligned}$$ (41)
for given \(\epsilon _{1} >0\) and \(\epsilon _{2}>0\). In our implementation, we used \(\epsilon _{1}=\sqrt{n}\times 10^{-3}\) and \(\epsilon _{2}=10^{-3}\), where n is the size of the training set.
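Assuming J is the centering matrix \(J = I - \textbf{1}\textbf{1}^{T}/n\) and that the solver supplies the objective values \(f_{\rho}(D^{k})\) and the feasibility measure \(g(D^{k})\), the double test (40)-(41) might be coded as follows (a sketch under these assumptions, not the authors' implementation):

```python
import numpy as np

def should_stop(f_prev, f_curr, g_curr, Dk, eps1, eps2):
    """Double stopping test of Eqs. (40)-(41): relative decrease of the
    penalized objective and a normalized feasibility residual."""
    n = Dk.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix (assumed)
    eta_f = (f_prev - f_curr) / (1.0 + f_prev)
    eta_k = 2.0 * g_curr / np.linalg.norm(J @ Dk @ J) ** 2
    return bool(eta_f <= eps1 and eta_k <= eps2)
```

Both conditions must hold at the same iterate, so the algorithm stops only when the objective has stalled and the iterate is nearly feasible.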
-
(b)
Choice of dissimilarity measurement. SMVU-EDM, CMVU, t-SNE and WeightedIso all take pairwise dissimilarities between observations as input. There is always a question as to which dissimilarities are the most appropriate. There is no universal answer, as it depends on the properties of the algorithms concerned and the characteristics of the data to be analyzed (Wang et al., 2005; Peng et al., 2019; Ting et al., 2019). To explore how different types of dissimilarity affect the output of different methods, we conduct tests on a randomly picked subset of the CIFAR dataset of size 1000 and on the anticancer peptides sequences (breast cancer) dataset, representing numerical data sets and categorical sequential data sets, respectively. For numerical data sets like the CIFAR dataset, distances including the Minkowski distance (\(\delta _{ij}^{\text {Minkowski}-p} = \Vert \textbf{x}_{i} - \textbf{x}_{j}\Vert _{p}\)) and powers of the Euclidean distance (\(\delta _{ij}^{o} = \Vert \textbf{x}_{i} - \textbf{x}_{j}\Vert _{2}^{o}\)) are applicable. The distances used in this experiment are the Minkowski distance (Yu et al., 2008) of order 3 (p = 3), the city-block distance (p = 1), the Chebyshev distance (p = \(\infty\)) and powers of the Euclidean distance of order o = 0.15, 0.25, 0.5, 1, 2. For categorical sequential data sets, the MATLAB function seqpdist is used with its default setting (the Jukes-Cantor distance, Jukes and Cantor (1969)), and powers of this sequential distance of different orders (o = 0.15, 0.25, 0.5, 1, 2) are compared.
We randomly sample the CIFAR data set, run each algorithm ten times, and average the results. In this experiment, we set \(\nu = 1.1\) in SMVU-EDM, \(\nu = 1\) in CMVU, \(Perp = 30\) in t-SNE and \(\lambda = 0.1\) in WeightedIso. In-class correlation, between-class correlation and the Silhouette coefficient are taken as quality measurements. As shown in Fig. 1, with the Euclidean distance (o = 1), the squared Euclidean distance (o = 2), the city-block distance (p = 1) and the Minkowski distance of order 3 (p = 3), SMVU-EDM has relatively higher in-class and between-class correlations. With smaller orders of the Euclidean distance, both correlations decrease. Nevertheless, even when the Silhouette coefficient rises above 0.9, which happens when the order o is 0.25 or less, the in-class and between-class correlations remain acceptable. As for t-SNE, when the order o decreases, the in-class correlation, between-class correlation and Silhouette coefficient stay steady. A Minkowski distance of higher order leads to a slight increase of the Silhouette coefficient for t-SNE at the price of decreased correlation. The order of the dissimilarity also affects the performance of WeightedIso: when its Silhouette coefficient is greater than 0.9, the decrease of the in-class correlation is significant, and so is that of the between-class correlation.
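The two families of numerical dissimilarities used above are straightforward to generate; a minimal NumPy sketch (function names are ours) is:

```python
import numpy as np

def power_euclidean(X, o=1.0):
    """delta_ij = ||x_i - x_j||_2 ** o for a data matrix X (rows = points)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1)) ** o

def minkowski(X, p=3.0):
    """delta_ij = ||x_i - x_j||_p; p=1 is city-block, p=inf is Chebyshev."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if np.isinf(p):
        return diff.max(-1)
    return (diff ** p).sum(-1) ** (1.0 / p)
```

Raising the Euclidean distance to an order o < 1 compresses large distances relative to small ones, which is consistent with the observed trade-off between separation (Silhouette) and correlation.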
-
(c)
Choice of balancing parameter \(\nu\). The key tuning parameters of the different methods are compared in Fig. 2, including the balancing parameter \(\nu\) of SMVU-EDM, the perplexity Perp of t-SNE, the balancing parameter \(\nu\) of CMVU and the weighting parameter \(\lambda\) of WeightedIso. Different choices of \(\nu\) for SMVU-EDM emphasize the HSIC part at different levels and lead to different dimensionality reduction results. As shown in Fig. 2, when \(\nu\) increases, the between-class correlation stays steady while the in-class correlation increases by around \(10\%\). Overall, with a relatively small \(\nu\) in the range between 1.1 and 2, the SMVU-EDM model produces a well-separated result with relatively superior between-class structure preservation and acceptable in-class structure preservation. With respect to WeightedIso, when the weighting parameter \(\lambda\) is between 0.01 and 0.2, the result is well separated, but both the in-class and between-class correlations are significantly reduced. It can also be noticed that when the perplexity is greater than 10, the performance of t-SNE becomes steady. As for CMVU, Fig. 2 reveals that tuning its balancing parameter does not lead to a significantly separated result.
With the above analysis, we set up the experiment scheme as follows. Since tuning the balancing parameter \(\nu\) of SMVU-EDM, the order of the dissimilarity and the weighting parameter \(\lambda\) of WeightedIso helps us obtain visualizations with a high Silhouette coefficient, we tune these parameters so as to compare embeddings from SMVU-EDM and WeightedIso with similar Silhouette scores.
Since the result is steady when the perplexity is large enough, we set the perplexity Perp of t-SNE to 30. As for CMVU, the balancing parameter is set to 1, and the number of nearest neighbours is \(1\%\) of the total size of the dataset.
-
(d)
Choice of penalty parameter \(\rho\). According to Prop. 2.4.3 of Clarke (1990), when the penalty parameter \(\rho\) is greater than the Lipschitz rank of f(D), the algorithm converges to a point on the EDM cone. With the derivative \(\frac{\partial f(D)}{\partial D_{ij}}\big |_{D_{ij}^{k} }= \nu W_{ij} \left( 1-\frac{\delta _{ij}}{\sqrt{ D_{ij}^{k}}} \right) + S_{ij}\), the scheme for choosing \(\rho\) is as follows:
-
1.
For the initial point \(D^{0}\), we compute its projection \(D_{P}^{0}\) onto the EDM cone. The initial \(\rho\) is set to \(\Vert \nabla f(D_{P}^{0})\Vert\).
-
2.
Since \(\Vert \nabla f(D^{k}) \Vert\) theoretically decreases as the algorithm converges, at every step we check whether \(\Vert \nabla f(D^{k}) \Vert\) is greater than the current value of \(\rho\). If it is not, the current value of \(\rho\) is kept; otherwise, \(\rho\) is updated to \(\Vert \nabla f(D^{k}) \Vert\).
The elements of the weight matrix W in this experiment are set to be equal (i.e., \(W_{ij}=1\) for all i, j).
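Assuming \(\nabla f\) is assembled elementwise from the derivative given above (all names in this sketch are ours, and S is the supervision matrix from the SMVU objective), the two steps of the scheme reduce to:

```python
import numpy as np

def gradient_f(Dk, delta, W, S, nu):
    """Elementwise derivative df/dD_ij = nu*W_ij*(1 - delta_ij/sqrt(D_ij)) + S_ij,
    evaluated at the off-diagonal entries of D^k (diagonal kept at zero)."""
    n = Dk.shape[0]
    off = ~np.eye(n, dtype=bool)
    G = np.zeros_like(Dk)
    G[off] = nu * W[off] * (1.0 - delta[off] / np.sqrt(Dk[off])) + S[off]
    return G

def update_rho(rho, grad):
    """rho is only ever increased: keep it at least ||grad f(D^k)||."""
    return max(rho, np.linalg.norm(grad))
```

Keeping \(\rho\) at or above the gradient norm is what ties the scheme to the Lipschitz-rank condition of Clarke (1990).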
4.4 Visualization comparison
-
(a)
Cifar Data. The visualization results on the Cifar data, shown in Figs. 3 and 4, reveal the strong performance of SMVU-EDM compared with the rest of the methods. As shown in the previous section and in Figs. 1 and 2, the power of the Euclidean distance with order 0.25 and the setting \(\nu = 1.1\) lead SMVU-EDM to produce quality results on the sampled Cifar dataset; we adopt this setting on the whole Cifar dataset. In Fig. 3, trucks (brownish-red) and automobiles (Han blue) are clustered together, and this group is separated from the other clusters, revealing that images of trucks and automobiles are relatively more similar to each other than to other categories. Images of aeroplanes (dark purple) and ships (persimmon), which represent machines, also have clear boundaries against the clusters of animal images, including horses (deep saffron), frogs (sunglow), dogs (lime), deer (bright green), and cats (blue-green). Among the animal images, the group of dogs overlaps with those of birds and deer, but the cluster of cats is well separated from the others. Even though there are overlaps, the clear boundaries of the clusters reveal the relations between different categories. Meanwhile, the distributions of different categories can be identified from the results of CMVU and t-SNE; for example, trucks (brownish-red) and automobiles (Han blue) are clustered together. However, the boundaries of the categories are unclear. As for WeightedIso, referring to the experiment in the previous section, we set the weighting parameter \(\lambda = 0.2\) and use the Euclidean distances. From the visualization of WeightedIso in Fig. 4, it can be noticed that the cluster of trucks and automobiles is clear and separated from the rest of the objects, but the remaining categories are mixed.
In Fig. 5, we extract the points of "ship", "horse", "deer", and "cat" from the visualizations. For SMVU-EDM, the clusters of horse and deer overlap to some extent, but the boundaries around the clusters of cat and ship are clear. Meanwhile, cats have the largest distance to the group of horses. Overall, cluster relations such as overlaps, closeness with a clear boundary, and large inter-cluster distances reveal the (dis-)similarity between the different groups. In contrast, in all four other visualizations, the points of these four classes scatter over the whole area.
-
(b)
Splice-junction Gene Sequences Dataset. Since pairwise Jukes-Cantor distances are taken as input, only the distance-based methods, namely SMVU-EDM, CMVU, WeightedIso and t-SNE, are applied in this experiment. As presented in Fig. 6, t-SNE fails to extract the inherent distribution of the splice-junction gene sequences, whereas SMVU-EDM, WeightedIso and CMVU can cluster the three classes. In this experiment, we set \(\nu = 1.3\) in SMVU-EDM and \(\lambda = 0.6\) in WeightedIso; this setting lets SMVU-EDM and WeightedIso produce results with similar Silhouette coefficients. The three categories have clear boundaries in the results of SMVU-EDM and WeightedIso, but in that of CMVU, the clusters of "EI" and "Neither" are merged. It is interesting to note that an outlier appears in the result of WeightedIso. It is a point in the "EI" category (coloured green), and we highlight it as a bold dot, together with the corresponding points in all other visualizations. Compared with WeightedIso, in the result of SMVU-EDM the outlier appears at a much more reasonable distance from the main cluster of "EI". This suggests that SMVU-EDM is more robust to outliers than WeightedIso.
-
(c)
EMNIST Data. In Figs. 7 and 8, we show the visualizations of the EMNIST data. We apply SMVU-EDM to the Euclidean distance raised to the power 0.15, with \(\nu = 1.1\). The boundaries of the clusters are clear in the result of SMVU-EDM. At the macroscopic level, the varying closeness between the classes gives information on their relative similarity. In the lower-left part, "o" and "c" are close to each other and relatively far away from the upper-right part, which consists of the letters "v", "x", "y", "h", "f" and "t"; this indicates a significant difference between these two groups. For WeightedIso, we set \(\lambda = 0.3\). As shown in Fig. 8, the boundaries of its clusters are also clear; however, outliers frequently appear around the clusters, and consequently the cluster boundaries are expanded by the outliers. Some of the clusters in the result of t-SNE are distinctive, while other clusters have unclear boundaries; in particular, some points are assigned to the wrong clusters and some categories are split into separate parts. In the result of CMVU, points belonging to different categories are vaguely clustered together, but it is hard to identify boundaries and the relationships between categories. As for the result of LDA, the points are concentrated in the centre.
-
(d)
Blood MNIST Dataset. Visualizations of the blood MNIST data are shown in Fig. 9. We set \(\nu = 1.1\) and the order of dissimilarity \(o=0.5\) for SMVU-EDM, and \(\lambda = 0.3\) and \(o=1\) for WeightedIso. With this setting, the results of SMVU-EDM and WeightedIso have similar Silhouette coefficients, 0.9334 and 0.9175 respectively. The visualizations produced by SMVU-EDM, t-SNE, CMVU and WeightedIso all show a clear cluster of platelets (dark brown). The clusters of neutrophil images are also clear in the visualizations of SMVU-EDM, t-SNE and WeightedIso; furthermore, SMVU-EDM and WeightedIso map the neutrophil images into clusters significantly separated from the others. As for the remaining categories, the clusters are ribbon-shaped in the visualizations of both SMVU-EDM and t-SNE. The cluster of lymphocyte images overlaps with that of erythroblast, as do the clusters of basophil and monocyte. On the other hand, the clusters in the visualization of WeightedIso are diamond-shaped, but most of them are isolated.
4.5 Numerical comparison
This part complements the visualization results by reporting detailed numerical results of the tested methods on all 9 data sets. In particular, we report the correlation coefficients for both inClass and beClass, the Silhouette coefficient, the local rank coefficients and the run time (in seconds). The details are explained below.
In Table 1, the numerical results for all 9 datasets are recorded. We note that mLRCorr denotes the mean of the in-class ("inClass") or between-class ("beClass") local rank correlations over all points. Correspondingly, in Fig. 10, box-plots present the distributions of the local rank correlations of the points.
Looking at the overall performance, SMVU-EDM is better than the other methods at simultaneously producing well-separated results and preserving the quality of the dimensionality reduction. On the larger datasets, such as "Blood MNIST", "Derma MNIST" and "OrganA MNIST", the results of SMVU-EDM have the highest Silhouette coefficient, in-class correlation and between-class correlation. As for time consumption, CMVU is the fastest, but the run time of SMVU-EDM is comparable to or less than those of t-SNE and LDA.
As for WeightedIso, we tune the weighting parameter \(\lambda\) so that the Silhouette coefficients of its results are at the same level as those of SMVU-EDM. We underline the Silhouette coefficients of SMVU-EDM and WeightedIso together to indicate this tuning.
In some cases, WeightedIso produces results with quality close to that of SMVU-EDM, such as on the SJG sequence and Blood MNIST data; in some other cases, such as Cifar, EMNIST and OrganC MNIST, SMVU-EDM slightly outperforms WeightedIso. However, WeightedIso takes far longer to terminate than the other methods. In particular, on the "OrganA MNIST" dataset, WeightedIso could not produce results within the 60-hour time cap (indicated by "−" in the table).
In Fig. 10, the distributions of the local rank correlations are presented as box-plots. Compared with the other methods, SMVU-EDM preserves the local distance rank structure at a better level. Most of the points in Cifar, Blood MNIST, Derma MNIST, OrganA MNIST and OrganC MNIST have high between-class local rank correlations (the minimums of the boxes are greater than 0.7). The in-class local rank correlations of SMVU-EDM are higher than those of the other methods in some cases, such as Blood MNIST, OrganA MNIST and OrganC MNIST.
5 Conclusion
This paper proposed a new variant of the well-known MVU method for dimensionality reduction. It begins with the observation that the objective function in MVU is the squared stress function from the field of multidimensional scaling. We used an example to demonstrate that the squared stress function does not enjoy the usability property, the property that prevents points at small distances from being crushed to even smaller ones and hence avoids the often observed crowding phenomenon. We replaced the squared stress function with the stress function and proved that the usability property holds for the new model. The new model also allows us to include label information, resulting in a supervised MVU. The model can be efficiently solved by Euclidean distance matrix optimization, and the algorithm derived from the majorization-minimization method converges within acceptable time. We demonstrated both the quality of the model solution and of the algorithm against a few leading dimensionality reduction methods on various data sets. We conclude that SMVU-EDM is an effective method for supervised dimensionality reduction and data visualization.
We would also like to point out that, being a spectral method, SMVU-EDM has a computational bottleneck in calculating the r leading eigenvalue-eigenvector pairs of a large matrix. There exist techniques that are potentially useful in reducing this computational complexity. One is to approximate the true eigenspace by the eigenvalue-eigenvector decomposition of a smaller matrix; this could be done via Laplacian regularization, as suggested in Weinberger et al. (2007). Here the technique would be applied to the Euclidean distance matrix instead of a kernel matrix. We will investigate this possibility in future research.
Availability of data and materials
The data that support the findings of this study are openly available at: (a) CIFAR-10 Dataset https://www.cs.toronto.edu/~kriz/cifar.html (Krizhevsky and Hinton, 2009); (b) Splice-junction Gene Sequences Dataset https://archive-beta.ics.uci.edu/dataset/69/molecular+biolo (Asuncion and Newman, 2007); (c) EMNIST Dataset https://www.nist.gov/itl/products-and-services/emnist-dataset (Cohen et al., 2017); (d) Medical MNIST Datasets https://medmnist.com/ (Yang et al., 2021); (e) Anticancer Peptides Sequences Dataset https://archive.ics.uci.edu/ml/datasets/Anticancer+peptides (Grisoni et al., 2019).
Code availability
The code for this study is available at https://github.com/dy1n16/SMVU-EDM.git.
References
Arias-Castro, E., & Pelletier, B. (2013). On the convergence of maximum variance unfolding. Journal of Machine Learning Research, 14(7).
Asuncion, A., & Newman, D. (2007). UCI machine learning repository. CA, USA: Irvine.
Borg, I., & Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. Berlin: Springer.
Clarke, F. H. (1990). Optimization and Nonsmooth Analysis. SIAM, pp. 51–52.
Cohen, G., Afshar, S., Tapson, J., & Van Schaik, A. (2017). EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 2921–2926). IEEE.
Cox, T. F., & Cox, M. A. (1991). Multidimensional scaling on a sphere. Communications in Statistics-Theory and Methods, 20(9), 2943–2953.
Cox, T. F., & Ferry, G. (1993). Discriminant analysis using non-metric multidimensional scaling. Pattern Recognition, 26(1), 145–153.
De Leeuw, J. (1984). Differentiability of Kruskal’s stress at a local minimum. Psychometrika, 49(1), 111–113.
De Leeuw, J. (1988). Convergence of the majorization method for multidimensional scaling. Journal of Classification, 5(2), 163–180.
Ding, C., & Qi, H.-D. (2017). Convex optimization learning of faithful Euclidean distance representations in nonlinear dimensionality reduction. Mathematical Programming, 164(1), 341–381.
Espadoto, M., Martins, R. M., Kerren, A., Hirata, N. S., & Telea, A. C. (2019). Toward a quantitative survey of dimension reduction techniques. IEEE Transactions on Visualization and Computer Graphics, 27(3), 2153–2173.
Gracia, A., González, S., Robles, V., & Menasalvas, E. (2014). A methodology to compare dimensionality reduction algorithms in terms of loss of quality. Information Sciences, 270, 1–27.
Grisoni, F., Neuhaus, C. S., Hishinuma, M., Gabernet, G., Hiss, J. A., Kotera, M., & Schneider, G. (2019). De novo design of anticancer peptides by ensemble artificial neural networks. Journal of Molecular Modeling, 25(5), 1–10.
Howland, P., & Park, H. (2004). Generalizing discriminant analysis using the generalized singular value decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8), 995–1006.
Jukes, T. H., & Cantor, C. R. (1969). Evolution of protein molecules. Mammalian Protein Metabolism, 3, 21–132.
Kalousis, A., Gama, J., & Hilario, M. (2004). On data and algorithms: Understanding inductive performance. Machine Learning, 54(3), 275–312.
Keys, K. L., Zhou, H., & Lange, K. (2019). Proximal distance algorithms: Theory and practice. The Journal of Machine Learning Research, 20(1), 2384–2421.
Kim, K., & Lee, J. (2014). Sentiment visualization and classification via semi-supervised nonlinear dimensionality reduction. Pattern Recognition, 47(2), 758–768.
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Le Thi, H. A., Le, H. M., & Pham Dinh, T. (2015). Feature selection in machine learning: an exact penalty approach using a difference of convex function algorithm. Machine Learning, 101(1), 163–186.
Li, Z., Nie, F., Chang, X., Nie, L., Zhang, H., & Yang, Y. (2018). Rank-constrained spectral clustering with flexible embedding. IEEE Transactions on Neural Networks and Learning Systems, 29(12), 6073–6082.
Lin, Y.-Y., Liu, T.-L., & Fuh, C.-S. (2010). Multiple kernel learning for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6), 1147–1160.
Li, Q., & Qi, H.-D. (2011). A sequential semismooth newton method for the nearest low-rank correlation matrix problem. SIAM Journal on Optimization, 21(4), 1641–1666.
Miao, W., Pan, S., & Sun, D. (2016). A rank-corrected procedure for matrix completion with fixed basis coefficients. Mathematical Programming, 159(1), 289–338.
Paul, R., & Chalup, S. K. (2017). A study on validating non-linear dimensionality reduction using persistent homology. Pattern Recognition Letters, 100, 160–166.
Peng, Q., Rao, N., & Zhao, R. (2019). Covariance-based dissimilarity measures applied to clustering wide-sense stationary ergodic processes. Machine Learning, 108(12), 2159–2195.
Qi, H.-D., & Yuan, X. (2014). Computing the nearest Euclidean distance matrix with low embedding dimensions. Mathematical Programming, 147(1), 351–389.
Sagan, A., & Mitchell, J. E. (2021). Low-rank factorization for rank minimization with nonconvex regularizers. Computational Optimization and Applications, 79(2), 273–300.
Schoenberg, I. J. (1938). Metric spaces and positive definite functions. Transactions of the American Mathematical Society, 44(3), 522–536.
Song, L., Smola, A. J., Borgwardt, K. M., & Gretton, A. (2007). Colored maximum variance unfolding. In Advances in Neural Information Processing Systems (pp. 1385–1392).
Song, L., Smola, A., Gretton, A., Bedo, J., & Borgwardt, K. (2012). Feature selection via dependence maximization. Journal of Machine Learning Research, 13(5).
Sun, Y., Babu, P., & Palomar, D. P. (2016). Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Transactions on Signal Processing, 65(3), 794–816.
Sun, J., Boyd, S., Xiao, L., & Diaconis, P. (2006). The fastest mixing Markov process on a graph and a connection to a maximum variance unfolding problem. SIAM Review, 48(4), 681–699.
Tenenbaum, J. B., De Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Ting, K. M., Zhu, Y., Carman, M., Zhu, Y., Washio, T., & Zhou, Z.-H. (2019). Lowest probability mass neighbor algorithms: Relaxing the metric constraint in distance-based neighborhood algorithms. Machine Learning, 108(2), 331–376.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579–2605.
Van der Maaten, L., Postma, E. O., & van den Herik, H. J. (2007). Matlab toolbox for dimensionality reduction. Maastricht: Maastricht University, MICC.
Vlachos, M., Domeniconi, C., Gunopulos, D., Kollios, G., & Koudas, N. (2002). Non-linear dimensionality reduction techniques for classification and visualization. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 645–651
Wang, L., Zhang, Y., & Feng, J. (2005). On the Euclidean distance of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1334–1339.
Weinberger, K. Q., Sha, F., Zhu, Q., & Saul, L. K. (2007). Graph Laplacian regularization for large-scale semidefinite programming. In Advances in Neural Information Processing Systems, pp. 1489–1496
Weinberger, K. Q., & Saul, L. K. (2006). An introduction to nonlinear dimensionality reduction by maximum variance unfolding. AAI, 6, 1683–1686.
Yang, J., Shi, R., Wei, D., Liu, Z., Zhao, L., Ke, B., Pfister, H., & Ni, B. (2021). Medmnist v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification. arXiv preprint arXiv:2110.14795
Yan, S., Xu, D., Zhang, B., Zhang, H.-J., Yang, Q., & Lin, S. (2006). Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 40–51.
Yu, J., Amores, J., Sebe, N., Radeva, P., & Tian, Q. (2008). Distance learning for similarity estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3), 451–462.
Zhou, S., Xiu, N., & Qi, H.-D. (2018). A fast matrix majorization–projection method for penalized stress minimization with box constraints. IEEE Transactions on Signal Processing, 66(16), 4331–4346.
Zhou, S., Xiu, N., & Qi, H.-D. (2020). Robust Euclidean embedding via EDM optimization. Mathematical Programming Computation, 12(3), 337–387.
Funding
This work was partially supported by the Departmental Project P0044200.
Author information
Authors and Affiliations
Contributions
All authors (DY, H-DQ) contributed equally to each part of this work.
Corresponding author
Ethics declarations
Ethical approval
Informed consent for publication of this paper was obtained from all authors.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Tijl De Bie.
Publisher's Note
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Proof of Proposition 2
We note that \(\psi (x)\) is a strictly convex function. Hence it has a unique minimum. We consider two cases.
Case 1: \(x \in [a, \eta ]\). For this case,
Its minimum depends on the sign of \((\alpha - \gamma )\). If \(\alpha - \gamma \le 0\), the minimizer of \(\psi (x)\) is \(x^* = \eta\) because \(\psi (x)\) is strictly decreasing over \([a, \eta ]\). If \(\alpha - \gamma > 0\), the optimality condition is \(\psi '(\bar{x}^*) =0\), which gives
The minimum of \(\psi (\cdot )\) over \([a, \eta ]\) for the case \(\alpha - \gamma >0\) is given by
Therefore, \(T_1(\alpha , \beta , \gamma , \eta , a, b)\) defined in Proposition 2 is the minimum of \(\psi (x)\) over \([a, \eta ]\).
Case 2: \(x \in [\eta , b]\). For this case,
Solving the equation \(\psi '(\bar{x}^*) =0\) yields
The minimum of \(\psi (\cdot )\) over \([\eta , b]\) is given by
This is \(T_2(\alpha , \beta , \gamma , \eta , a, b)\) defined in Proposition 2. Combining the two cases yields the optimal solution \(\mathcal {T}(\alpha , \beta , \gamma , \eta , a, b)\).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yang, D., Qi, HD. Supervised maximum variance unfolding. Mach Learn (2024). https://doi.org/10.1007/s10994-024-06553-8
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s10994-024-06553-8