A unified probabilistic framework for robust manifold learning and embedding
Abstract
This paper focuses on learning a smooth skeleton structure from noisy data—an emerging topic in the fields of computer vision and computational biology. Many dimensionality reduction methods have been proposed, but none are specially designed for this purpose. To achieve this goal, we propose a unified probabilistic framework that directly models the posterior distribution of data points in an embedding space so as to suppress data noise and reveal the smooth skeleton structure. Within the proposed framework, a sparse positive similarity matrix is obtained by solving a box-constrained convex optimization problem, in which the sparsity of the matrix represents the learned neighborhood graph and the positive weights stand for the new similarities. Embedded data points are then obtained by applying maximum a posteriori (MAP) estimation to the posterior distribution expressed by the learned similarity matrix. The embedding process naturally provides a probabilistic interpretation of Laplacian eigenmap and maximum variance unfolding. Extensive experiments on various datasets demonstrate that our proposed method obtains embedded points that accurately uncover inherent smooth skeleton structures for data visualization, and that it yields superior clustering performance compared to various baselines.
Keywords
Dimensionality reduction · Probabilistic model · Manifold embedding · Structure learning
1 Introduction
Real-world data, such as images and molecular profiles, is usually high-dimensional. In order to adequately handle such high-dimensional data, dimensionality reduction is usually used to transform the data from a high-dimensional space into a reduced representation, which ideally corresponds to the intrinsic dimensionality of the data (Fukunaga 2013). It also facilitates data visualization and the classification of high-dimensional data by mitigating the curse of dimensionality and other undesirable properties (Jimenez and Landgrebe 1998).
During the last two decades, a plethora of dimensionality reduction methods have been proposed (van der Maaten et al. 2009; Burges 2009). Most of them aim to preserve certain information within the data. Principal component analysis (PCA; Jolliffe 1986), a classic method for this purpose, learns a subspace linearly spanned over some orthonormal bases by minimizing the reconstruction error (Burges 2009). However, the complex structure of data may be misrepresented by the linear manifold constructed using PCA. To overcome this issue, Kernel PCA (KPCA; Schölkopf et al. 1999) first maps the original space to a reproducing kernel Hilbert space (RKHS) using a kernel function, and then performs PCA in the RKHS. Hence, KPCA is a nonlinear generalization of traditional PCA, but its performance critically depends on the choice of the kernel function.
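As a concrete illustration of the KPCA pipeline just described, the following is a minimal sketch: build a Gram matrix, center it in feature space, and keep the top eigenvectors. The RBF kernel and its gamma parameter are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def rbf_kernel(Y, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||y_i - y_j||^2)."""
    sq = np.sum(Y ** 2, axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0, None)
    return np.exp(-gamma * d2)

def kpca(Y, d=2, gamma=1.0):
    """Kernel PCA: center the Gram matrix in feature space, then
    project onto the top-d eigenvectors (the nonlinear analogue of PCA)."""
    N = Y.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    Kc = H @ rbf_kernel(Y, gamma) @ H
    vals, vecs = np.linalg.eigh(Kc)            # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:d]
    # scale eigenvectors by sqrt of eigenvalues to get embedded coordinates
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```

The dependence of the result on `gamma` is exactly the kernel-selection difficulty noted above.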
Several works have been proposed for model selection of the kernel function. The Gaussian process latent variable model (GPLVM; Lawrence 2005) achieves a nonlinear generalization of probabilistic PCA (PPCA; Tipping and Bishop 1999) and learns a kernel function defined on a set of variables in a low-dimensional latent space by maximizing the log-likelihood of the observed data in terms of the covariance matrix of a Gaussian process (Rasmussen 2006). However, the objective function of GPLVM is highly non-convex, so optimization is easily trapped in local optima. Maximum variance unfolding (MVU; Weinberger et al. 2004) bypasses the challenge of choosing a kernel function by directly learning a nonparametric kernel matrix that retains the pairwise distances encoded in a neighborhood graph constructed from the input data. Several variants of MVU have also been proposed, such as relaxations via inequality constraints (MVUineq; Weinberger and Saul 2006) or an \(\ell _2\)-norm over slacks of distance differences (MVU2; Weinberger and Saul 2006), and landmark MVU (\(\ell \)MVU; Weinberger et al. 2005).
Another family of dimensionality reduction methods is sparse spectral manifold learning, which aims to find a manifold that is close to the intrinsic structure of the data. Similar to MVU, a neighborhood graph has to be provided in advance, but only the pairwise similarities over the edges of the given graph are considered to approximate the manifold. Given a neighborhood graph and a kernel function, Laplacian eigenmap (LE; Belkin and Niyogi 2001) finds a mapping in which the distances between a data point and its neighbors are minimized. The authors provide a theoretical analysis of the Laplacian matrix using graph spectral theory. However, LE also faces the difficulty of selecting kernel functions. Locally linear embedding (LLE; Saul and Roweis 2003) preserves local geometry by learning a sparse similarity matrix, based on the assumption that local patches over k-nearest neighbors are nearly linear and overlap with one another to form a manifold.
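A minimal Laplacian-eigenmap sketch is given below: build a k-NN graph, form the graph Laplacian, and embed using its smallest non-trivial eigenvectors. The 0/1 connectivity weights, the choice of k, and the normalized Laplacian are illustrative assumptions (LE as described above uses kernel weights).

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh

def laplacian_eigenmap(Y, d=2, k=10):
    """Embed Y into d dimensions via the spectrum of a k-NN graph Laplacian."""
    # symmetrized k-NN adjacency with 0/1 weights (heat-kernel weights also common)
    W = kneighbors_graph(Y, n_neighbors=k, mode='connectivity')
    W = 0.5 * (W + W.T)
    L = laplacian(W, normed=True).toarray()
    vals, vecs = eigh(L)                      # eigenvalues in ascending order
    # skip the trivial constant eigenvector, keep the next d
    return vecs[:, 1:d + 1]
```

Both the graph (k) and the weights have to be fixed in advance, which is the limitation the proposed framework removes by learning the similarity matrix from the data.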
Recently, datasets with smooth skeleton structures have been emerging from the fields of computer vision (Weinberger and Saul 2006; Song et al. 2007) and computational biology (Curtis et al. 2012; Sun et al. 2014). For example, human cancer is a dynamic disease that develops over an extended time period. Once initiated from a normal cell, the advance to malignancy can, to some extent, be considered to have a complex branching structure (Greaves and Maley 2012). It has become very important to unveil this dynamic process using massive molecular profile data, which generally lies in a high-dimensional space corresponding to tens of thousands of genes (Curtis et al. 2012). More importantly, the data is generally noisy because not all genes provide useful expressions related to the cancer. Thus, the dynamic progression structure is generally assumed to reside in a low-dimensional space (Sun et al. 2014). Moreover, the visualization of cancer progression structure is also valuable for downstream analysis, e.g., providing critical insights into the process of disease and informing the development of diagnostics, prognostics and targeted therapeutics. Datasets with this type of structure have also been widely studied in principal curve learning, but only for curve structures (Hastie and Stuetzle 1989; Kégl et al. 2000). However, a curve is not an appropriate representation of complex structures such as loops, bifurcations, and multiple disconnected components.
Although the above methods work well under certain conditions, they lack a unified probabilistic framework to robustly learn a smooth skeleton structure from noisy data. Probabilistic models such as PPCA and GPLVM can deal with noisy data, but they find it difficult to model the neighborhood manifold. On the other hand, methods based on the neighborhood manifold, such as the MVUs, LE and LLE, either find it hard to learn the manifold structure of a smooth skeleton, or cannot simultaneously be interpreted as probabilistic models for model selection and handling missing data. Thus, a dimensionality reduction method specially designed for uncovering skeleton structures in complex forms from noisy high-dimensional datasets is desperately needed. The main contributions of this paper are as follows.

– We propose a novel probabilistic dimensionality reduction framework that can simultaneously suppress data noise and uncover a smooth skeleton structure, in which a prior distribution is used to model the noise and expected inequality constraints over pairwise distances of embedded points impose smooth skeleton structures.
– Under the proposed framework, the posterior distribution has an analytic expression with a sparse positive similarity matrix representing a weighted neighborhood graph, so that both the skeleton structure and the similarity function are automatically adapted from the data.
– The embedded points are given by the MAP estimate of the learned posterior distribution. Solving the MAP estimation problem is consistent with the optimization problem of LE, and gives a natural explanation for the use of KPCA for embedding.
– The resulting optimization problem for learning a sparse positive similarity matrix is convex with a box constraint, and its objective function is continuous and differentiable. Thus, a globally optimal solution exists, and the problem can be efficiently solved by fast, large-scale, off-the-shelf optimization tools.
– Extensive experiments have been conducted from the perspectives of data visualization and clustering. Results on five synthetic datasets and nine real-world datasets demonstrate that the proposed method can correctly recover smooth skeleton structures from noisy data for data visualization, and also achieves better clustering performance than various existing methods.
2 Related work
Our proposed method is a probabilistic model that can automatically learn a smooth skeleton structure from a given observed dataset with noise. Although many dimensionality reduction methods have been proposed in the literature, most of them are not suitable for our purpose. In this section, we mainly discuss in detail two methods that are closely related to our proposed method—MVU (Weinberger et al. 2004) and maximum entropy unfolding (MEU; Lawrence 2012).
3 Maximum posterior manifold embedding
3.1 Motivation
We are interested in learning a smooth skeleton from noisy data. Informally, a smooth skeleton is a special skeleton structure of manifolds that is embedded in a low-dimensional space of the observed data. Noisy data may have noise added in the original feature space or inserted as irrelevant dimensions. These special structures have been widely studied in principal curve learning for curve structures (Hastie and Stuetzle 1989; Kégl et al. 2000). However, curves are not appropriate representations of complex structures such as loops, bifurcations, and multiple disconnected components. In this paper, we further generalize skeleton structures to these complex forms. In nonlinear dimensionality reduction methods, the structure of the embedded data might differ from the structure of the original data (in the case of 2D or 3D datasets, which can be visualized naturally), but the smooth skeleton structure should be retained while the noise is removed or alleviated.
Figure 1 shows two real-world datasets. The first is a collection of teapot images viewed from different angles (Weinberger et al. 2005). Each image contains \(76 \times 101\) RGB pixels, so the pixel space has a dimension of 23,028, but the intrinsic structure has only one degree of freedom: the angle of rotation, as shown in Fig. 1a. The second concerns human cancer, a dynamic disease that develops over an extended time period. Once initiated from a normal cell, the advance to malignancy can, to some extent, be considered a Darwinian process—a multistep evolutionary process—that responds to selective pressure (Greaves and Maley 2012). The disease progresses through a series of clonal expansions that result in tumor persistence and growth, and ultimately the ability to invade surrounding tissue and metastasize to distant organs. As shown in Fig. 1b, the evolution trajectories inherent in cancer progression are complex and branching (Greaves and Maley 2012). It has become critically important to uncover the progression path from massive molecular profile data (Sun et al. 2014).
In order to learn a set of embedded points that form a smooth skeleton from observed data, the manifold assumption over the embedded points is quite appropriate. In general, a sparse neighborhood graph over the observed data points, e.g., a k-nearest neighbor graph, is manually crafted and used to approximate the manifold of the data. However, it is challenging to build a good neighborhood graph with a fixed k if the local density around each observed data point varies greatly. Moreover, a smooth manifold cannot tolerate noise in the observed data if we preserve deterministic distances as strictly as MVU does, since the distances themselves are not reliable. To overcome these issues, we resort to a probabilistic model that represents distances in a flexible fashion so as to learn a smooth skeleton structure that best approximates the true smooth skeleton.
3.2 Probabilistic modeling assumptions
We define \({\mathbbm {y}}=\{\mathbf {y}_i \}_{i=1}^N\) as a set of observed data points in \({\mathbb {R}}^D\). The distance between any two points \(\mathbf {y}_i\) and \(\mathbf {y}_j\) can be computed either in the Euclidean space as \(\phi _{i,j} = \Vert \mathbf {y}_i - \mathbf {y}_j \Vert _2^2\) or in the RKHS \(\mathcal {H}\) as \(\phi _{i,j} = \Vert \varphi (\mathbf {y}_i) - \varphi (\mathbf {y}_j) \Vert _{\mathcal {H}}^2 = K_{i,i} + K_{j,j} - 2 K_{i,j}\), where \(K(\mathbf {y}_i, \mathbf {y}_j) = \langle \varphi (\mathbf {y}_i), \varphi (\mathbf {y}_j) \rangle _{\mathcal {H}}\) is a kernel function. Let the corresponding embedded data \({\mathbbm {x}}\) of \({\mathbbm {y}}\) be a set of points \(\{\mathbf {x}_i\}_{i=1}^N\) in an embedding space \({\mathbb {R}}^d\) with \(\mathbf {x}_i = [x_i^1, \ldots , x_i^d]^T\).
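The two ways of computing \(\phi _{i,j}\) can be sketched as follows; with the linear kernel \(K = YY^T\) the RKHS distance reduces to the Euclidean one, which gives a quick sanity check (the vectorized implementation is an editorial sketch, not code from the paper).

```python
import numpy as np

def pairwise_sq_dists(Y):
    """phi[i, j] = ||y_i - y_j||_2^2 in the input Euclidean space."""
    sq = np.sum(Y ** 2, axis=1)
    return np.clip(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0, None)

def rkhs_sq_dists(K):
    """phi[i, j] = K_ii + K_jj - 2 K_ij for a given Gram matrix K."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K
```

With `K = Y @ Y.T` (linear kernel), `rkhs_sq_dists(K)` coincides with `pairwise_sq_dists(Y)`; any other kernel measures the distance in its induced feature space instead.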
The expected distances of these points with respect to the posterior distribution are much more flexible than the deterministic distances used in MVU, since both \({\mathbbm {x}}\) and \(p({\mathbb {F}} \mid {\mathbbm {y}})\) are variables to optimize. Moreover, this flexibility is further strengthened by the inequality constraints, under which these distances need not be strictly preserved. These constraints not only tolerate noise, but also result in a sparse neighborhood graph (see Sect. 3.3). More importantly, the smooth skeleton we seek can be found thanks to the high flexibility of the embedded points, since two embedded points can move close to each other if the distance between the two observed data points is small according to (1). By contrast, the deterministic constraints used in MVU are strictly preserved, so MVU cannot achieve the same effect. As a result, the not-strictly-preserved constraints allow the embedded points to move flexibly to form a smooth manifold, and make it feasible for the dimensionality reduction method to automatically adapt the neighborhood manifold to the data. We provide a detailed discussion from a duality perspective in Sect. 3.3.
3.3 Problem formulation via probabilistic modeling
Based on the assumptions discussed in Sect. 3.2, we propose to directly estimate a posterior distribution of embedded data points \({\mathbbm {x}}\) in a lowdimensional space \({\mathbb {R}}^d\) by minimizing the KL divergence between the posterior distribution and a prior distribution with a set of constraints in terms of expected pairwise distances. This modeling technique has been widely used to learn a posterior distribution from data for problems such as classification (Jebara 2001), structured output prediction (Zhu and Xing 2009) and multiple kernel learning (Mao et al. 2015).
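Schematically, the estimation problem takes the following form. This is a sketch reconstructed from the quantities used in this section (the slack variables \(\xi _{i,j}\), the trade-off parameter C and the prior \(p_0\) appear in problem (2), which is stated in full elsewhere in the paper), not a verbatim copy of problem (2):

```latex
\min_{p(\mathbb{F} \mid \mathbbm{y}),\ \xi \ge 0}\;
  \mathrm{KL}\!\left( p(\mathbb{F} \mid \mathbbm{y}) \,\middle\|\, p_0(\mathbb{F}) \right)
  \;+\; C \sum_{i,j} \xi_{i,j}
\quad \text{s.t.}\quad
\sum_{k=1}^{d} \int \bigl( x_i^k - x_j^k \bigr)^2
  \, p(\mathbf{f}_k \mid \mathbbm{y}) \,\mathrm{d}\mathbf{f}_k
  \;\le\; \phi_{i,j} + \xi_{i,j} \quad \forall\, i,j .
```

The dual variables of the expected-distance constraints are exactly the entries \(w_{i,j}\) of the similarity matrix discussed next.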
Although matrix \(\mathbf {W}\) is introduced as a dual variable for solving problem (2), it has several interesting properties. (i) According to (6), the expected distance is strictly preserved if \(w_{i,j}>0\), up to a tolerance \(\xi _{i,j}\). (ii) According to (10), \(w_{i,j}\) is small if \(\phi _{i,j}\) is large. Thus, the dual variable \(w_{i,j}\) can be interpreted as a similarity between the embedded points \(\mathbf {x}_i\) and \(\mathbf {x}_j\). (iii) Optimizing \(w_{i,j}\) leads to a sparse positive similarity matrix, where the optimal solution \(\mathbf {W}\) is sparser if C is larger. This property can be explained by the KKT conditions at the optimum of (2). Specifically, according to condition (5), if \(w_{i,j} = 0\), we have \(\beta _{i,j} = C\). Condition (7) then implies that \(\xi _{i,j} = 0\). According to (6), we have \(\sum _{k=1}^d \int (x_i^k - x_j^k )^2 p(\mathbf {f}_k \mid {{\mathbbm {y}}} ) \,\text {d} \mathbf {f}_k < \phi _{i,j} \), which means that the distance between the embedded points \(\mathbf {x}_i\) and \(\mathbf {x}_j\) must be smaller than \(\phi _{i,j}\). That is, \(w_{i,j}=0\) leads to a shrinkage of the distance. C is the regularization parameter of \(\sum _{i,j} \xi _{i,j}\) in (2). By minimizing the \(\ell _1\)-norm over the \(\xi _{i,j}\), more of the \(\xi _{i,j}\) are driven to zero as C increases. As a result, the larger C is, the sparser \(\mathbf {W}\) is likely to be.
Given the above properties, we are now ready to explain, from the duality perspective of problem (2), why the proposed model can recover a skeleton structure from noisy data. First, our model is a probabilistic model represented by a posterior density function, which incorporates a prior distribution with precision \(\lambda \) to capture the noise of the latent embedded points. Second, the expected pairwise distances in terms of the posterior density function are more robust than deterministic pairwise distances, so distance information can be preserved even for noisy data points. Third, the penalty term and the inequality distance constraints impose sparsity on \(\mathbf {W}\), so that many distances are not preserved but instead shrink. The degree of shrinkage depends on the distance between the original data points: if the distance between two original points is large, the shrinkage is also large. Combining these three factors, a large pairwise distance caused by noise tends to shrink so as to approximate the inherent distance. If the data has an inherent skeleton structure, our model can correctly uncover it via this shrinkage effect on noisy distances.
3.4 Sparse positive similarity matrix learning
3.5 Embedding via MAP estimation
Our embedding process is similar to either LE or KPCA, but the proposed framework provides a novel way to automatically learn a sparse positive similarity matrix \(\mathbf {W}\) from a set of pairwise distances, and this sparse positive similarity matrix is purposely designed for learning the embedded points. This also provides a probabilistic interpretation of why MVU applies KPCA as the embedding method after learning a kernel matrix. The pseudocode of our proposed maximum posterior manifold embedding (MPME) is given in Algorithm 1. Solving problem (13) takes approximately \(O(N^{2.37})\) time for computing the log-determinant and the inverse of matrix \(\mathbf {Q}\) at each iteration of the L-BFGS-B solver. Solving (17) takes the \(O(N^3)\) time of KPCA. Thus, the time complexity of Algorithm 1 is on the order of \(O(N^3)\), which is the same as that of most spectral methods, but much lower than that of the SDP used in MVU.
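The overall mechanics of this pipeline—a box-constrained L-BFGS-B solve for \(\mathbf {W}\) followed by an LE-style spectral embedding—can be sketched as follows. The toy objective inside `fit_similarity` is a stand-in assumption (the paper's problem (13) involves a log-determinant term over \(\mathbf {Q}\), which is not reproduced here); it is chosen only to exercise the \(0 \le w_{i,j} \le C\) box constraint and the shrinkage behavior, where larger \(\phi _{i,j}\) yields smaller \(w_{i,j}\).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import eigh

def fit_similarity(phi, C=10.0, lam=1.0):
    """Box-constrained solve for a symmetric similarity matrix W (toy objective)."""
    N = phi.shape[0]
    iu = np.triu_indices(N, k=1)               # optimize upper-triangular entries

    def obj(w):
        # stand-in smooth objective: large distances suppress similarity weights
        return lam * np.sum(w * phi[iu]) - np.sum(np.log1p(w))

    res = minimize(obj, x0=np.full(iu[0].size, 0.5), method='L-BFGS-B',
                   bounds=[(0.0, C)] * iu[0].size)
    W = np.zeros((N, N))
    W[iu] = res.x
    return W + W.T

def embed(W, d=2):
    """LE-style embedding: smallest non-trivial eigenvectors of the graph Laplacian."""
    L = np.diag(W.sum(axis=1)) - W
    vals, vecs = eigh(L)
    return vecs[:, 1:d + 1]
```

For the toy objective the optimum is \(w_{i,j} = \max (0, 1/(\lambda \phi _{i,j}) - 1)\) clipped to \([0, C]\), so entries with large \(\phi _{i,j}\) vanish, mimicking the sparsity pattern discussed in Sect. 3.3.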
3.6 Discussion
Our proposed MPME method takes several key components from existing dimensionality reduction methods such as preserving distances from MVU and the probabilistic modeling of embedded points from PPCA and MEU. However, our method is very different. In the following, several key differences are discussed in detail by comparing our method to MVU and MEU.
3.6.1 Comparison with MVU
First, the variables to be optimized are different. Our model learns a sparse similarity matrix, while MVU learns a dense kernel with a positive semidefinite constraint. Second, the objective functions are different. Our model maximizes the posterior distribution of the latent variables in a Bayesian way, while MVU maximizes the variance of the latent data points in a deterministic way. Third, the constraints are different. Our model defines expected inequality constraints (1) and (2) with error tolerance, while MVU imposes strict equality constraints. As discussed in Zhu and Xing (2009), a model with expected inequality constraints can robustly tolerate inaccurate pairwise distances. It is worth noting that \(\ell \)MVU uses inequality constraints, but they are treated as a relaxation from the optimization perspective; MVUineq and MVU2 also adopt different relaxations of the equality constraints. In contrast, the incorporation of the prior distribution and the expected pairwise distances in our proposed method is extremely helpful for learning a smooth skeleton structure from noisy data.
3.6.2 Comparison with MEU
One key difference is that our framework directly models the posterior distribution \(p(\mathbf {X} \mid {{\mathbbm {y}}} )\) of the latent data, while MEU models the density \(p(\mathbf {Y})\) of the observed data. As a result, MEU has to assume that the data features are i.i.d. However, this assumption is hardly satisfied if feature correlation exists. By contrast, our model assumes that the reduced features in the latent space are i.i.d., which is more reasonable than the assumption used in MEU, since the latent space is generally assumed to be formed by a set of orthogonal bases, as in PCA and KPCA. This difference also leads to a clear interpretation of methods such as LE and KPCA based on the learned matrix \(\mathbf {L}\), as discussed in Sect. 3.5, while MEU does not explain why applying KPCA to \(\mathbf {K}\) works well for dimensionality reduction. Another key difference is the expected inequality constraints used in our model, which can be represented explicitly, so the posterior distribution is well defined rather than heuristically constructed as in MEU. The third difference is that MEU uses a pseudo-likelihood approximation to learn \(\mathbf {L}\) so as to avoid the positive semidefinite constraint on \(\mathbf {K}\). However, pseudo-likelihood is motivated by computational considerations and sacrifices the accuracy of the estimated kernel (Besag 1975). By contrast, our model does not have this issue since we directly obtain the posterior distribution. Moreover, the optimization problem is a box-constrained convex optimization problem, which can be solved globally and efficiently using existing optimization tools.
4 Experiments
Table 1  Datasets used in the experiments

Task           Dataset          N     D      d    # of classes
Visualization  Circle           100   2      2    –
               DistortedSShape  100   2      2    –
               2moons           400   2      2    –
               Helix            600   3      2    –
               Twinpeaks        600   3      2    –
               Cancer           2133  367    3    –
               Teapot           400   23028  2    –
Clustering     Letter           5000  16     12   26
               Pendigits        3498  16     9    10
               Satimage         4435  36     6    6
               USPS             2007  256    32   10
               Isolet           3119  617    165  2
               Coil20           1440  1024   84   20
               YALEB            2414  1024   116  38
4.1 Experimental setting
The datasets used in the experiments are summarized in Table 1. The visualization results of the embedded points learned by the compared methods are shown in 2D or 3D for comparison (see Sect. 4.3). The clustering results are reported by running K-means on the learned embedding obtained by each compared method (see Sect. 4.4). Two popular evaluation criteria, accuracy and normalized mutual information (NMI), are used for comparing clustering methods (Nie et al. 2009). For fair comparison, we set d to be the dimension that retains 95% of the data's energy after applying PCA. All methods use the number of true clusters as the number of clusters for K-means and the same reduced dimensionality for each dataset.
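The "95% of energy" rule for choosing d can be sketched as below, taking "energy" to mean the cumulative explained variance of the PCA spectrum (an interpretation assumed here, not spelled out in the text).

```python
import numpy as np

def dim_for_energy(Y, energy=0.95):
    """Smallest d whose top-d PCA eigenvalues retain `energy` of total variance."""
    Yc = Y - Y.mean(axis=0)                        # center the data
    s = np.linalg.svd(Yc, compute_uv=False)        # singular values, descending
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative explained variance
    return int(np.searchsorted(ratio, energy) + 1)
```

For example, data lying on a two-dimensional linear subspace yields d at most 2 regardless of the ambient dimension.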
4.2 Parameter sensitivity analysis
We first conducted a sensitivity analysis for data visualization. Figures 2 and 3 show the skeleton structures of embedded points learned by the proposed method on DistortedSShape when varying \(C\in \{0.1, 1, 10, 100, \infty \}\) with \(\lambda =10\), and \(\lambda \in \{0.1, 0.5, 1, 5, 10\}\) with \(C=\infty \), respectively. We can clearly see that the skeleton becomes smoother and noise is gradually removed as C increases. Moreover, the graph represented by the similarity matrix \(\mathbf {W}\) becomes sparser as C increases. These empirical observations are consistent with our theoretical analysis in Sect. 3.3: C is a parameter for controlling the sparsity of the learned similarity matrix. The parameter \(\lambda \) is also important, as it can change the locations of the embedded points, as shown in Fig. 3. The importance of \(\lambda \) becomes clear for clustering problems, discussed below.
We also investigated the clustering performance of the proposed method by varying \(\lambda \) and C. The clustering results on the USPS and YALEB datasets with respect to \(\lambda \) and C are reported in Fig. 4. These results imply that the clustering performance largely depends on the given dataset. We observed that the clustering performance is very sensitive to \(\lambda \): a small \(\lambda \) is preferred on USPS, while a large \(\lambda \) is preferred on YALEB. This is reasonable because the data points in USPS form a clear clustering structure, while the data points in YALEB have a manifold structure (Nie et al. 2010).
4.3 Data visualization
For the data visualization in a 2D/3D Euclidean space, we performed experiments on five synthetic datasets and two real datasets.
4.3.1 Synthetic datasets
4.3.2 Cancer data
We interrogated a large-scale, publicly available breast cancer dataset (Curtis et al. 2012) for cancer progression modeling. The dataset contains the expression levels of over 25,000 gene transcripts obtained from 144 normal breast tissue samples and 1989 tumor tissue samples. Using a nonlinear regression method, a total of 359 genes were identified that may play a role in cancer development (Sun et al. 2014). We ran MPME with \(\lambda =1.2\) and \(C=\infty \) and visualized the embedded points in 3D space, where \(C=\infty \) encourages a skeleton structure through the sparsity of the solution \(\mathbf {W}\). To examine this in detail, we also report the kernel matrices learned by the variants of the MVUs.
Figure 6 shows the embedded points in 3D space and the learned similarity matrix from MPME and the MVUs. Figure 7 shows the embedded points of the remaining methods. Each tumor sample is colored with its corresponding PAM50 subtype label, a molecular approximation that uses a 50-gene signature to group breast tumors into five subtypes: normal-like, luminal A, luminal B, HER2+ and basal (Parker et al. 2009). The basal and HER2+ subtypes are known to be the most aggressive types of breast tumor. The skeleton structure learned by MPME in the 3D space suggests a linear bifurcating progression path, starting from normal tissue samples, to normal-like samples, through luminal A and on to luminal B, and then diverging to either the basal or HER2+ subtype. A conceptual linear evolution model has been proposed that posits that basal tumors are derived from luminal tumors (see Figure 6 in Creighton 2012). The revealed skeleton structure is consistent with the proposed branching architecture of cancer progression (Sun et al. 2014). For ease of understanding, we annotate each subtype of the embedded samples obtained by MPME using ellipses with dashed lines, and the progression path using curves. LE and MEU demonstrate similar trends, but their results are noisier and less smooth than those of MPME. t-SNE obtains a clustering structure, but the interconnections between the different subtypes are missing. The MVUs do not produce a clear skeleton structure for describing cancer progression, and the clustering structures of the subtypes are not as well formed as those of MPME, as can be seen by comparing the learned kernels of the MVUs with the similarity matrix of MPME. The remaining methods do not demonstrate any skeleton structures. These observations imply that our proposed MPME method can unveil a cancer progression path from high-dimensional noisy breast cancer data better than the baseline methods.
4.3.3 Teapot data with added noises
The Teapot dataset is a collection of 400 teapot images taken successively as a teapot is rotated \(360^{\circ }\). Each image consists of \(76 \times 101\) RGB pixels (Weinberger et al. 2005), i.e., each image lies in a 23,028-dimensional space. Each pixel value is divided by 255 so as to normalize it to [0, 1]. As shown in Weinberger et al. (2004, 2005) and Weinberger and Saul (2006), the MVUs can correctly recover the circular structure from the teapot data. Thus, unlike in the previous experimental settings, we intentionally add noise to the data so that the circular embeddings are not as clear as in the original data. Specifically, we add noise sampled from the uniform distribution on \([0, \rho ]\), where \(\rho \in \{0, 0.1, 0.2, 0.4, 0.5\}\), to each dimension of the teapot images. For \(\rho =0\), the data is the same as the original teapot data; the larger \(\rho \) becomes, the more noise the newly generated data has. Since the normalized pixel values are in [0, 1], the data generated with a noise rate \(\rho \le 0.5\) is expected to retain enough information to uncover the circular structure.
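The noise-injection step above can be sketched as follows; the pixel values are assumed already normalized to [0, 1], and the array shape and random seed are illustrative.

```python
import numpy as np

def add_uniform_noise(images, rho, seed=0):
    """Add i.i.d. noise drawn from U[0, rho] to every pixel."""
    rng = np.random.RandomState(seed)
    return images + rng.uniform(0.0, rho, size=images.shape)

# illustrative stand-in for normalized teapot images (real shape is 400 x 23028)
pixels = np.random.RandomState(1).randint(0, 256, size=(40, 300)) / 255.0
noisy = add_uniform_noise(pixels, rho=0.4)
```

With `rho=0` the data is unchanged; increasing `rho` shifts every pixel upward by a random amount, progressively obscuring the circular structure.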
4.4 Clustering with dimensionality reduction
Table 2  Clustering results on seven datasets using ten methods in terms of accuracy and NMI

Method       Letter  Pendigits  Satimage  USPS    Coil20  YALEB   Isolet

Accuracy
K-means      0.2632  0.6544     0.6685    0.6153  0.6674  0.1081  0.5633
PCA          0.2634  0.6527     0.6681    0.6208  0.6674  0.1110  0.5633
KPCA         0.2596  0.7387     0.7145    0.1933  0.2243  0.0766  0.5008
LLE          0.3084  0.1215     0.6510    0.1624  0.3493  0.0667  0.5021
LE           0.0674  0.7742     0.6692    0.4828  0.3889  0.1475  0.5361
\(\ell \)MVU 0.1484  0.5692     0.6886    0.3896  0.5146  0.0684  0.5989
GPLVM        0.2634  0.6527     0.6681    0.4709  0.6674  0.0671  0.5630
MEU          0.2518  0.7138     0.7398    0.6129  0.3590  0.3094  0.5476
t-SNE        0.3856  0.7504     0.6257    0.7603  0.7319  0.3592  0.5976
MPME         0.3432  0.8276     0.7411    0.7344  0.7590  0.4010  0.6935

NMI
K-means      0.3621  0.6669     0.6097    0.5657  0.7845  0.1694  0.0117
PCA          0.3591  0.6627     0.6090    0.5664  0.7845  0.1781  0.0117
KPCA         0.3528  0.6849     0.6028    0.1085  0.3104  0.0911  0.0006
LLE          0.3992  0.0059     0.5080    0.0106  0.3848  0.0810  0.0004
LE           0.0211  0.7397     0.5812    0.4522  0.5077  0.2534  0.0369
\(\ell \)MVU 0.1379  0.6422     0.5818    0.3858  0.6595  0.0870  0.0292
GPLVM        0.3591  0.6627     0.6090    0.3824  0.7845  0.0716  0.0116
MEU          0.3341  0.7635     0.6826    0.5743  0.5067  0.4174  0.0151
t-SNE        0.5059  0.7846     0.6266    0.7868  0.8743  0.5925  0.0288
MPME         0.4616  0.8139     0.6594    0.7611  0.8862  0.5301  0.1109
Table 2 reports the results of ten methods on seven datasets in terms of accuracy and NMI. We make the following observations. K-means on the embedded points obtained by PCA provides only a marginal improvement over K-means on the original space; that is, the reduced dimension obtained by retaining 95% of the data's energy captures most of the information in the data. Except for MPME and t-SNE, the other methods hardly outperform PCA, which is also observed in van der Maaten et al. (2009). Although MEU shares some properties with MPME, MPME still outperforms MEU, as shown in Table 2, due to the key differences discussed in Sect. 3.6.2. These observations further confirm that both the inequality constraints for preserving expected distances and the probabilistic model contribute to a robust dimensionality reduction model by learning a proper similarity matrix for clustering problems.
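The evaluation protocol (K-means on the embedding, then accuracy and NMI against the true labels) can be sketched as below. The Hungarian matching used to compute clustering accuracy is a standard choice assumed here, not a detail taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(labels_true, labels_pred):
    """Accuracy under the best one-to-one label matching (Hungarian algorithm)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    k = int(max(labels_true.max(), labels_pred.max())) + 1
    cost = np.zeros((k, k), dtype=int)        # confusion counts
    for t, p in zip(labels_true, labels_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)  # maximize matched counts
    return cost[rows, cols].sum() / labels_true.size

def evaluate_embedding(X, labels_true, n_clusters, seed=0):
    """Run K-means on the embedding X and return (accuracy, NMI)."""
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return (clustering_accuracy(labels_true, pred),
            normalized_mutual_info_score(labels_true, pred))
```

Accuracy requires the matching step because K-means cluster indices are arbitrary permutations of the true class labels; NMI is permutation-invariant by construction.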
5 Conclusion
We propose a novel probabilistic dimensionality reduction framework for learning a smooth skeleton structure from noisy data by directly modeling the posterior distribution of the latent points with a set of constraints over the expected pairwise distances. Our model not only tolerates noisy data and uncovers a smooth skeleton structure by learning a sparse positive similarity matrix, but also gives a natural interpretation of LE and MVU from a probabilistic point of view. Extensive experiments demonstrate that our proposed method achieves better visualization results on seven datasets and successfully unveils the circular structure of the teapot data under different added-noise ratios. The method also yields superior clustering performance compared to various baseline methods on seven real datasets.
Acknowledgements
This project is partially supported by the ARC Future Fellowship FT130100746 and ARC grant LP150100671.
References
Belkin, M., & Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), NIPS (Vol. 14, pp. 585–591). Cambridge: MIT Press.
Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7, 2399–2434.
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Burges, C. J. C. (2009). Dimension reduction: A guided tour. FTML, 2(4), 275–365.
Byrd, R. H., Lu, P., Nocedal, J., & Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5), 1190–1208.
Creighton, C. (2012). The molecular profile of luminal B breast cancer. Biologics, 15, 440.
Curtis, C., Shah, S. P., Chin, S., et al. (2012). The genomic and transcriptomic architecture of 2000 breast tumours reveals novel subgroups. Nature, 486(7403), 346–352.
Elhamifar, E., & Vidal, R. (2011). Sparse manifold clustering and embedding. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), NIPS (pp. 55–63). Granada, Spain: Granada Congress and Exhibition Centre.
Fukunaga, K. (2013). Introduction to statistical pattern recognition. New York: Academic Press.
Greaves, M., & Maley, C. C. (2012). Clonal evolution in cancer. Nature, 481(7381), 306–313.
Gupta, A. K., & Nagar, D. K. (1999). Matrix variate distributions (Vol. 104). Boca Raton: CRC Press.
Hastie, T., & Stuetzle, W. (1989). Principal curves. JASA, 84, 502–516.
Jebara, T. (2001). Discriminative, generative and imitative learning. Ph.D. thesis, Massachusetts Institute of Technology.
Jimenez, L. O., & Landgrebe, D. (1998). Supervised classification in high-dimensional space: Geometrical, statistical, and asymptotical properties of multivariate data. TSMC, 28(1), 39–54.
 Jolliffe, J. T. (1986). Principal component analysis. Berlin: Springer.CrossRefzbMATHGoogle Scholar
 Kégl, B., Krzyzak, A., Linder, T., & Zeger, K. (2000). Learning and design of principal curves. IEEE TPAMI, 22(3), 281–297.CrossRefGoogle Scholar
 Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.MathSciNetCrossRefzbMATHGoogle Scholar
 Lawrence, N. D. (2005). Probabilistic nonlinear principal component analysis with gaussian process latent variable models. JMLR, 6, 1783–1816.MathSciNetzbMATHGoogle Scholar
 Lawrence, N. D. (2012). A unifying probabilistic perspective for spectral dimensionality reduction: Insights and new models. JMLR, 13(1), 1609–1638.MathSciNetzbMATHGoogle Scholar
 Mao, Q., Tsang, I. W., Gao, S., & Wang, L. (2015). Generalized multiple kernel learning with datadependent priors. IEEE TNNLS, 24(2), 248–261.Google Scholar
 Nene, S. A., Nayar, S. K., & Murase, H. (1996). Columbia object image library (coil20). Technical Report CUCS00596.Google Scholar
 Nie, F., Xu, D., Tsang, I. W., & Zhang, C. (2009). Spectral embedded clustering. In C. Boutilier (Ed.), IJCAI (pp. 1181–1186). Menlo Park, California: AAAI Press.Google Scholar
 Nie, F., Xu, D., Tsang, I. W., & Zhang, C. (2010). Flexible manifold embedding: A framework for semisupervised and unsupervised dimension reduction. TIP, 19(7), 1921–1932.MathSciNetGoogle Scholar
 Parker, J., Mullins, M., Cheang, M., et al. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology, 27(8), 1160–1167.CrossRefGoogle Scholar
 Rasmussen, C. E. (2006). Gaussian processes for machine learning. Cambridge: The MIT Press.zbMATHGoogle Scholar
 Saul, L. K., & Roweis, S. T. (2003). Think globally, fit locally: Unsupervised learning of low dimensional manifolds. JMLR, 4, 119–155.MathSciNetzbMATHGoogle Scholar
 Schölkopf, B., Smola, A., & Muller, K. (1999). Kernel principal component analysis. In B. Schölkopf, A. J. Smola, & C. J. C. Burges (Eds.), Advances in Kernel methods–Support vector learning (pp. 327–352). Cambridge: MIT Press.Google Scholar
 Smola, A. J., & Kondor, R. (2003). Kernels and regularization on graphs. In B. Schölkopf & M. K. Warmuth (Eds.), ICML (pp. 144–158). New York: Springer.Google Scholar
 Song, L., Smola, A., Gretton, A., & Borgwardt, K. (2007). A dependence maximization view of clustering. In Z. Ghahramani (Ed.), ICML (pp. 815–822). New York: ACM.CrossRefGoogle Scholar
 Sun, Y., Yao, J., Nowak, N., & Goodison, S. (2014). Cancer progression modeling using static sample data. Genome Biology, 15(8), 440.CrossRefGoogle Scholar
 Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society, 61(3), 611–622.MathSciNetCrossRefzbMATHGoogle Scholar
 Tutuncu, R., Toh, K., & Todd, M. (2003). Solving semidefinitequadraticlinear programs using SDPT3. Mathematical Programming, 95, 189–217.MathSciNetCrossRefzbMATHGoogle Scholar
 Vandenberghe, L., & Boyd, S. (1996). Semidefinite programming. SIAM Review, 38(1), 49–95.MathSciNetCrossRefzbMATHGoogle Scholar
 van der Maaten, L., & Hinton, G. (2008). Visualizing data using tsne. JMLR, 9(2579–2605), 85.zbMATHGoogle Scholar
 van der Maaten, L., Postma, E. O., & van den Herik, H. J. (2009). Dimensionality reduction: A comparative review. Tilburg University Technical Report, TiCCTR 2009005.Google Scholar
 Weinberger, K. Q., Packer, B. D., & Saul, L. K. (2005). Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In R. G. Cowell & Z. Ghahramani (Eds.) Proceedings of the 10th international workshop on artificial intelligence and statistics (pp. 381–388).Google Scholar
 Weinberger, K. Q., & Saul, L. K. (2006). Unsupervised learning of image manifolds by semidefinite programming. IJCV, 70(1), 77–90.CrossRefGoogle Scholar
 Weinberger, K. Q., Sha, F., & Saul, L. K. (2004). Learning a kernel matrix for nonlinear dimensionality reduction. In C. E. Brodley (Ed.), ICML (p. 106). New York: ACM.CrossRefGoogle Scholar
 Zhu, J., & Xing, E. P. (2009). Maximum entropy discrimination markov networks. JMLR, 10, 2531–2569.MathSciNetzbMATHGoogle Scholar