
1 Introduction

In recent years, various kinds of information, such as location records, search histories, and videos, have been converted into numerical, categorical, or binary data. These data are often expressed as high-dimensional vectors, so it is difficult to observe them directly in order to grasp how they are distributed and what relationships exist among them. To exploit our high-order brain functions and intuition for analyzing such data, dimension reduction into a two- or three-dimensional space is effective. Dimension reduction, not limited to two or three dimensions, is also useful for avoiding the “curse of dimensionality,” a common obstacle in regression and classification. The many visualization methods proposed so far fall into unsupervised methods and supervised methods, and each group is further divided into linear and nonlinear methods. Unsupervised methods do not use class labels, as seen in principal component analysis (PCA) and multidimensional scaling (MDS). PCA is a linear mapping that maximizes the variance of the mapped data. MDS is a nonlinear mapping that preserves the distances between data points, before and after mapping, as much as possible. Supervised methods, on the other hand, use class labels as supervision information; Fisher linear discriminant analysis (FLDA) is a representative example. FLDA is a linear mapping that minimizes the ratio of within-class variance to between-class variance in the mapped data. The visual neural classifier [5] is an example of a supervised nonlinear method.

Unsupervised methods are useful for revealing hidden structure, typically manifolds formed by the data, whereas supervised methods are effective for revealing the separability of classes. Linear methods keep the linear structure of the data but cannot express manifold structures with varying curvature. Nonlinear methods, on the other hand, can capture manifold structure effectively but may produce spurious structure that misleads analysts. Laplacian Eigenmaps (LEs), our main concern, are originally unsupervised nonlinear mappings that preserve the neighbor relations of the data through graph Laplacians over adjacency graphs. In this paper, we propose a novel supervised LE, which combines feature and label information into a single neighborhood relation between data points.

2 Related Works

In this section, we provide an overview of supervised LEs. So far, CCDR [2], Constraint Score [8], S-LapEig [4], and S-LE [6] have been proposed. In the following, each algorithm is introduced in detail. Note that some parameter symbols are changed from the original papers to keep the notation consistent throughout this paper: \(k\) is used in common for the number of nearest neighbors, \(\sigma ^{2}\) for the variance of an exponential, \(\tau ^{2}\) for the variance of a second exponential, \(\beta \) for a parameter on label agreement, and \(\lambda \) for a parameter balancing feature-space and label-space information. The parameters required by each algorithm are also shown together with its name.

2.1 Laplacian Eigenmaps: LE(k) (Original LE)

Given \(n\) data points \(\{\varvec{x}_i\}^n_{i=1}\) in a high-dimensional space \({\mathbb R}^{M}\), the original LE [1] maps them into points \(\{\varvec{z}_i\}^n_{i=1}\) in a low-dimensional space \({\mathbb R}^{m}\) on the basis of a neighbor relation represented by \(\{w_{ij}(\ge 0)\}^n_{i,j=1}\) over \(\{\varvec{x}_i\}^n_{i=1}\) in such a way to minimize

$$\begin{aligned} J_{\text{ LE }}=\sum _i\sum _j{||\varvec{z}_i-\varvec{z}_j||}^2w_{ij}. \end{aligned}$$
(1)

This formulation corresponds to the graph Laplacian with adjacency matrix \(W=(w_{ij})\). Typically, \(W\) is given by

$$\begin{aligned} w_{ij}= {\left\{ \begin{array}{ll} \exp ( -\Vert \varvec{x}_i-\varvec{x}_j\Vert ^{2}/\sigma ^{2}) &{} (\varvec{x}_i \in \text{ kNN }(\varvec{x}_j ) \vee \varvec{x}_j \in \text{ kNN }(\varvec{x}_i))\\ 0 &{} \text{(otherwise) }\\ \end{array}\right. }, \end{aligned}$$

where \(\varvec{x}_i \in \text{ kNN }(\varvec{x}_{j})\) indicates that \(\varvec{x}_i\) is one of the \(k\) nearest neighbors of \(\varvec{x}_j\).

Let \(Z\) be an \(n\times m\) matrix whose \(i\)th row is \(\varvec{z}_{i}^{T}\) (‘\(T\)’ denotes the transpose). Then \(J_{\text{ LE }}\) becomes \(J_{\text{ LE }}=2 \text{ tr } Z^{T} L Z\) (‘tr’ denotes the trace), where \(L=D-W\) with \(D=\text{ diag }(\sum _{j}w_{1j}, \ldots ,\sum _{j}w_{nj})\). We find \(\{\varvec{z}_i\}^n_{i=1}\) by minimizing \(\text{ tr } Z^{T} L Z\) subject to \(Z^{T} D Z=I\). The solution is obtained by solving the generalized eigenvalue problem \( L Z = D Z\varLambda \); avoiding the trivial eigenvector \(\mathbf 1\) with eigenvalue \(0\), the eigenvectors with the second to \((m+1)\)th smallest eigenvalues are used as the columns of \(Z\). Note that \(L\) is positive semi-definite.
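For concreteness, the whole procedure (symmetric \(k\)NN graph, heat-kernel weights, generalized eigenproblem) can be sketched in a few lines of Python. This is a minimal illustration under the definitions above, not the reference implementation of [1]; the use of scikit-learn and SciPy, and all function names, are our own choices.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import NearestNeighbors

def laplacian_eigenmaps(X, m=2, k=10, sigma2=1.0):
    """Minimal sketch of the original LE: symmetric kNN graph, heat-kernel weights,
    and the generalized eigenproblem L Z = D Z Lambda (smallest nontrivial eigenvectors)."""
    # adjacency: x_i in kNN(x_j) OR x_j in kNN(x_i)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    A = nn.kneighbors_graph(X, mode='connectivity').toarray()
    np.fill_diagonal(A, 0)                       # drop self-loops
    A = np.maximum(A, A.T)                       # OR-symmetrization
    # heat-kernel weights on the retained edges
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = A * np.exp(-sq / sigma2)
    D = np.diag(W.sum(axis=1))
    L = D - W
    # generalized eigenproblem; eigh returns eigenvalues in ascending order
    _, vecs = eigh(L, D)
    # skip the trivial constant eigenvector (eigenvalue ~0), keep the next m
    return vecs[:, 1:m + 1]
```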

2.2 Classification Constrained Dimensionality Reduction: CCDR(\(k,\sigma ^2,\lambda \))

CCDR [2] introduces a hypothetical node for each class, called a class center, and requires the points of the same class to gather around their class center in the mapped space. Let \(\mu _k\in \mathbb {R}^m\) be the class center of class \(k\) in the mapped space and \(C=(c_{ki})\) be the class membership matrix, i.e., \(c_{ki}=1\) if \(\varvec{x}_i\in \mathbb {R}^M\) has label \(k\) and \(c_{ki}=0\) otherwise. CCDR minimizes the cost function

$$\begin{aligned} J_{\text{ CCDR }}=\lambda \sum _{i,j}||\varvec{z}_i-\varvec{z}_j||^2w_{ij}+(1-\lambda )\sum _{k,i}{||\mu _k-\varvec{z}_i||}^2c_{ki}, \end{aligned}$$
(2)

where

$$\begin{aligned} w_{ij} = {\left\{ \begin{array}{ll} \exp ( -\Vert \varvec{x}_i-\varvec{x}_j\Vert ^{2}/\sigma ^{2}) &{} (\varvec{x}_i \in \text{ kNN }(\varvec{x}_j) \vee \varvec{x}_j \in \text{ kNN }(\varvec{x}_i ))\\ 0 &{} \text{(otherwise) } \end{array}\right. }, c_{ki} = {\left\{ \begin{array}{ll} 1 &{} (y_{i}=k)\\ 0 &{} (y_{i}\ne k)\\ \end{array}\right. }. \end{aligned}$$

Here \(y_i\) is the class label of \(\varvec{x}_i\) and \(\lambda \; (0\le \lambda \le 1)\) is a balance parameter between feature space information and label space information. In [2], \(\sigma ^{2}\) is determined as ten times the average of the squared nearest neighbor distances and \(\lambda =1/2\).
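As a rough illustration, the CCDR-specific ingredients on top of the LE-style weights can be computed as follows. This is a sketch under our reading of [2]: the feature-space weights \(W\) are assumed to be built as in the LE sketch above, and the function and variable names are our own.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ccdr_extras(X, y):
    """Sketch of the CCDR-specific ingredients: the class-membership matrix C and
    the sigma^2 heuristic of [2] (ten times the average squared nearest-neighbor
    distance); the feature-space weights W are built as in the LE sketch."""
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    dist, _ = nn.kneighbors(X)                   # column 0 is the point itself
    sigma2 = 10.0 * np.mean(dist[:, 1] ** 2)     # heuristic reported in [2]
    classes = np.unique(y)
    C = (y[None, :] == classes[:, None]).astype(float)   # C[k, i] = 1 iff y_i is the k-th class
    return C, sigma2
```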

2.3 Constraint Score: CS(\(\beta \))

The Constraint Score [8] was not proposed directly for dimension reduction or visualization, but for feature selection. However, its criterion can be used for LE. It handles sample pairs in a naïve way: if two samples share the same class, their pair is weighted by \(+1\) in (1), and by \(-1\) otherwise.

Although two cost functions, a division type and a subtraction type, are shown in [8], we consider only the subtraction type, which minimizes the cost function

$$ J_{\text{ CS }}=\sum _{i,j}||\varvec{z}_i-\varvec{z}_j||^2w^{M}_{ij}- \beta \sum _{i,j} \Vert \varvec{z}_{i}-\varvec{z}_j\Vert ^2w^{C}_{ij}, $$

where \(w^{M}_{ij}=I(y_{i}=y_{j})\) and \(w^{C}_{ij}=I(y_{i}\ne y_{j})\) (\(I(\cdot )\) is the indicator function that takes 1 if the argument is true, and 0 otherwise).
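The two label-derived weight matrices can be sketched as follows; this is a minimal illustration of the definitions above, with names of our own choosing.

```python
import numpy as np

def constraint_score_weights(y):
    """Sketch of the Constraint Score weights: W^M marks same-class pairs,
    W^C marks different-class pairs (self-pairs are irrelevant to the cost
    since ||z_i - z_i||^2 = 0, but we zero them for tidiness)."""
    y = np.asarray(y)
    same = (y[:, None] == y[None, :]).astype(float)
    WM = same.copy()
    np.fill_diagonal(WM, 0.0)          # w^M_ij = 1 iff y_i == y_j, i != j
    WC = 1.0 - same                    # w^C_ij = 1 iff y_i != y_j
    return WM, WC
```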

2.4 S-LapEig(\(k,\sigma ^{2}, \tau ^{2}\))

S-LapEig [4] modifies the distance between data points \(\{\varvec{x}_i\}^n_{i=1}\) in the original space so that data of the same class become closer and data of different classes become more distant. The criterion to minimize is the same as in the original LE: \(J_{\text{ S-LapEig }}=J_{\text{ LE }}\). However, the weights are determined in two stages as

$$ w_{ij}= {\left\{ \begin{array}{ll} \exp ( -d^{2}(\varvec{x}_i,\varvec{x}_j)/\sigma ^{2}) &{} (\varvec{x}_i \in \text{ kNN }(\varvec{x}_j) \vee \varvec{x}_j \in \text{ kNN }(\varvec{x}_i))\\ 0 &{} \text{(otherwise) }\\ \end{array}\right. }, $$

where

$$d^2(\varvec{x}_i,\varvec{x}_j) = {\left\{ \begin{array}{ll} 1-\text{ exp }(-\Vert \varvec{x}_i-\varvec{x}_j\Vert ^{2}/\tau ^{2}) &{} (y_i=y_j)\\ \text {exp}(\Vert \varvec{x}_i-\varvec{x}_j\Vert ^{2}/\tau ^{2}) &{} (y_i\ne y_j) \end{array}\right. }.$$

Here \(\tau ^{2}\) is taken as the square of the average Euclidean distance between all pairs of data points in [4].
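The label-modified squared distance can be sketched as follows; the \(\tau ^{2}\) default (the square of the average pairwise Euclidean distance) follows the description above, and everything else, including the function name, is our own. The resulting \(d^{2}\) is then plugged into the \(k\)NN heat-kernel rule above.

```python
import numpy as np

def s_lapeig_distance(X, y, tau2=None):
    """Sketch of the label-modified squared distance of S-LapEig [4]:
    same-class pairs are pulled together, different-class pairs pushed apart."""
    y = np.asarray(y)
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if tau2 is None:
        off = ~np.eye(n, dtype=bool)
        tau2 = np.mean(np.sqrt(sq[off])) ** 2    # square of the average pairwise distance
    same = (y[:, None] == y[None, :])
    d2 = np.where(same,
                  1.0 - np.exp(-sq / tau2),      # y_i == y_j: stays below 1
                  np.exp(sq / tau2))             # y_i != y_j: at least 1, grows with distance
    return d2
```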

2.5 S-LE(\(\sigma ^{2}, \beta \))

S-LE [6] computes the adjacency matrix \(W\) as follows. Let \(AS(\varvec{x}_i)=1/n\cdot \sum _{j=1}^{n} s(\varvec{x}_i,\varvec{x}_j)\), where \( s(\varvec{x}_i,\varvec{x}_j) = \text {exp}\left( -\Vert \varvec{x}_i-\varvec{x}_j\Vert ^2/\sigma ^{2} \right) \). If \((s(\varvec{x}_i,\varvec{x}_j)>AS(\varvec{x}_i)) \wedge (y_{i}=y_{j})\), then \(\varvec{x}_j\) is regarded as a within-class neighbor of \(\varvec{x}_i\), denoted by \(\varvec{x}_j\in N_w(\varvec{x}_i)\). If, instead, \((s(\varvec{x}_i,\varvec{x}_j)>AS(\varvec{x}_i)) \wedge (y_{i} \ne y_{j})\), then \(\varvec{x}_j \in N_b(\varvec{x}_i)\). Under these definitions, S-LE maximizes (not minimizes)

$$J_{\text{ SLE }} = \beta \sum _{i,j}{||\varvec{z}_i-\varvec{z}_j||}^2w_{ij}^B - (1-\beta )\sum _{i,j}{||\varvec{z}_i-\varvec{z}_j||}^2w_{ij}^W,$$

where the within-class weights \(w^{W}_{ij}\) are nonzero only for pairs with \(\varvec{x}_j\in N_w(\varvec{x}_i)\) or \(\varvec{x}_i\in N_w(\varvec{x}_j)\), and the between-class weights \(w^{B}_{ij}\) are nonzero only for pairs with \(\varvec{x}_j\in N_b(\varvec{x}_i)\) or \(\varvec{x}_i\in N_b(\varvec{x}_j)\) [6].
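The neighborhood rule itself can be sketched directly from the definitions above. This is only a sketch: it builds the within-class and between-class neighbor masks, while the exact weight values attached to the pairs follow [6] and are not reproduced here.

```python
import numpy as np

def sle_neighbor_masks(X, y, sigma2=1.0):
    """Sketch of the S-LE neighborhood rule: x_j is a within-class neighbor of x_i
    (N_w) if s(x_i, x_j) exceeds the row-average similarity AS(x_i) and labels agree,
    and a between-class neighbor (N_b) if it exceeds AS(x_i) and labels differ."""
    y = np.asarray(y)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    S = np.exp(-sq / sigma2)                    # s(x_i, x_j)
    AS = S.mean(axis=1, keepdims=True)          # AS(x_i) = (1/n) * sum_j s(x_i, x_j)
    close = S > AS                              # row-wise threshold
    same = (y[:, None] == y[None, :])
    N_w = close & same                          # row i gives N_w(x_i)
    N_b = close & ~same                         # row i gives N_b(x_i)
    return N_w, N_b
```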

3 Supervised Laplacian Eigenmaps

Almost all the supervised LE algorithms referred to in Sect. 2 basically separate a pair \((\varvec{x}_i,\varvec{x}_j)\) into a same-class pair or a different-class pair and evaluate the two kinds separately. Therefore, special attention must be paid to the difference in the numbers of the two kinds of pairs. In addition, some algorithms cannot control the degree to which label information is mixed with feature information. Above all, they cannot deal with multi-label datasets, where a single data point is associated with multiple class labels. Only CCDR could deal with multi-label data, but it has its own problem, as will be discussed later. The other three algorithms are also not easy to extend, because they treat sample pairs differently depending on whether or not the pair shares the same class. To cope with these limitations, we propose a novel supervised LE, called Supervised Laplacian Eigenmaps for Multi-Label datasets (SLE-ML for short), mainly for visualization. We combine the neighbor information in the feature space and that in the class-label space into one with a balance parameter \(\lambda ~(0\le \lambda \le 1)\).

3.1 Supervised Laplacian Eigenmaps for Multi-label Datasets: SLE-ML(\(k,\lambda \))

SLE-ML minimizes a cost function of the same form as the original LE, but with different weights:

$$\begin{aligned} J_{\text{ SLE-ML }}=\lambda \sum _{i,j}||\varvec{z}_i-\varvec{z}_j||^2w^{F}_{ij}+(1-\lambda )\sum _{i,j}{||\varvec{z}_{i}-\varvec{z}_j||}^2w^{L}_{ij} , \end{aligned}$$
(3)

where \(w^{F}_{ij}\) is a neighbor weight determined from the \(k\) nearest neighbors in the feature space, and \(w^{L}_{ij}\) is the similarity between the label sets of \(\varvec{x}_i\) and \(\varvec{x}_j\), defined below.

Here, the superscript ‘F’ stands for “feature space” and ‘L’ for “label space”. The weight \(w^L_{ij}\) is the Jaccard similarity coefficient, i.e., the ratio of the number of common labels to the number of labels in the union of the two label sets, and takes a value between 0 and 1. For a single-label problem, \(w_{ij}^L=1\) if data points \(i\) and \(j\) share the same label, and \(w^L_{ij}=0\) otherwise. The original (unsupervised) LE is a special case of SLE-ML with \(\lambda =1.0\). Unlike many previous supervised LEs, which take a trade-off in the feature space between same-class pairs and different-class pairs, SLE-ML takes a trade-off between feature-space and label-space similarity.
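The two weight matrices and the cost (3) can be sketched as follows. The Jaccard label weight follows the description above; the exact form of \(w^{F}_{ij}\) is not reproduced here, so the symmetric \(k\)NN indicator used below is our assumption, and all names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sle_ml_weights(X, Y, k=10):
    """Sketch of the two SLE-ML weight matrices.  Y is an n x K binary
    label-indicator matrix (one column per class); multi-label rows may
    contain several ones."""
    Y = np.asarray(Y, dtype=float)
    # feature space: symmetric kNN indicator (our assumption for w^F)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    A = nn.kneighbors_graph(X, mode='connectivity').toarray()
    np.fill_diagonal(A, 0)
    WF = np.maximum(A, A.T)
    # label space: Jaccard coefficient (common labels / labels in the union)
    inter = Y @ Y.T
    union = Y.sum(axis=1, keepdims=True) + Y.sum(axis=1) - inter
    WL = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    return WF, WL

def sle_ml_cost(Z, WF, WL, lam):
    """Cost (3): lam * feature-space term + (1 - lam) * label-space term."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return lam * np.sum(sq * WF) + (1 - lam) * np.sum(sq * WL)
```

For a single-label dataset, \(Y\) is a one-hot matrix and `WL` reduces to the same-class indicator, as stated above.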

3.2 Parameters

All algorithms have their own parameters: CCDR(\(k,\sigma ^2,\lambda \)), CS(\(\beta \)), S-LapEig(\(k,\sigma ^{2},\tau ^{2}\)), S-LE(\(\sigma ^{2}, \beta \)), and SLE-ML(\(k,\lambda \)). Choosing an appropriate value for each parameter is often critical, so we first discuss how to determine the values and how sensitive the results are to them.

The variance parameters \(\sigma ^{2}\) and \(\tau ^{2}\) are often determined from the data; a typical way is to use the average squared Euclidean distance between all pairs of data points. As for \(k\), we need to use the same value in common for all algorithms. In the following experiments, \(k\) is set to 1.5 times the average sample size per class so that each sample is related to samples of other classes as well.

The remaining parameters, \(\beta \) (on label agreement) and \(\lambda \) (on the trade-off between feature and label information), require more careful setting. Consider \(\beta \) in CS(\(\beta \)) and S-LE(\(\sigma ^{2}, \beta \)). When the number of classes is large, pairs sharing the same label are far fewer than pairs of different labels, so the value of \(\beta \) has to be set in accordance with the given dataset.

In supervised LEs, the most important issue is how the label information is incorporated into the feature information. In contrast to LE, which uses the feature information only, using the label information only makes all points of the same label concentrate on a single point in the mapped space, as seen in SLE-ML(\(k,\lambda =0.0\)). Therefore, the value of \(\lambda \) requires more care than the other parameters. CCDR(\(k,\sigma ^2,\lambda \)) has the same parameter \(\lambda \), but it has another problem: in the criterion (2), the first term consists of \(O(n^{2})\) summands while the second consists of only \(O(Kn)\), where \(K\) is the number of classes. Therefore, when \(K \ll n\) or, conversely, \(K\) is very large (as in extreme multi-label problems), the effect of the same value of \(\lambda \) changes, so it needs to be set carefully. In SLE-ML(\(k,\lambda \)), the two terms in (3) have the same size of \(O(n^2)\), so we do not need to worry about the number of classes and can choose the value of \(\lambda \) independently of the dataset; in this sense, SLE-ML is problem-independent. The algorithms other than CCDR(\(k,\sigma ^2,\lambda \)) and SLE-ML(\(k,\lambda \)) do not even have a trade-off parameter between feature and label information, which means this balance cannot be controlled.
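To make the size imbalance in (2) concrete, consider a hypothetical dataset with \(n=1000\) samples and \(K=10\) classes; the two terms then aggregate very different numbers of summands:

$$\begin{aligned} \underbrace{\lambda \sum _{i,j}||\varvec{z}_i-\varvec{z}_j||^2w_{ij}}_{n^{2}=10^{6}\ \text{summands}} \;+\; \underbrace{(1-\lambda )\sum _{k,i}||\mu _k-\varvec{z}_i||^2c_{ki}}_{Kn=10^{4}\ \text{summands}}. \end{aligned}$$

With \(\lambda =1/2\), the feature term thus sums a hundred times more pairs than the label term unless the weights are rescaled, whereas both terms of (3) always run over the same \(n^{2}\) pairs.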

Table 1. Datasets. (A) stands for artificial datasets and (N) for natural datasets. In the artificial datasets, garbage features are added; 7 for Torus and 8 for Clusdat.

4 Experiments

We evaluated the performance of the proposed method on several high-dimensional datasets (Table 1). The dataset digits consists of 1797 images of hand-written digits (0–9). In our experiments, the parameter \(k\) for nearest neighbors was set to 1.5 times the average number of samples per class, as described before.
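For reference, this rule gives \(k\approx 270\) for digits. A minimal sketch, assuming the dataset is the 1797-sample handwritten-digits set shipped with scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)          # 1797 samples, 64 features, 10 classes
avg_per_class = np.mean(np.bincount(y))      # about 179.7 samples per class
k = int(round(1.5 * avg_per_class))          # about 270 nearest neighbors
print(X.shape, k)
```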

Fig. 1. Effect of the balance parameter \(\lambda \) in the proposed SLE-ML on the digits dataset (each color corresponds to a class, as seen in the case of \(\lambda =0.4\)). (Color figure online)

Fig. 2. Visualization of digits by three algorithms.

Figure 1 shows the visualization of digits by SLE-ML. To confirm the effect of the parameter \(\lambda \), we varied its value from 0 to 1 in steps of 0.2. We see that \(\lambda =1.0\) (feature space only) gives the same mapping as LE, and \(\lambda =0.0\) (label space only) gives a class-isolated mapping. For intermediate values of \(\lambda \), the result reflects the trade-off between the feature and label spaces. It should be noted that a smaller value of \(\lambda \) tends to exaggerate the separability among classes beyond the reality, so we recommend analyzing data with two values of \(\lambda \), \(\lambda =0.5\) and \(0.9\), at once. We compared four algorithms, CCDR, CS, S-LapEig, and S-LE, with SLE-ML. The parameters were chosen so as to produce nearly the best results, except for SLE-ML. In Fig. 2, the results of CCDR, S-LE, and S-LapEig are shown; CS did not produce any good result, so it is omitted. We see that a high separability of classes is visualized by CCDR, S-LapEig, and SLE-ML(\(\lambda =0.4, 0.6\)), while the other algorithms fail to reveal the separability that actually exists. In the following, therefore, we compare these three only.

Fig. 3. Visualization of Torus.

To clarify the differences among these algorithms, we visualized two artificial datasets, Torus and Clusdat. Note that these data are contaminated with garbage features. The results are shown in Figs. 3 and 4. We see that CCDR and SLE-ML (\(\lambda =0.9\)) expose the manifold structure to some extent, while SLE-ML (\(\lambda =0.5\)) and S-LapEig succeed in showing the separability.

Next we dealt with multi-label datasets. Figure 5 shows the visualization of scene by SLE-ML (\(\lambda =0.5\)). We observe that multi-label data are mapped in the same way as single-label data. In Fig. 5, data with the two labels {Fall foliage, Field} are located between data with {Fall foliage} and data with {Field}. Such an observation reveals the relationship between a composite class and its component classes in the original feature space.
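This behavior is consistent with the Jaccard-based label weight defined in Sect. 3.1: for the label sets in question,

$$ w^{L}\bigl (\{\text{Fall foliage, Field}\},\{\text{Fall foliage}\}\bigr ) =\frac{|\{\text{Fall foliage}\}|}{|\{\text{Fall foliage, Field}\}|}=\frac{1}{2}, $$

so a two-label point is attracted to each component class with half the weight of an identical-label pair, which explains its intermediate position.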

Fig. 4. Visualization of Clusdat.

5 Discussion

The proposed SLE-ML is advantageous over the four compared algorithms in the sense that its results give more information than the others. This is mainly because the control parameter \(\lambda \) is intuitive, and multiple results with different values of \(\lambda \) help us to analyze the data. However, there still remain challenges that the original LE had and that many LE variants still have. First, we need to resolve the “out-of-sample” problem: since the mapping in SLE-ML is not explicit, it cannot be applied to newly arriving data. We are considering approximating the mapping linearly or nonlinearly; if this succeeds, we may choose the parameter value under which separability is kept high. Next, we need to cope with the “imbalance problem”: SLE-ML needs to be modified to emphasize minority classes. Last, we have to devise some way to visualize hundreds of thousands of data points and data with a large number of labels.

Fig. 5. Visualization of scene with multiple labels.

6 Conclusion

In this paper, we have proposed a novel supervised Laplacian eigenmap algorithm that can handle multi-label data in addition to single-label data. The experiments demonstrated the advantages of the algorithm over the compared state-of-the-art algorithms in visualization quality and understandability, as well as in the ease of parameter setting.

In the proposed algorithm, we combine the feature information and the label information into one and control the balance with a single parameter. To mitigate the risk of being misled by the apparent separability produced by a small value of the parameter, we recommend using two different values of the parameter at once. We also analyzed how appropriately the parameter values of the four previous algorithms can be set and, as a result, pointed out several points requiring care.