
1 Introduction

With the rapid growth of information technology, massive data sets, such as human gene distributions, health reports, climate patterns, and commercial records, have been created. Dimension reduction (DR), as a key technique for extracting important information from high-dimensional data, has attracted increasing attention and study.

The major task of DR is to find a projection that maps high-dimensional data into a lower-dimensional space according to some criterion. Many DR methods have been proposed so far, and they can be divided into linear and nonlinear groups. PCA and LDA [1] are the most popular linear methods, which seek linear projection matrices. In contrast to linear DR methods, nonlinear approaches are able to discover the nonlinear structure of real-world data. Many nonlinear DR methods are based on kernel functions. For example, kernel principal component analysis (KPCA) [2] has been widely used to obtain an implicit nonlinear mapping, but its results may degrade when an inappropriate kernel function is chosen. Recently, methods based on manifold learning, such as Laplacian eigenmaps (LE) [3], locally linear embedding (LLE) [4] and Isomap [5], have drawn much attention. In particular, Isomap, which reveals the intrinsic geometric structure of a manifold by preserving the geodesic distances between all point pairs, has produced encouraging results. Consequently, a series of Isomap variants have appeared, such as supervised Isomap (S-Isomap) [6], orthogonal constrained marginal Isomap (M-Isomap) [7] and multi-manifold discriminant Isomap (MMD-Isomap) [8]. Among them, S-Isomap has shown good performance in visualization, image retrieval and classification by modifying the distances between pairs of points with the same and different class labels. Despite the good performance achieved by the Isomap family, these methods suffer from a heavy computational burden caused by calculating the geodesic distances between all point pairs. In particular, the distances have to be recomputed when new data arrive [9], which adds further computational load.

Incremental learning, as a family of efficient and scalable machine learning methods, has been proposed to operate on a sequence of data instances. At each step, an incremental learning algorithm processes an incoming example and updates the learner with that instance. An incremental version of Isomap has been proposed, which efficiently updates the geodesic distances and re-estimates the eigenvectors using previous computation results [10]. On the other hand, for supervised learning, the number of labeled samples is limited in practice due to cost and time. This issue has motivated the recent study of active learning [11], which queries only a subset of informative incoming instances to update the learner and aims to maximize learner performance with minimal human labeling effort at each iteration.

To cut down the computational cost of S-Isomap and to use the most informative labeled samples, we present an incremental active learning S-Isomap, called IALSI. The proposed method converts the original S-Isomap into an incremental version, which greatly reduces the computational cost. Furthermore, active learning is applied for high efficiency and scalability in large-scale learning tasks.

The remainder of this paper is organized as follows: Sect. 2 describes our proposed incremental active learning S-Isomap. The experimental results are shown in Sect. 3. Finally, Sect. 4 gives the conclusion and future work.

2 Incremental Active Learning S-Isomap

In this section, we elaborate on our method. We first give a brief review of S-Isomap and then introduce the details of our method.

2.1 S-Isomap

Given a data set \( \mathbf{X} = \left[ \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N \right] \in \mathbf{R}^{r \times N} \) with N points, DR aims to find a mapping that maps these points to a new data set \( \mathbf{Y} = \left[ \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N \right] \in \mathbf{R}^{r' \times N} \) in a lower-dimensional space \( (r' \ll r) \). \( d(\mathbf{x}_i, \mathbf{x}_j) \) denotes the Euclidean distance between points \( \mathbf{x}_i \) and \( \mathbf{x}_j \).

S-Isomap is a supervised version of Isomap. Considering the class label information of samples, it defines the distance metric as follows:

$$ \tilde{d}\left( \mathbf{x}_i, \mathbf{x}_j \right) = \begin{cases} \sqrt{1 - \exp\left( -d^2\left( \mathbf{x}_i, \mathbf{x}_j \right)/\beta \right)} & \text{if } l\left( \mathbf{x}_i \right) = l\left( \mathbf{x}_j \right) \\ \sqrt{\exp\left( d^2\left( \mathbf{x}_i, \mathbf{x}_j \right)/\beta \right)} - \alpha & \text{if } l\left( \mathbf{x}_i \right) \ne l\left( \mathbf{x}_j \right) \end{cases} $$
(1)

where the parameters \( \alpha \) and \( \beta \) control the range of the distance measure. In practice, \( \alpha \) is set to a small positive value, and \( \beta \) is set to the average Euclidean distance between all pairs of data points. \( l(\mathbf{x}_i) \) is the class label of \( \mathbf{x}_i \). After the point distances are modified, the same steps as in Isomap are used to obtain the lower-dimensional representations of the samples. They can be summarized as follows:

  1. Construct the neighborhood graph \( \mathbf{D}_G \). For every pair of data points, if \( \tilde{d}(\mathbf{x}_i, \mathbf{x}_j) \) is smaller than a fixed radius \( \epsilon \), or \( \mathbf{x}_j \in \mathrm{KNN}(\mathbf{x}_i) \) (i.e., \( \mathbf{x}_j \) is one of the K nearest neighbors of \( \mathbf{x}_i \)), add an edge with weight \( \tilde{d}(\mathbf{x}_i, \mathbf{x}_j) \).

  2. Compute the shortest geodesic distances. Initialize \( d_G(\mathbf{x}_i, \mathbf{x}_j) = \tilde{d}(\mathbf{x}_i, \mathbf{x}_j) \) if \( \mathbf{x}_j \) and \( \mathbf{x}_i \) are neighbors; otherwise, set \( d_G(\mathbf{x}_i, \mathbf{x}_j) = \infty \). Estimate the geodesic distances between all pairs of points by computing their shortest path distances in \( \mathbf{D}_G \) using Dijkstra's or Floyd's algorithm.

  3. Construct the r'-dimensional embedding by solving the following optimization problem:

    $$ \min_{\mathbf{Y}} \sum_{\mathbf{x}_i, \mathbf{x}_j} \left( d(\mathbf{y}_i, \mathbf{y}_j) - d_G(\mathbf{x}_i, \mathbf{x}_j) \right)^2 $$
    (2)

Let \( \mathbf{H} = \mathbf{I} - (1/N)\mathbf{e}\mathbf{e}^T \), where \( \mathbf{I} \) is the \( N \times N \) identity matrix and \( \mathbf{e} \) is the vector of all ones. Let \( \mathbf{Q} \) be the \( N \times N \) matrix with elements \( \mathbf{Q}_{ij} = d_G^2(\mathbf{x}_i, \mathbf{x}_j) \). The lower-dimensional embedding is obtained as \( \mathbf{Y} = \left[ \sqrt{\lambda_1}\mathbf{v}_1, \sqrt{\lambda_2}\mathbf{v}_2, \ldots, \sqrt{\lambda_{r'}}\mathbf{v}_{r'} \right]^T \), where \( \{\mathbf{v}_i\}_{i=1}^{r'} \) are the eigenvectors corresponding to the \( r' \) leading eigenvalues \( \{\lambda_i\}_{i=1}^{r'} \) of \( \mathbf{R} = -\mathbf{H}\mathbf{Q}\mathbf{H}/2 \).
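For concreteness, these three steps can be sketched in Python/NumPy as follows. This is only a minimal illustration under simplifying assumptions (only the KNN rule is used for the neighborhood graph, which is assumed to be connected); the function and parameter names are ours, not from the original papers.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def s_isomap(X, labels, k=10, alpha=0.1, r_prime=2):
    """Minimal S-Isomap sketch. X: (N, r) samples, labels: (N,) class labels."""
    labels = np.asarray(labels)
    N = X.shape[0]
    d = cdist(X, X)                                  # Euclidean distances
    beta = d[np.triu_indices(N, 1)].mean()           # average pairwise distance
    same = labels[:, None] == labels[None, :]
    # Eq. (1): label-dependent dissimilarity
    d_tilde = np.where(same,
                       np.sqrt(1.0 - np.exp(-d ** 2 / beta)),
                       np.sqrt(np.exp(d ** 2 / beta)) - alpha)
    np.fill_diagonal(d_tilde, 0.0)
    # Step 1: KNN graph weighted by the modified distances
    nn = np.argsort(d_tilde, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(N), k)
    graph = csr_matrix((d_tilde[rows, nn.ravel()], (rows, nn.ravel())), shape=(N, N))
    # Step 2: geodesic distances as shortest paths (Dijkstra, undirected graph)
    d_G = shortest_path(graph, method='D', directed=False)
    # Step 3: classical MDS on the squared geodesic distances
    Q = d_G ** 2
    H = np.eye(N) - np.ones((N, N)) / N
    R = -H @ Q @ H / 2.0
    vals, vecs = np.linalg.eigh(R)                   # ascending eigenvalues
    top = np.argsort(vals)[::-1][:r_prime]           # r' leading eigenpairs
    Y = (np.sqrt(np.maximum(vals[top], 0.0)) * vecs[:, top]).T  # (r', N) embedding
    return Y
```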

2.2 Proposed IALSI

In IALSI, incremental learning is applied to update the shortest distance matrix based on Floyd's algorithm when new training data arrive in each iteration. These new incoming samples are chosen by active learning according to a given criterion.

The main task of active learning is to select informative and representative samples for classification. According to the criterion used for sample selection, active learning methods can be roughly divided into two groups: committee-based approaches and uncertainty-based approaches. Committee-based approaches rely on a committee of classifiers and select the data on whose classification the classifiers disagree most. In uncertainty-based approaches, the most uncertain samples, determined by some uncertainty measure, are chosen and labeled.

In the proposed IALSI, the uncertainty of sample \( \mathbf{x}_i \) is measured as follows:

$$ w(\mathbf{x}_i) = \max_{1 \le c \le C} d(\mathbf{x}_i, \overline{\mathbf{x}}_c) - \min_{1 \le c \le C} d(\mathbf{x}_i, \overline{\mathbf{x}}_c), $$
(3)

where \( \overline{\mathbf{x}}_c \) is the mean of the samples belonging to class c, and C is the number of classes. A small \( w(\mathbf{x}_i) \) means that the distances between sample \( \mathbf{x}_i \) and the class centers are comparable, so labeling \( \mathbf{x}_i \) from these distances is ambiguous. Such an \( \mathbf{x}_i \) is therefore more informative and should be labeled and added to the training set.

Suppose the available training set is \( \{\mathbf{x}_i, l(\mathbf{x}_i)\}_{i=1}^N \) and the unlabeled set is \( \{\mathbf{x}_j\}_{j=1}^M \). We select m samples from the unlabeled set through active learning. The details of the active learning algorithm are summarized in Table 1, and a code sketch is given below.

Table 1. Active learning
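As an illustration, the uncertainty-based selection of Eq. (3) can be sketched as follows; the function and variable names are ours, and the class centers are taken as the means of the currently labeled samples.

```python
import numpy as np
from scipy.spatial.distance import cdist

def select_informative(X_unlabeled, X_labeled, y_labeled, m):
    """Pick the m most ambiguous unlabeled samples according to Eq. (3)."""
    y_labeled = np.asarray(y_labeled)
    classes = np.unique(y_labeled)
    # Class centers computed from the currently labeled data
    centers = np.stack([X_labeled[y_labeled == c].mean(axis=0) for c in classes])
    dist = cdist(X_unlabeled, centers)         # (M, C) distances to class centers
    w = dist.max(axis=1) - dist.min(axis=1)    # Eq. (3): uncertainty score
    return np.argsort(w)[:m]                   # smallest w = most ambiguous samples
```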

At the beginning of IALSI, the distances between points in \( \{\mathbf{x}_i, l(\mathbf{x}_i)\}_{i=1}^N \) are calculated using Eq. (1), and the shortest distance matrix \( \mathbf{D}_0 \) is obtained. In the t-th iteration \( (1 \le t \le T) \), the m samples selected by the active learning algorithm are added to the training set as \( \{\mathbf{x}_j, l(\mathbf{x}_j)\}_{j=1}^m \) and used to update the geodesic distances between points. The extended distance matrix is given as:

$$ \mathbf{D}_t = \begin{bmatrix} \mathbf{D}_{t-1} & \mathbf{D}_E \\ \mathbf{D}_E^T & \mathbf{D}_A \end{bmatrix} $$
(4)

where \( \mathbf{D}_A \) denotes the distance matrix among the newly arrived data \( \{\mathbf{x}_j, l(\mathbf{x}_j)\}_{j=1}^m \), and \( \mathbf{D}_E \) denotes the distances between \( \{\mathbf{x}_i, l(\mathbf{x}_i)\}_{i=1}^N \) and \( \{\mathbf{x}_j, l(\mathbf{x}_j)\}_{j=1}^m \). Instead of recomputing the shortest distances between all points, only the new data points are considered. For a new point \( \mathbf{x}_i \), the distance between it and a point \( \mathbf{x}_j \in \{\mathbf{x}_i, l(\mathbf{x}_i)\}_{i=1}^{N+m} \) is initialized as \( d_G(\mathbf{x}_i, \mathbf{x}_j) = \tilde{d}(\mathbf{x}_i, \mathbf{x}_j) \) if \( \mathbf{x}_j \) and \( \mathbf{x}_i \) are neighbors; otherwise, \( d_G(\mathbf{x}_i, \mathbf{x}_j) = \infty \). Then Floyd's algorithm is used to dynamically update the shortest distance matrix over all pairs of points: for each \( p = N+1, \ldots, N+m \) in turn, replace every entry \( d_G(\mathbf{x}_i, \mathbf{x}_j) \) by \( \min\{ d_G(\mathbf{x}_i, \mathbf{x}_j), d_G(\mathbf{x}_i, \mathbf{x}_p) + d_G(\mathbf{x}_p, \mathbf{x}_j) \} \). After the iteration, the resulting matrix is denoted \( \mathbf{D}_t \). Finally, the classic MDS algorithm is used to construct the r'-dimensional embedding. During the iterations, the k value of the neighborhood graph changes as follows:

$$ k\left( t \right) = k_{0} + k_{1} t $$
(5)

where \( k_0 \) is the initial value and \( k_1 \) is a factor controlling the rate of change. Table 2 lists the details of IALSI, and a code sketch of the incremental update follows the table.

Table 2. IALSI
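The following is a minimal sketch of this incremental update, assuming the modified distances of Eq. (1) from all existing and new points to the m new points have already been computed; the variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def incremental_update(D_prev, d_tilde_new, k):
    """Extend the geodesic distance matrix when m new points arrive.

    D_prev      : (N, N) current shortest-distance matrix.
    d_tilde_new : (N + m, m) modified distances (Eq. (1)) from all points
                  to the m new points.
    k           : neighborhood size k(t) used for the new points.
    """
    N = D_prev.shape[0]
    m = d_tilde_new.shape[1]
    total = N + m
    D = np.full((total, total), np.inf)
    D[:N, :N] = D_prev
    # Initialize edges for each new point: keep only its k nearest neighbors
    for j in range(m):
        col = d_tilde_new[:, j].copy()
        col[N + j] = np.inf                  # ignore the self-distance
        nn = np.argsort(col)[:k]
        D[nn, N + j] = col[nn]
        D[N + j, nn] = col[nn]
        D[N + j, N + j] = 0.0
    # Floyd-style relaxation using only the new points as intermediates
    for p in range(N, total):
        D = np.minimum(D, D[:, [p]] + D[[p], :])
    return D
```

In iteration t, k would be set according to Eq. (5), i.e., \( k(t) = k_0 + k_1 t \), before calling the update.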

Remark on computational complexity. We focus on the complexity of constructing the distance matrix and the neighborhood graph, since these steps dominate the computing load in practice. The space complexity of the original S-Isomap algorithm is \( O(N^2) \) [8]. When new data points are added, the space complexity of the batch algorithm grows to \( O((N+m)^2) \). In IALSI, only the shortest distances involving the new data need to be computed, so the space complexity of IALSI is only \( O(2Nm + m^2) \).
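As an illustrative calculation, with \( N = 500 \) training samples and \( m = 300 \) newly added samples (the Corel/MNIST setting in Sect. 3.1), the batch update involves \( (N+m)^2 = 640{,}000 \) distance entries, whereas the incremental update involves only \( 2Nm + m^2 = 390{,}000 \), roughly a 40 % reduction per iteration.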

3 Experiments and Discussions

In this section, experiments are carried out to evaluate the performance of IALSI. Three publicly available image data sets, Corel [12], UC Merced LULC [13], and MNIST [14], are used for image retrieval and classification. Two batch methods, the original Isomap and S-Isomap, and two incremental methods, incremental S-Isomap (ISI) and IALSI, are compared. Note that in ISI the samples added to the training set are randomly selected. All the algorithms are implemented in MATLAB and run on an Intel(R) Core(TM) i5-3470 PC at 3.2 GHz with 12 GB RAM.

3.1 Data Sets and Experimental Setting

Two datasets, UC Merced LULC and Corel, are used for image retrieval. The LULC data set, obtained from aerial imagery, consists of images from 20 classes with a pixel resolution of 30 cm. Each class contains 100 images of size 256 by 256 pixels. The Corel data set consists of real-world photos from the COREL Photo Gallery. It has 20 categories with 500 images per category. We extract basic color features and wavelet texture features to describe the images [15]. The features include a color histogram (32 dimensions), color moments (64 dimensions), a color auto-correlogram (6 dimensions), wavelet moments (40 dimensions) and a Gabor transform with 4 scales and 6 orientations (48 dimensions). All the features are concatenated into a single vector, so each image is represented by a 190-dimensional feature vector, which is normalized to zero mean and unit variance. The MNIST handwritten digit database is used for classification. It contains 70,000 handwritten digit images of size \( 28 \times 28 \) pixels. We randomly choose 4,000 images of the digits "0–9" for evaluation. Figure 1 gives some example images from the three data sets.

Fig. 1. Example images of the three datasets used in our experiments.

In our experiments, each dataset is randomly divided into three partitions: a training set, a candidate set and a test set. The candidate set is used for sample selection: in each iteration, the informative samples are selected from it and added to the training set. For the LULC, Corel and MNIST data sets, the numbers of training and test samples are 200 and 300, 500 and 500, and 500 and 500, respectively; the remaining samples act as candidates. For the supervised DR methods, a BP neural network is constructed to approximate the mapping from the high-dimensional space to the lower-dimensional space. In the reduced space, the L2 distance metric is used for image retrieval to measure the similarities between a query image from the test set and the training images. For the classification task, a KNN \( (K = 1) \) classifier is used to predict the class labels of the test samples.
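As an illustration of the retrieval evaluation (the BP-network out-of-sample mapping is not shown), the precision of top-k retrieval with the L2 metric in the embedded space can be computed as follows; the function name and the \( r' \times N \) array layout are our own conventions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def precision_at_k(Y_query, y_query, Y_train, y_train, k=5):
    """Average fraction of the top-k retrieved training images
    that share the query image's class label."""
    d = cdist(Y_query.T, Y_train.T)          # L2 distances in the embedded space
    top = np.argsort(d, axis=1)[:, :k]       # indices of the k nearest training images
    hits = y_train[top] == y_query[:, None]  # label matches among retrieved images
    return hits.mean()
```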

In our experiments, the dimension of the data is reduced to the number of classes of each data set. The parameters of each method are carefully tuned to obtain the best performance. In particular, for Isomap and S-Isomap, the k value in KNN is set to 25 for LULC, 80 for Corel and 60 for MNIST. For the incremental learning methods, \( k_0 \) is set to 15 for LULC and 60 for Corel and MNIST, and \( k_1 \) is 30 for all three data sets. In addition, we set \( m = 300 \) for all datasets, \( T = 3 \) for LULC and \( T = 6 \) for the others.

3.2 Experimental Results and Analysis

Table 3 summarizes the retrieval precisions of the four methods when the top 5 to 25 images (in steps of 5) are retrieved. In addition, classification accuracies on MNIST are listed. It can be seen that: (1) the supervised methods perform better than unsupervised Isomap; (2) more training samples help to improve the performance; (3) IALSI, with active learning, outperforms the other methods, as more informative samples are chosen.

Table 3. Performance comparison of different methods

The running times of the different methods are shown in Fig. 2. Compared with the batch methods, the incremental learning methods are more efficient. Taking the Corel data set as an example, Fig. 3 further depicts the running time over the iterations. For the batch methods, the distance matrix has to be recomputed whenever new samples are added, so the computational cost rises dramatically, whereas a smooth increase is observed for the incremental methods.

Fig. 2. Running time comparison of different methods

Fig. 3. Running time during iterations on the Corel data set

4 Conclusions

An incremental active learning method for S-Isomap, IALSI, is proposed in this paper. IALSI converts the original S-Isomap into an incremental learning version while selecting informative samples through active learning in each iteration. The core idea of IALSI is to update the geodesic distance matrix based on existing computation results and to select informative samples according to an uncertainty criterion. The encouraging experimental results show that IALSI not only achieves better performance in image retrieval and classification tasks, but also greatly reduces the running time.