1 Introduction

Hyperspectral images (HSIs), which comprise hundreds of spectral bands and provide rich spectral and spatial information, are widely used in agriculture [24], environmental monitoring [32], mineral exploration [27], military and security [31], astronomy [13], medicine [25], chemistry [34], urban planning [38], etc. For these applications, HSI classification, which assigns a specific class to each pixel, is an important basic task, because the effectiveness of all these applications is directly affected by the classification accuracy. Unfortunately, the imbalance between the high dimensionality of spectral bands and the limited number of labeled samples makes it very difficult to improve classification accuracy. On the one hand, the explosion of dimensionality not only provides abundant spectral information but also introduces enormous redundancy and noise, which can cause classification accuracy to decrease rather than increase. This phenomenon is known as the curse of dimensionality. On the other hand, the high cost of labeling samples results in a small number of labeled samples for model training. Therefore, how to extract deep discriminative features from a small number of training samples becomes a key step of HSI classification tasks [21, 35].

Traditional feature extraction (FE) methods consist of band selection (BS) and dimensionality reduction (DR). The purpose of BS is to select a subset of all spectral bands that not only has a smaller dimensionality but also retains enough features of the raw data for classification [11, 30, 37]. The purpose of DR is to find a lower-dimensional representation of the raw high-dimensional data according to some mapping algorithm, such as principal component analysis (PCA) [9, 18, 22, 41, 45], linear discriminant analysis (LDA) [7, 16, 19, 28, 43], morphological attribute profiles (MAPs) [2, 8, 10, 23, 40], etc. In the various BS algorithms, only the features of the selected bands are used for classification; the features of the other bands are discarded, which wastes valuable feature information. The DR algorithms are mainly based on handcrafted features, so only shallow features can be obtained. Due to this inability to obtain deep features, it is difficult for traditional classification methods to further improve classification accuracy.

In recent years, deep learning (DL) has shown an amazing ability to extract deep features and has achieved great success in machine vision, inspiring researchers to introduce DL models into HSI classification. The DL models for HSI classification mainly include the deep belief network (DBN), convolutional neural network (CNN), autoencoder (AE), etc. Chen [5] proposed a novel deep model architecture for HSI classification, which combined PCA for dimensionality reduction, a DBN model for spectral feature extraction, and logistic regression as a classifier. Ghassemi [12] proposed an HSI classification framework in which a DBN was applied to extract spectral-spatial features. Because the DBN is a one-dimensional (1D) model, the two-dimensional (2D) spatial data must be flattened into 1D vectors before spatial features are extracted. This flattening causes the loss of spatial structure and limits the improvement of classification accuracy.

CNN models, which are the most widely used DL models for HSI classification, mainly fall into three categories: 1D-CNN, 2D-CNN, and 3D-CNN [1]. Hu [39] proposed a 1D-CNN model, which consisted of a convolutional layer, a max pooling layer, and a fully connected layer, for HSI classification with spectral features only. Li [20] proposed a pixel-pair 1D-CNN method combining spectral and spatial information as the model input to improve classification accuracy. Yue [44] presented a framework that consisted of PCA for dimensionality reduction, a deep 2D-CNN for spectral-spatial feature extraction, and a logistic regression classifier. Yu [42] introduced a deconvolution layer into a deep 2D-CNN model to enhance the features extracted from the raw data. Haque [14] proposed a multi-scale 2D-CNN model named PCA-MS-CNN for HSI classification. Li [21] proposed a lighter 3D-CNN framework, which consisted of 3D convolution layers and fully connected layers. Roy [29] proposed a hybrid CNN consisting of a spectral-spatial 3D-CNN followed by a spatial 2D-CNN. Zhang [46] proposed an Attention-Dense-HybridSN network based on 3D-CNN and 2D-CNN, in which a 3D-Dense block was used to extract spectral-spatial features and channel and spatial attention were introduced to refine the extracted features. Because 1D-CNN and 2D-CNN models cannot extract the joint spectral-spatial features of HSI data, HSI classification methods based on them lose effective information. The 3D convolution kernel structurally matches the 3D cube data, so it can be used to extract joint spatial-spectral features. In addition, all the above-mentioned CNNs are supervised learning models; satisfactory classification accuracy (CA) can be obtained with sufficient labeled training samples, but the CA declines rapidly when training samples are few.

In recent years, the AE, as an unsupervised learning model, has gained much attention. An AE is composed of an encoder, which learns a representation of the input data without labeled samples, and a decoder, which reconstructs the input data. Chen [4] proposed three 1D stacked AE (SAE) models, which were used for HSI classification with spectral information, spatial information, and spectral-spatial features respectively. Palma [17] proposed a hybrid unsupervised model based on a 1D SAE by introducing CNNs into the training process of the encoder and decoder. Mei [26] proposed a 3D convolutional autoencoder (3D-CAE), which consisted of an encoder with only 3D convolutional operations to maximally exploit spatial-spectral information and a decoder to reconstruct the raw data. Sun [33] proposed a multi-scale 3D-CAE model composed of 3D convolutional and deconvolutional layers.

To address the problem that the classification accuracy of models declines significantly as the number of training samples decreases, a novel deep learning framework named the Two-stage Multi-dimensional Convolutional Stacked Autoencoder (TMC-SAE) for HSI classification is proposed in this paper. The main contributions of this paper are summarized as follows.

  (1) The TMC-SAE model is proposed for the classification of hyperspectral remote sensing images. Compared with other state-of-the-art models, it achieves the highest classification accuracy with a small number of training samples.

  (2) The TMC-SAE consists of two independent stacked autoencoders, SAE-1 and SAE-2, which are trained independently by unsupervised learning. This architecture keeps the depth of SAE-1 and SAE-2 moderate while still ensuring that TMC-SAE can extract deep features from HSIs.

  (3) SAE-1 is designed as a 1D asymmetric SAE for spectral dimensionality reduction. The 5-layer encoder of SAE-1 contains more trainable parameters than the 3-layer decoder, so the feature extraction ability of the encoder receives more attention during training.

  (4) SAE-2 is designed as a hybrid network with 3D convolution and 2D convolution operations. The deep joint spatial-spectral features extracted by SAE-2 ensure that classification accuracy remains high when the number of training samples is small.

The remaining part of this paper is organized as follows. The related theoretical basis is described in Section 2. The framework details of TMC-SAE are presented in Sections 3 and 4. The experimental results over three benchmark hyperspectral datasets are shown in Section 5. Finally, conclusions are drawn in Section 6.

2 Related works

2.1 Stacked autoencoder

Figure 1 shows the general architecture of the autoencoder (AE), which consists of an encoder and a decoder. The function of the encoder is to extract the features of the input data and reduce the dimensionality of the data. The purpose of the decoder is to reconstruct the original data from the features extracted by the encoder.

Fig. 1

The architecture of autoencoder

During training, the encoder maps the input \(X\in {R}^{h}\) to a low-dimensional representation \(Y\in {R}^{i}\), and the decoder recovers \(\widetilde{X}\in {R}^{h}\) from \(Y\in {R}^{i}\) through an inverse transformation. The purpose of training is to minimize the error between \(X\) and \(\widetilde{X}\). This stage can be formulated mathematically as

$$\begin{array}{l}Y=f(W_eX+b_e)\\\widetilde X=g(W_dY+b_d)\\\arg\;\min\lbrack loss(X,\widetilde X)\rbrack\end{array}$$
(1)

where \({W}_{e}\), \({b}_{e}\) and \(f\left(\cdot \right)\) denote the weights, bias and activation function of the encoder respectively, and \({W}_{d}\), \({b}_{d}\) and \(g\left(\cdot \right)\) denote the weights, bias and activation function of the decoder respectively.
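As a concrete illustration, a minimal sketch of Eq. (1) as a single-layer autoencoder in TensorFlow/Keras (the library used for the experiments in Section 5) is given below. The dimensions, activation functions and MSE loss here are illustrative assumptions, not the configuration of the proposed model.

```python
import tensorflow as tf

h, i = 200, 20  # input and code dimensions (assumed values for illustration)

inputs = tf.keras.Input(shape=(h,))
# Encoder: Y = f(W_e X + b_e)
code = tf.keras.layers.Dense(i, activation="relu", name="encoder")(inputs)
# Decoder: X_tilde = g(W_d Y + b_d)
recon = tf.keras.layers.Dense(h, activation="linear", name="decoder")(code)

autoencoder = tf.keras.Model(inputs, recon)
# Training minimizes loss(X, X_tilde); MSE is a common choice of loss.
autoencoder.compile(optimizer="adam", loss="mse")
```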

During testing, only the encoder is adopted for feature extraction, and the features extracted by the encoder are fed into the classifier, as shown in Fig. 2. The decoder is used only to obtain reconstructed data during the training phase. The closer the reconstructed data is to the input data, the more representative the extracted features are considered to be.

Fig. 2

Testing process of the autoencoder

An AE whose encoder and decoder each contain more than one neural network layer is called a stacked autoencoder (SAE). In general, the numbers of operation layers in the encoder and decoder are equal, and the operations of the decoder are the inverse of those of the encoder. In other words, the encoder and decoder are structurally symmetrical. The symmetrical structure makes an SAE easy to construct. However, it makes increasing the depth of an SAE difficult, because adding one layer to the encoder requires adding one layer to the decoder, so the total number of SAE layers increases by 2. In order to increase the depth of the encoder, an asymmetric SAE structure is proposed, in which the decoder has fewer layers than the encoder. As a result, the encoder has more layers and trainable parameters with which to extract deep features for classification.

2.2 2D and 3D convolution

2D convolution and 3D convolution, whose principles are shown in Fig. 3, are the basic operations for extracting features in convolutional neural networks.

Fig. 3

a 2D convolution operation; b 3D convolution operation

In the 2D convolution operation, input data is convolved with 2D kernels. The output data \({y}_{i,j}^{x,y}\) at spatial position \(\left(x,y\right)\) in the jth feature map of the ith layer is denoted as

$${y}_{i,j}^{x,y}=f\left(\sum_{m}\sum_{p=0}^{{W}_{1}-1}\sum_{q=0}^{{W}_{2}-1}{w}_{i,j,m}^{p,q}{v}_{(i-1),m}^{(x+p)(y+q)}+{b}_{i,j}\right)$$
(2)

where \(m\) is the index of the feature maps in the \(\left(i-1\right)\)th layer, \({w}_{i,j,m}^{p,q}\) is the weight at position \(\left(p,q\right)\) connected to the mth feature map, \({W}_{1}\) and \({W}_{2}\) are the width and height of the kernel, \({b}_{i,j}\) is the bias of the jth feature map in the ith layer, and \(f\left(\cdot \right)\) is the activation function. Through 2D convolution operations, deep spatial features can be extracted from the input data.
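To make the index conventions of Eq. (2) concrete, the following is a direct (unoptimized) NumPy transcription for a single output feature map. The valid-padding behavior and the ReLU default are assumptions made for illustration only.

```python
import numpy as np

def conv2d_single_map(v, w, b, f=lambda t: np.maximum(t, 0.0)):
    """v: input maps, shape (M, H, W); w: kernel, shape (M, W1, W2); b: scalar."""
    M, H, W = v.shape
    _, W1, W2 = w.shape
    out = np.zeros((H - W1 + 1, W - W2 + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # Sum over input maps m and kernel positions (p, q), as in Eq. (2).
            out[x, y] = np.sum(w * v[:, x:x + W1, y:y + W2]) + b
    return f(out)
```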

In the 3D convolution operation, the input data is convolved with 3D kernels. The output data \({y}_{i,j}^{x,y,z}\) at position \(\left(x,y,z\right)\) of the jth feature map in the ith layer is given by

$${y}_{i,j}^{x,y,z}=f\left(\sum_{m}\sum_{p=0}^{{W}_{1}-1}\sum_{q=0}^{{W}_{2}-1}\sum_{r=0}^{{W}_{3}-1}{w}_{i,j,m}^{p,q,r}{v}_{(i-1),m}^{(x+p)(y+q)(z+r)}+{b}_{i,j}\right)$$
(3)

where \({w}_{i,j,m}^{p,q,r}\) is the weight at position \(\left(p,q,r\right)\) connected to the mth feature map in the \(\left(i-1\right)\)th layer, \({W}_{3}\) is the size of the kernel along the spectral dimension, and the other parameters are the same as in Eq. (2). The structure of the 3D kernel is consistent with that of the HSI data cube, so 3D convolution operations can extract spatial and spectral features simultaneously.
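The practical difference between the two operations can be seen in a short shape check, given here as a sketch with an assumed 9 × 9 patch of 20 bands and arbitrary kernel and filter sizes: the 2D kernel mixes the bands only as input channels, while the 3D kernel also slides along the spectral axis.

```python
import tensorflow as tf

patch_2d = tf.random.normal((1, 9, 9, 20))     # (batch, H, W, bands-as-channels)
patch_3d = tf.random.normal((1, 9, 9, 20, 1))  # (batch, H, W, bands, channels)

y2 = tf.keras.layers.Conv2D(8, kernel_size=3, padding="same")(patch_2d)
y3 = tf.keras.layers.Conv3D(8, kernel_size=3, padding="same")(patch_3d)

print(y2.shape)  # (1, 9, 9, 8)     - spatial mixing only, bands fused as channels
print(y3.shape)  # (1, 9, 9, 20, 8) - the kernel also moves along the spectral axis
```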

3 Proposed TMC-SAE

3.1 Framework of the proposed TMC-SAE

In this paper, the TMC-SAE is proposed for HSI classification. As shown in Fig. 4, the TMC-SAE is composed of two stacked autoencoders (SAEs), SAE-1 and SAE-2, and a classifier. Both SAE-1 and SAE-2 contain an encoder and a decoder. The functions of the encoders and decoders are to extract features and to reconstruct the input data, respectively. The decoders are designed only for training the encoders and are not used for classification. The network for classification is composed of the SAE-1 encoder, the SAE-2 encoder, and the classifier. The structures and training details of SAE-1, SAE-2, and the classifier are described below.

Fig. 4

Framework of TMC-SAE

SAE-1 is a 1D SAE with an asymmetric structure, as shown in Fig. 5, in which the encoder is based on fully connected (FC) layers and the decoder is based on 1D deconvolutional layers. The purpose of this asymmetric structure is to give the encoder more trainable parameters than the decoder and thereby improve its feature extraction ability.

Fig. 5

Structure of SAE-1

The encoder of SAE-1 consists of five FC layers that contain k1, k2, k3, k4, and k5 neurons respectively, and each FC layer is followed by a batch normalization (BN) layer, an activation layer with the ReLU activation function, and a dropout layer (rate = 0.5). The decoder of SAE-1 is composed of three 1D deconvolution (DC) layers, and each DC layer is followed by a BN layer and an activation layer.

It is assumed that the raw HSI data is represented by \(\boldsymbol X\in {\mathbb{R}}^{M\times N\times B}\), where \(M\) and \(N\) are the height and width of the image and \(B\) is the number of spectral bands. After the spectral dimension reduction by the encoder, each pixel vector \(x\in {\mathbb{R}}^{B}\) is mapped to a feature vector \(h\) of dimensionality k5. The trained encoder is used to reduce the dimensionality of the raw HSI data, and its output of size \(M\times N\times k5\) is taken as the input of SAE-2. The encoder of SAE-1 thus reduces the number of spectral bands from \(B\) to k5 while maintaining the same spatial dimensions.
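A minimal sketch of the SAE-1 encoder described above is given below, assuming TensorFlow/Keras. The neuron counts k1–k5 are placeholders, since the paper sets them per dataset so that the spectral dimension B is compressed to roughly B/8 (see Section 4.2).

```python
import tensorflow as tf

def build_sae1_encoder(B, k=(128, 96, 64, 32, 20)):
    """Five FC layers, each followed by BN, ReLU and dropout (rate = 0.5).
    B is the number of spectral bands; k holds assumed k1...k5 values."""
    model = tf.keras.Sequential(name="sae1_encoder")
    model.add(tf.keras.Input(shape=(B,)))
    for units in k:
        model.add(tf.keras.layers.Dense(units))
        model.add(tf.keras.layers.BatchNormalization())
        model.add(tf.keras.layers.ReLU())
        model.add(tf.keras.layers.Dropout(0.5))
    return model
```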

A hybrid network, SAE-2, is proposed to further extract spectral-spatial features from the data reduced by the encoder of SAE-1. The framework of SAE-2 is shown in Fig. 6. It consists of an encoder, which stacks three 3D convolution layers and three 2D convolution layers to extract spatial-spectral features simultaneously, and a companion decoder, which is composed of three 3D deconvolution layers and three 2D deconvolution layers to reconstruct the input data from the features extracted by the encoder.

Fig. 6

Structure of SAE-2 and classifier

The SAE-1 encoder output \(X\in {\mathbb{R}}^{M\times N\times k_5}\) is divided into 3D neighboring patches \(\boldsymbol P\in {\mathbb{R}}^{S\times S\times k_5}\), which are taken as the input of SAE-2. Each patch \({P}_{x,y}\in \boldsymbol P\), centered at the pixel at spatial location \(\left(x,y\right)\), is generated by covering an \(S\times S\) window and all spectral bands. The function of the reshape layer is to combine the spectral and channel dimensions of the feature maps so that they suit the subsequent 2D convolution layers; the reshape layer contains no trainable parameters. SAE-2 is trained by backpropagation with an MSE loss function. In both the encoder and decoder, the ReLU activation function is adopted for every convolution and deconvolution layer to improve the fitting ability of the network.
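A sketch of the SAE-2 encoder under this description is shown below: three 3D convolutions, a reshape that folds the spectral dimension into the channel dimension, and three 2D convolutions. Kernel size 3 and stride 1 follow Section 4.2; the filter counts and default patch size are illustrative assumptions.

```python
import tensorflow as tf

def build_sae2_encoder(S=9, k5=20):
    """S x S x k5 input patches; the default S and k5 are assumptions."""
    inp = tf.keras.Input(shape=(S, S, k5, 1))
    x = inp
    for filters in (8, 16, 32):  # three 3D convolution layers
        x = tf.keras.layers.Conv3D(filters, 3, strides=1, padding="same",
                                   activation="relu")(x)
    # Reshape layer: merge the spectral and channel dimensions so the
    # feature maps suit the 2D branch; it has no trainable parameters.
    x = tf.keras.layers.Reshape((S, S, k5 * 32))(x)
    for filters in (64, 64, 32):  # three 2D convolution layers
        x = tf.keras.layers.Conv2D(filters, 3, strides=1, padding="same",
                                   activation="relu")(x)
    return tf.keras.Model(inp, x, name="sae2_encoder")
```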

After SAE-2 is trained, its encoder is used independently to provide the extracted spatial-spectral features to the classifier. The classifier consists of a flatten layer, which expands the features extracted by the SAE-2 encoder into 1D vectors, and three FC layers. The first two FC layers, with ReLU activation functions, are designed to extract further features and are each followed by a dropout layer to prevent overfitting. The last FC layer, with the same number of neurons as the number of pixel classes, uses the softmax activation function to implement the classification.
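A corresponding sketch of the classifier head follows; the widths of the first two FC layers are illustrative assumptions, while the flatten layer, dropout layers, and softmax output match the description above.

```python
import tensorflow as tf

def build_classifier(feature_shape, n_classes):
    return tf.keras.Sequential([
        tf.keras.Input(shape=feature_shape),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),  # width assumed
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(128, activation="relu"),  # width assumed
        tf.keras.layers.Dropout(0.5),
        # One neuron per land-cover class, softmax for classification.
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ], name="classifier")
```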

3.2 Details of training

The training of TMC-SAE is a three-phase process. (1) Training of SAE-1 by unsupervised learning. In this step, the encoder of SAE-1 automatically extracts features from the raw spectral data and the decoder reconstructs the raw data from the output of the encoder. The training dataset is composed of all pixel vectors. The trained encoder of SAE-1 realizes dimension reduction from the raw HSI data \(\boldsymbol X\in {\mathbb{R}}^{M\times N\times B}\) to \(\boldsymbol Y\in {\mathbb{R}}^{M\times N\times {k}_{5}}\) in the spectral dimension only. (2) Training of SAE-2 by unsupervised learning. This process is the same as step (1) except that the training data are the features extracted by the trained SAE-1 encoder. In this phase, the 3D neighboring patch dataset \(\boldsymbol Z\in {\mathbb{R}}^{P\times P\times {k}_{5}}\), which contains the information of all labeled pixels and is generated from \(\boldsymbol Y\in {\mathbb{R}}^{M\times N\times {k}_{5}}\), is taken as the training dataset. The parameter P represents the patch window size of the training samples. (3) Training of the classifier and fine-tuning of SAE-2 by supervised learning with a small number of labeled samples. In this phase, the dataset \(\boldsymbol Z\in {\mathbb{R}}^{P\times P\times {k}_{5}}\) is divided into training and testing groups. The classifier training and the SAE-2 encoder fine-tuning are performed simultaneously on the training group. After the above process, the classification performance of TMC-SAE is verified on the testing group. It can be seen from the above details that the features of all pixels can be used for SAE-1 and SAE-2 training. This allows the encoders of SAE-1 and SAE-2 to make maximum use of the information in the dataset instead of relying on only a small number of labeled samples. Thanks to the deep feature extraction ability of the SAE-1 and SAE-2 encoders, high classification accuracy can still be obtained with a small training group. The detailed flowchart of TMC-SAE training and testing is shown in Fig. 7.

Fig. 7

Flowchart of training and testing for TMC-SAE
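The three phases can be summarized in the following sketch. The objects sae1, sae2, sae2_encoder and classifier and the arrays pixels, patches, train_patches and train_labels are hypothetical placeholders; the Adam optimizer and the 200-epoch budget are assumptions, while the learning rates follow Section 4.2 and the MSE loss of SAE-2 follows Section 3.1 (MSE is assumed for SAE-1 as well).

```python
import tensorflow as tf

# Phase 1: unsupervised training of SAE-1 on all pixel vectors.
sae1.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
sae1.fit(pixels, pixels, epochs=200)

# Phase 2: unsupervised training of SAE-2 on patches built from the
# SAE-1 encoder output.
sae2.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
sae2.fit(patches, patches, epochs=200)

# Phase 3: supervised training of the classifier and simultaneous
# fine-tuning of the SAE-2 encoder on the small labeled training group.
full = tf.keras.Sequential([sae2_encoder, classifier])
full.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
             loss="sparse_categorical_crossentropy", metrics=["accuracy"])
full.fit(train_patches, train_labels, epochs=200)
```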

4 Experimental details

4.1 Data description

In this paper, three benchmark hyperspectral datasets with different environmental settings are adopted to validate the proposed network. The first dataset was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) instrument over a mixed vegetation site in northwestern Indiana (Indian Pines, IP). It contains \(145\times 145\) pixels with 220 spectral channels covering the range from 0.4 to 2.5 \(\mu m\). The second dataset was acquired over the Kennedy Space Center (KSC), Florida. It consists of \(512\times 614\) pixels with 176 spectral bands and contains 13 land-cover classes. The third dataset was gathered over Salinas Valley (SV), California. It contains \(512\times 217\) pixels and 224 bands in the range of 0.4–2.5 \(\mu m\); 204 bands remain in the corrected data after 20 water absorption bands are removed. The land-cover classes and the number of labeled pixels of each class for all datasets are listed in Table 1. The ground truth images of all datasets are shown in Fig. 8. All experiments were conducted on a computer with an Intel(R) Core i7 CPU, an Nvidia GeForce RTX 3090 GPU, and 64 GB of RAM.
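For reference, a typical loading sketch for the IP dataset is shown below, assuming the widely distributed MATLAB files; the file and variable names are conventions of that distribution (in which the corrected cube has 200 bands), not part of this paper.

```python
from scipy.io import loadmat

# File and key names follow the commonly distributed version of the
# dataset and may differ locally.
data = loadmat("Indian_pines_corrected.mat")["indian_pines_corrected"]
gt = loadmat("Indian_pines_gt.mat")["indian_pines_gt"]
print(data.shape, gt.shape)  # e.g. (145, 145, 200) and (145, 145)
```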

Table 1 The Class labels and number of training and testing samples
Fig. 8

Ground truth image. a IP dataset. b KSC dataset. c SV dataset

4.2 Network construction

Because the numbers of bands in the three datasets are different, the numbers of neurons (k1 ~ k5) in the FC layers of the SAE-1 encoder are different. In general, the spectral band compression ratio of the SAE-1 encoder is about 1/8. The network structure of SAE-1 is given in Table 2, from which it can be seen that the encoder contains far more trainable parameters than the decoder. This asymmetric structure improves the feature extraction ability of the encoder.

Table 2 Network structures of SAE-1

The parameters of all layers in SAE-2 are the same for all datasets. The structure of SAE-2 and the classifier is given in Table 3. In SAE-2, the kernel sizes and strides of all layers are based on 3 and 1, respectively. The purpose of this design is to reduce the number of trainable parameters and the loss of spatial-spectral information during training. The activation function employed in the network is ReLU except for the last layer of the classifier. The learning rate is 0.001 for both SAE-1 and SAE-2 training, and 0.0001 when the classifier is trained and the encoder of SAE-2 is fine-tuned.

Table 3 Network structures of SAE-2 and classifier

5 Experimental results and analysis

5.1 Analysis of parameters

In the architecture of TMC-SAE, the depth of SAE-1 is an important parameter for classification performance. A series of experiments was conducted to evaluate the impact of the SAE-1 depth on the classification results. In these experiments, the depth of the SAE-1 encoder was set to eight values from 1 to 8, and the overall accuracy (OA) was used to evaluate the classification performance of TMC-SAE at each depth on the three datasets. The experimental results are shown in Fig. 9. It can be seen that the OA first increases and then decreases as the depth of SAE-1 increases. This indicates that a deeper SAE-1 can extract more representative, deeper features but eventually encounters overfitting. Based on the experimental results, the depth of the SAE-1 encoder was set to 5.

Fig. 9

Impact of SAE-1 encoder depth on overall accuracy

The encoder of SAE-2 consists of 2D convolution layers and 3D convolution layers. The purpose of the 3D convolution operations is to extract joint spatial-spectral features from the data that have been dimensionally reduced by SAE-1. The function of the 2D convolution operations is to extract deeper features for the classification task. In order to evaluate the effectiveness of the 3D and 2D convolution operations, incomplete SAEs without the 3D convolution branch and without the 2D branch were used in separate classification experiments. The experimental results shown in Fig. 10 indicate that removing either the 2D or the 3D operations slightly reduces classification accuracy.

Fig. 10

Impact of 2D or 3D operations in SAE-2 on overall accuracy

The loss and classification accuracy convergence curves on the training group are portrayed in Fig. 11. It can be seen that both curves converge at about 200 epochs for all datasets.

Fig. 11

The training losses and classification accuracy curves

5.2 Visualization and analysis of SAE-1

In order to gain a detailed understanding of SAE-1, a visualization of the spectral information is provided in this section. Spectral curves are used to visualize the features before and after extraction by SAE-1. The raw spectral curves of graminoid marsh (class 8) and spartina marsh (class 9) in KSC are shown in Fig. 12a and b. Obviously, the two curves are very similar and difficult to distinguish. The feature curves extracted by SAE-1 are shown in Fig. 12c and d. These two features, whose dimensions are reduced from 175 to 20, become more discriminable and abstract.

Fig. 12

Representative spectral curves of two land-cover classes of the KSC dataset. a Original spectrum of class 8 graminoid marsh. b Original spectrum of class 9 spartina marsh. c Features of class 8 after SAE-1. d Features of class 9 after SAE-1
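A sketch of how such curves can be produced is given below, reusing the hypothetical sae1_encoder from Section 3.1 and an arbitrary pixel vector; it simply plots the raw spectrum next to its encoded feature vector.

```python
import matplotlib.pyplot as plt

# `sae1_encoder` and `pixel` are hypothetical placeholders: a trained
# encoder and one raw KSC pixel spectrum as a 1D array.
features = sae1_encoder(pixel[None, :]).numpy().ravel()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(pixel)
ax1.set_title("Raw spectrum")
ax2.plot(features)
ax2.set_title("SAE-1 features")
plt.show()
```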

5.3 Comparison of classification results

In this experiment, the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa) are used to evaluate the classification results. In addition, the results of the proposed TMC-SAE are compared with six state-of-the-art HSI classification models covering unsupervised and supervised learning with different dimensions: 1D-CNN [39], 2D-CNN [36], 3D-CNN-C [6], M3D-DCNN [15], 3D-CNN-H [3], and 3D-CAE [33]. The architectures and hyperparameters of these comparative models are consistent with those given in the corresponding papers. All the models are implemented in Python using the TensorFlow library. In order to verify the feature extraction ability of the proposed model with a small number of labeled samples, the training sample percentage of each class is set to 5% for IP, 5% for KSC, and 1% for SV.
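For clarity, the three metrics can be computed as in the following sketch, where y_true and y_pred are hypothetical label arrays over the testing group.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def oa_aa_kappa(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean per-class accuracy
    kappa = cohen_kappa_score(y_true, y_pred)   # Kappa coefficient
    return oa, aa, kappa
```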

The quantitative results over the IP, KSC, and SV datasets are listed in Tables 4, 5, and 6 respectively. It can be observed from the three tables that the OA, AA, and Kappa of the proposed TMC-SAE outperform those of all other models for all datasets. The OA of TMC-SAE reaches 92.65% for IP, 94.41% for KSC, and 98.50% for SV. The best accuracy for classes 1–4, 10, 11, 13, and 15 of IP, classes 1, 3, and 6–13 of KSC, and classes 1–3, 5–7, 9, 13, 15, and 16 of SV is achieved by the proposed TMC-SAE model. The experimental results show that no class accuracy of the proposed TMC-SAE is much lower than the others, even when the training samples are very few. It can be concluded that the feature extraction capability of TMC-SAE is stronger and that this capability is enhanced by the unsupervised learning of SAE-1 and SAE-2. Figure 13 illustrates the classification maps of the IP dataset for each above-mentioned model. The quality of the classification map of TMC-SAE is much better than those of the other models, especially for the classes with small numbers of samples.

Table 4 Classification accuracy of different models over the Indian Pines dataset
Table 5 Classification accuracy of different models over the KSC dataset
Table 6 Classification accuracy of different models over the Salinas dataset
Fig. 13

Classification maps generated by different models over IP dataset. a 1D-CNN. b 2D-CNN. c 3D-CNN-C. d M3D-DCNN. e 3D-CNN-H. f 3D-CAE. g Proposed TMC-SAE

5.4 Impact of the training sample size

In this part, the effect of different training sample sizes on all models is explored. For the IP and KSC datasets, the percentage of training samples is set to 3%, 5%, 10%, 15%, and 20%; for the SV dataset, it is set to 0.5%, 1%, 3%, 5%, and 7%. Figure 14 shows the OA results for the different percentages of training samples on all datasets. As can be observed in Fig. 14, all models obtain higher classification results with a larger proportion of training samples. However, as the proportion of training samples declines, the decline in classification accuracy varies greatly across models. For the IP dataset, the OA results of 2D-CNN, 3D-CNN-C, 3D-CAE, and TMC-SAE are similar when the percentage of training samples is 20%. However, there is a difference of more than 10% between the largest OA (proposed TMC-SAE, 85.29%) and the smallest (M3D-DCNN, 75.11%) when the percentage of training samples is reduced to 3%. The proposed TMC-SAE model generates the highest accuracies in all experiments with small numbers of training samples. Specifically, when the proportion of training samples is 3% or 5%, the decline in classification accuracy of the proposed TMC-SAE is the smallest. For the SV dataset, when the percentage of training samples is 7%, the OA results of all methods except 1D-CNN exceed 99%, indicating that these models can extract sufficient features for classification when there are enough training samples. When the percentage of training samples decreases, especially to 1% and 0.5%, the OA of TMC-SAE remains the highest, indicating that TMC-SAE maintains better feature extraction ability with small numbers of training samples.

Fig. 14

Experimental results of all models with different percentages of training samples over three datasets. a IP. b KSC. c SV

6 Discussion and conclusion

In this paper, a new network architecture for hyperspectral remote sensing image classification is proposed. It consists of two stacked autoencoder networks, SAE-1 and SAE-2. SAE-1, based on 1D operations, performs feature extraction in the spectral domain only. Its asymmetric architecture improves its feature extraction ability by giving the encoder more trainable parameters than the decoder. SAE-2, based on 2D and 3D CNNs, extracts joint spatial-spectral features from the information compressed by SAE-1. Previous networks generally include only one unsupervised learning stage during training. In this paper, the proposed TMC-SAE is divided into two independent autoencoders, SAE-1 and SAE-2, which increases the number of unsupervised training stages to two, so that the information in unlabeled samples can be exploited more fully. The experimental results on real hyperspectral images demonstrate that the proposed TMC-SAE achieves better classification results with a small number of training samples.