1 Introduction

Scene classification of remote sensing images automatically assigns scene images to specified semantic categories based on their contents [1]. Currently, supervised methods based on deep neural networks are the mainstream; they usually rely on large-scale labeled samples to obtain high classification accuracy [2]. However, labeling remote sensing images is often costly. Unsupervised scene classification methods can learn directly from a large number of unlabeled samples, but their classification accuracy rarely meets the demands of practical applications because they cannot make use of labels [3]. Semi-supervised scene classification combines the two paradigms: it learns from a small number of labeled samples and a large number of unlabeled samples to obtain satisfactory classification accuracy [4].

In recent years, generative adversarial networks (GANs) [5] have been introduced into semi-supervised image classification. GANs can learn the underlying data distribution from real training samples, allowing them to compete with state-of-the-art semi-supervised image classification methods [6, 7]. Salimans et al. [8] proposed feature-matching GANs (FMGAN) by extending the standard classifier. Li et al. [9] proposed a triple GAN (tripleGAN) that achieves excellent semi-supervised classification performance through a three-player game. Dai et al. [10] proposed a new GAN-based semi-supervised classification model (BADGAN) that effectively improves classification performance. Lecouat et al. [11] leveraged GANs with manifold regularization (REGGAN) for semi-supervised image classification. These GAN-based methods have achieved strong semi-supervised classification performance using variants of the standard DCGAN [12] on the CIFAR10 and SVHN datasets. GAN-based methods augment the training set with generated samples to improve classification performance. However, for complex remote sensing scene images, it is difficult for the discriminative network of a standard DCGAN to extract sufficiently discriminative features, which limits semi-supervised classification performance. Therefore, investigating more discriminative feature extraction to improve semi-supervised classification performance remains a crucial challenge.

Although some success has been achieved in the classification of low-resolution images, GAN-based remote sensing scene classification still needs improvement. More recently, Roy et al. [13] introduced a semantic branch into GANs (SFGAN) for semi-supervised satellite image classification to obtain better classification performance. However, SFGAN uses a standard DCGAN structure, which is suitable for processing images with relatively simple scenes and low spatial resolution; its classification performance is more limited for high-resolution remote sensing images with complex scenes. Inspired by SFGAN, we further investigate methods to extract more discriminative features through the discriminative network. Guo et al. [4] proposed a GAN-based semi-supervised remote sensing scene classification method (SAGGAN), which introduces a gating unit and a self-attention gating (SAG) module into the discriminative network to improve semi-supervised classification performance. Ledig et al. [14] used GANs with dense residual blocks for super-resolution reconstruction of natural images to achieve more realistic image texture. Miech et al. [15] introduced a learnable nonlinear unit (named context gating) that models the interdependencies between network activations for video classification. Liu et al. [16] proposed gated fully convolutional blocks to improve micro-video scene classification performance. Duta et al. [17] proposed pyramidal convolution (PyConv) with filters of different sizes, types and depths to extract different levels of detail in scene images. Miyato et al. [18] introduced spectral normalization (SN) into GANs to stabilize the training process and improve the performance of GANs.

Inspired by the above works, we propose a novel semi-supervised remote sensing scene classification model based on GANs (SSGAN), which uses gating units (GU), PyConv, a pre-trained network branch, SN and dense residual blocks to enhance the feature extraction capability of the discriminative network. The GU captures the dependencies between features and adaptively focuses on the important features of the input image; PyConv captures detailed features at different levels in scene images; the dense residual blocks replace the convolutions in GANs, improving the quality of the generated images and enhancing the feature discrimination of the discriminative network; and SN stabilizes the training process of GANs to improve their performance. Compared with SAGGAN [4], the proposed SSGAN introduces dense residual blocks to replace the convolutions in SAGGAN to enhance feature discrimination, integrates PyConv into the residual blocks to capture feature details at different levels, and introduces SN into both the discriminative and generative networks to promote the performance of the GAN. Extensive experiments on the EuroSAT [19] and UC Merced [20] datasets show that the proposed SSGAN achieves higher classification accuracy than other state-of-the-art semi-supervised methods based on GANs. The main contributions of this paper are summarized as follows.

  1. A novel GAN-based semi-supervised SSGAN model is proposed, which enhances the feature extraction ability of the discriminative network to improve semi-supervised scene classification accuracy.

  2. GU, dense residual blocks and PyConv are introduced into the discriminative network to adaptively focus on important features and capture details at different levels, achieving a more discriminative feature representation.

  3. SN is integrated into both the generative and discriminative networks to stabilize the training of GANs and improve classification accuracy.

  4. Dense residual blocks are introduced into the generative network to enhance the quality of the generated images for augmenting the training samples.

The remainder of this paper is organized as follows. Section 2 introduces the related works. Section 3 presents the architecture of SSGAN. Section 4 reports the experimental results and discussions. Finally, conclusions are drawn in Sect. 5.

2 Related Works

In this section, we review related works on GANs-based semi-supervised image classification, gating mechanisms, and pyramidal convolution.

2.1 GANs-Based Semi-supervised Image Classification

GANs have been widely used in semi-supervised classification in recent years because their powerful generation ability can effectively augment training samples and improve classification performance. However, traditional GAN-based semi-supervised classification methods use the standard DCGAN structure, which limits the feature discrimination ability of the discriminative network and hinders semi-supervised classification performance. Therefore, we improve the standard DCGAN to further enhance the discrimination ability of the discriminative network. The principle of GAN-based semi-supervised image classification is detailed below.

The standard GAN has two components, the discriminative network D and the generative network G. G synthesizes fake samples G(z) from random noise z, D distinguishes between real and fake samples [5], and the GAN accomplishes its task through a min–max game between G and D. The value function V(D, G) can be expressed as follows:

$$\begin{aligned} \mathop {\min }\limits _{G} \mathop {\max }\limits _{D} \;V(D,G)& = {E_{x \sim {p_{{\text {data}}}}(x)}}[\log \;D(x)] \nonumber \\&+ {E_{z \sim {p_z{(z)}}}}[\log (1 - D(G(z)))], \end{aligned}$$
(1)

where z is a random noise vector drawn from a prior distribution \((z \sim {p_z{(z)}})\), \(p_{{\text {data}}}\) denotes the real data distribution, G(z) is the image generated by G, D(x) represents the probability that x comes from the real data, and D(G(z)) denotes the probability assigned to a sample generated by G. The goal of D is to maximize the probability assigned to real samples, while the goal of G is to increase the probability of its generated samples being classified as real.
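
For concreteness, the following is a minimal PyTorch sketch of the alternating updates implied by Eq. (1); `D`, `G` and the optimizers are assumed placeholders (D outputting a single real/fake logit), and the generator update uses the common non-saturating variant rather than the literal minimization of \(\log (1 - D(G(z)))\).

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def gan_step(D, G, x_real, opt_d, opt_g, z_dim=100):
    """One alternating min-max update following Eq. (1)."""
    b = x_real.size(0)
    ones = torch.ones(b, 1)
    zeros = torch.zeros(b, 1)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    z = torch.randn(b, z_dim)
    x_fake = G(z).detach()              # block gradients into G
    loss_d = bce(D(x_real), ones) + bce(D(x_fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: non-saturating surrogate for min log(1 - D(G(z))).
    z = torch.randn(b, z_dim)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```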

Springenberg et al. [21] proposed categorical generative adversarial networks (CatGAN) for semi-supervised image classification by substituting a multi-class classifier for the binary classifier. Salimans et al. [8] further extended the standard classifier: images generated by G are added as a new category \(y = K + 1\), so the output of D becomes \({\text {logits}} = \{{l_1, l_2 ,\ldots , l_{K + 1}}\}\). These logits are converted into class probabilities using the softmax function. Then, the probability that x is a real sample of the j-th category is as follows:

$$\begin{aligned} p_{{\text {model}}}(y=j|x,j< K+1) = \frac{\exp (l_j)}{\sum _{k=1}^{K+1}\exp (l_k)}, \end{aligned}$$
(2)

and the probability that x is a fake sample is as follows:

$$\begin{aligned} p_{{\text {model}}}(y=K+1|x) = \frac{\exp (l_{K+1})}{\sum _{k=1}^{K+1}\exp (l_k)}. \end{aligned}$$
(3)
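
As a small illustration, Eqs. (2) and (3) amount to a single softmax over the \(K+1\) logits; the following PyTorch snippet uses an assumed K = 10 and toy logit values:

```python
import torch
import torch.nn.functional as F

K = 10                                # assumed number of real classes
logits = torch.randn(1, K + 1)        # D's outputs l_1, ..., l_{K+1} (toy values)
probs = F.softmax(logits, dim=1)

p_real_class = probs[0, :K]           # Eq. (2): probability of each real class j
p_fake = probs[0, K]                  # Eq. (3): probability that x is generated
d_x = 1.0 - p_fake                    # probability that x is real (Eq. (5) below)
```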

In the semi-supervised classification model, labeled samples are trained in a supervised manner, while unlabeled samples are trained in an unsupervised manner. The network framework for GAN-based semi-supervised classification is shown in Fig. 1. As Fig. 1 shows, the inputs of D consist of real labeled samples, real unlabeled samples and samples generated by G. Therefore, the loss of D is as follows:

$$\begin{aligned} L_D& = -E_{x,y \sim {p_{{\text {data}}}(x,y)}}[\log (p_{{\text {model}}}(y|x,y<K+1))]\nonumber \\&-\{ E_{x \sim {p_{{\text {data}}}(x)}}[\log (1-p_{{\text {model}}}(y=K+1|x))]\nonumber \\&+ E_{x \sim {G}}[\log (p_{{\text {model}}}(y=K+1|x))]\}\nonumber \\& = L_{\text {s}}+L_{{\text {un}}}, \end{aligned}$$
(4)

where \(L_{\text {s}}\) denotes the supervised loss and \(L_{{\text {un}}}\) denotes the unsupervised loss. The first term in \(L_{{\text {un}}}\) is the loss of real unlabeled samples, and the second term is the loss of fake samples. For unsupervised learning, D only outputs real or fake without distinguishing categories, so it can be expressed by Eq. (5).

$$\begin{aligned} D(x) = 1- p_{{\text {model}}}(y=K+1|x), \end{aligned}$$
(5)
Fig. 1 The network framework of semi-supervised classification using GANs

Substituting Eq. (5) into \(L_{{\text {un}}}\), we can obtain Eq. (6) as follows.

$$\begin{aligned} L_{{\text {un}}} = -\{E_{x \sim p_{{\text {data}}}}\log D(x)+E_{z \sim {p_z(z)}}\log (1-D(G(z)))\}. \end{aligned}$$
(6)
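
A possible implementation of the discriminator loss in Eqs. (4)–(6) is sketched below, assuming D returns \(K+1\) logits; the function and argument names are ours, not the paper's code:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits_lab, y_lab, logits_unl, logits_fake, K, eps=1e-8):
    """L_D = L_s + L_un from Eqs. (4)-(6); D is assumed to output K+1 logits."""
    # Supervised term L_s: cross-entropy over the K real classes only,
    # i.e., -log p_model(y | x, y < K+1) for labeled samples.
    l_s = F.cross_entropy(logits_lab[:, :K], y_lab)

    # p_model(y = K+1 | x) for the unlabeled real and generated batches.
    p_fake_unl = F.softmax(logits_unl, dim=1)[:, K]
    p_fake_gen = F.softmax(logits_fake, dim=1)[:, K]

    # Unsupervised term L_un (Eq. (6)): real samples are pushed away from
    # class K+1, generated samples toward it.
    l_un = -(torch.log(1.0 - p_fake_unl + eps).mean()
             + torch.log(p_fake_gen + eps).mean())
    return l_s + l_un
```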

The loss \(L_G\) of G can be expressed as follows.

$$\begin{aligned} L_{G}& = \left\| E_{x \sim {p_{{\text {data}}}(x)}}f(x)- E_{\hat{x} \sim {G}}f(\hat{x})\right\| _2^2\nonumber \\&- E_{x \sim {G}}\log [1-p_{{\text {model}}}(y=K+1|x)], \end{aligned}$$
(7)

where f(x) is the activation of an intermediate layer of D, used to match the features of real and generated samples, and \(\hat{x}\) denotes a generated sample. The first term in Eq. (7) is the feature matching term, which drives G to generate samples that match the manifold of real samples, so that D can better distinguish real samples from samples generated by G.
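
A sketch of Eq. (7) under the same assumptions, where the feature-matching term compares batch-mean activations of an intermediate layer of D:

```python
import torch

def generator_loss(feat_real, feat_fake, p_fake_gen, eps=1e-8):
    """Eq. (7): feature matching plus the adversarial term.

    feat_real, feat_fake: intermediate activations f(.) of D for a real
    batch and a generated batch; p_fake_gen: p_model(y = K+1 | G(z)).
    """
    # || E[f(x)] - E[f(x_hat)] ||_2^2, estimated with batch means.
    fm = (feat_real.mean(dim=0) - feat_fake.mean(dim=0)).pow(2).sum()
    adv = -torch.log(1.0 - p_fake_gen + eps).mean()
    return fm + adv
```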

More recently, Lecouat et al. [11] leveraged GANs with manifold regularization for semi-supervised image classification. Li et al. [9] proposed tripleGAN, which includes G, D, and a separate classifier C to simultaneously achieve superior classification performance and good image generation. Dai et al. [10] analyzed why good semi-supervised classification performance and a good generator cannot be obtained at the same time, and based on this analysis proposed BADGAN to improve classification performance on multiple benchmark datasets.

Traditional GAN-based semi-supervised classification methods use the standard DCGAN structure, which limits scene classification performance on remote sensing images with complex scenes. Ledig et al. [14] used GANs with a dense residual structure for super-resolution reconstruction of natural images to achieve more realistic image texture. Inspired by [14], we replace the standard convolutional structure in SFGAN with dense residual blocks.

2.2 Gating Mechanism

More recently, Srivastava et al. [22] leveraged adaptive gating units to train deep neural networks. Miech et al. [15] proposed a context gating unit that captures interdependencies among network activations to improve video classification performance. Liu et al. [16] introduced gated fully convolutional blocks to improve micro-video venue classification performance. Guo et al. [4] proposed a self-attention gating module combining a gating unit and a self-attention block to capture the long-range dependencies among feature maps and focus on crucial regions adaptively. Their experiments demonstrate that the GU can focus on important areas in the scene and suppress the background, effectively improving feature discrimination. Inspired by the above works, we introduce the combination of gating units and residual blocks into GANs to further enhance feature discrimination and improve classification performance.

2.3 Pyramidal Convolution

Convolutional neural networks (CNNs) have become the core architecture for current computer vision applications. The core of CNNs is the convolutional layer, which performs visual recognition by learning spatial kernels. Typically, most CNNs use relatively small kernels (e.g., \(3 \times 3\)), which greatly reduce the number of parameters and the computational complexity. However, small kernels limit the receptive field of CNNs, losing useful details and hurting performance on visual tasks. To address this issue, Yu et al. [23] used dilated convolution to aggregate multi-scale contextual information, effectively improving semantic segmentation accuracy. In addition, Zhao et al. [24] used a pyramid pooling module for scene parsing to extract different levels of detail. Dilated convolutions with irregular spatial pyramid pooling were introduced in [25] to encode global context with image-level features for better semantic segmentation. However, these are additional blocks that must be embedded in the CNN, which remarkably increases model parameters and computational complexity. Duta et al. [17] introduced PyConv to process the input with multi-scale filters. PyConv consists of a pyramidal kernel in which each level contains filters of different types, sizes and depths to capture different levels of detail and enhance feature discrimination. Recently, Guo et al. [3] introduced PyConv into each residual block of the discriminative network to capture different levels of detail from multi-scale filters. Their experiments show that PyConv captures detailed features at different levels and effectively improves feature discrimination. Inspired by the above works, we replace the middle convolution in each residual block of the discriminative network with PyConv to capture more details and further enhance its feature discrimination capability.

3 Proposed Method

The proposed SSGAN is described in detail below.

3.1 Structure of Proposed SSGAN

As described in Sect. 2, the GAN-based semi-supervised classification model has a structure similar to the original GAN [5]. Following SRGAN [14], the dense residual block in SRGAN is used to replace the standard convolution in SFGAN to construct the generative and discriminative networks of SSGAN. Figure 2 illustrates the network framework of the proposed SSGAN. The generative network is composed of four residual blocks. The residual block is shown as the G_Block in Fig. 2 and includes an upsampling layer, a batch normalization (BN), a parametric ReLU activation (PReLU), and two convolutional layers separated by a BN. In addition, a GU is added after the first residual block. The discriminative network contains five residual blocks and one global average pooling (GAP) layer.
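
A minimal sketch of the G_Block just described; the channel count, upsampling mode, and the exact ordering of layers inside the block are our assumptions where Fig. 2 is not explicit:

```python
import torch.nn as nn

class GBlock(nn.Module):
    """Generator residual block: upsample, BN, PReLU, two convs split by BN."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # The skip path must also upsample so shapes match for the addition.
        self.skip = nn.Upsample(scale_factor=2, mode='nearest')

    def forward(self, x):
        return self.body(x) + self.skip(x)   # residual connection
```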

Fig. 2 The framework of the proposed SSGAN

To improve GAN-based semi-supervised classification performance, inspired by SFGAN [13], we extend the original discriminative network. First, a pre-trained Inception V3 network is introduced as a new branch of the discriminative network; it extracts semantic features through fine-tuning, and a GAP operation is then performed on the extracted feature maps. Second, the second convolutional layer of each residual block in D is replaced by PyConv to capture a longer range of contextual information. In addition, GUs are added to the discriminative network to adaptively focus on crucial regions and filter out useless background. Specifically, a GU is placed after the first residual block and after the GAP in the original discriminative network, and another GU is placed after the GAP in the Inception V3 branch. Finally, the two feature vectors from the Inception V3 branch and the original discriminative network are concatenated and fed into the softmax function for semi-supervised scene classification, as sketched below.
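
The two-branch fusion can be pictured as follows; `main_branch`, `inception_branch`, the two GUs and `classifier` are assumed module names for illustration, not the paper's code:

```python
import torch

def discriminator_forward(x):
    # Main branch: residual blocks with PyConv, then GAP (spatial mean) and GU.
    f_main = gu_main(main_branch(x).mean(dim=(2, 3)))
    # Auxiliary branch: fine-tuned Inception V3 features, then GAP and GU.
    f_aux = gu_aux(inception_branch(x).mean(dim=(2, 3)))
    # Concatenate both feature vectors and map to K + 1 logits for the softmax.
    return classifier(torch.cat([f_main, f_aux], dim=1))
```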

In addition, instability in the training of GANs is a major factor affecting performance. Miyato et al. [18] stabilized the training of GANs using SN to constrain the Lipschitz constant of the discriminative network to satisfy 1-Lipschitz continuity. Zhang et al. [26] demonstrated that SN achieves better performance when introduced into both the generative and discriminative networks. Inspired by [18, 26], we apply the spectral norm in the residual blocks of both the generative and discriminative networks simultaneously to ensure stable SSGAN training.
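
In PyTorch this amounts to wrapping each weight layer of both networks with `torch.nn.utils.spectral_norm`; the layer shapes below are illustrative:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Each wrapped layer has its weight divided by its largest singular value
# at every forward pass, enforcing (approximate) 1-Lipschitz continuity.
conv = spectral_norm(nn.Conv2d(64, 64, kernel_size=3, padding=1))
fc = spectral_norm(nn.Linear(512, 11))   # e.g., K + 1 = 11 outputs for EuroSAT
```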

3.2 Structure of Gating Unit

Most GAN-based semi-supervised methods use standard convolutions to construct the discriminative network. The convolution operation is limited by its receptive field and can hardly capture the dependencies among feature maps, which hurts semi-supervised classification performance.

Inspired by the gating mechanism of [16], the GU is designed and placed after the first residual block of the original discriminative network and after the GAP of both branches to enhance the feature description capability. Figure 3 illustrates the structure of the GU; its derivation is briefly described below.

Fig. 3 The structure of GU

The input of the GU can be any intermediate feature map F from the discriminative network, and the GU converts F into a new feature \(F_{GU}\). The derivation is as follows.

$$\begin{aligned} f_{GU}(F)& = \sigma (fc(F)), \end{aligned}$$
(8)
$$\begin{aligned} F_{GU}& = f_{GU}(F) \odot F, \end{aligned}$$
(9)

where \(\sigma (\cdot )\) denotes the sigmoid function, \(f_{GU}(\cdot )\) denotes the gating function, whose output after the sigmoid operation is a weight matrix with values in the range [0, 1], \(fc(\cdot )\) represents a fully connected layer, \(F_{GU}\) is the output of the gating unit, and \(\odot \) denotes the element-wise product. The gating unit effectively captures the dependencies between feature maps, suppresses irrelevant background, and enhances the feature representation.
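
A direct PyTorch rendering of Eqs. (8)–(9) for a feature vector input (e.g., after GAP); applying it to a spatial feature map would additionally require flattening or a per-channel gate, which this sketch leaves out:

```python
import torch
import torch.nn as nn

class GatingUnit(nn.Module):
    """Sketch of the GU: F_GU = sigmoid(fc(F)) * F."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)        # fc(.) in Eq. (8)

    def forward(self, F_in):
        gate = torch.sigmoid(self.fc(F_in))  # Eq. (8): weights in [0, 1]
        return gate * F_in                   # Eq. (9): element-wise reweighting
```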

3.3 Structure of Pyramidal Convolution

Owing to the complexity of remote sensing scene images, different ground objects appear at different sizes in different scenes, and even the same ground object within one scene may appear at different sizes, so the traditional \(3 \times 3\) convolution can hardly capture this diversity effectively. Duta et al. [17] proposed PyConv, which introduces pyramidal kernels with different filters to extract different levels of detail; PyConv with n groups of different kernels is illustrated in Fig. 4. In this paper, the residual block D_Block of the discriminative network is built on PyConv; its structure is shown at the bottom left of Fig. 2, where the second convolution is replaced with PyConv. Following the structure of PyConvHGResNet, we use four groups of PyConv with different filter sizes to capture different levels of detail.

Fig. 4 The structure of PyConv [17]

In Fig. 4, FM\(_i\) denotes the feature map of the middle layer, which is the input of PyConv, and FM\(_o\) denotes the output of PyConv. PyConv consists of four groups of filters with different sizes and depths, and residual blocks with PyConv can capture different levels of feature detail to enhance the feature representation ability of the discriminative network.
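
The following sketch shows one way to realize a four-level PyConv in PyTorch; the kernel sizes and group counts follow typical settings from [17] and are assumptions here, not the exact configuration of SSGAN:

```python
import torch
import torch.nn as nn

class PyConv(nn.Module):
    """Four parallel grouped convolutions with growing kernel sizes,
    whose outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        branch_ch = out_ch // 4
        self.levels = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, 3, padding=1, groups=1),
            nn.Conv2d(in_ch, branch_ch, 5, padding=2, groups=4),
            nn.Conv2d(in_ch, branch_ch, 7, padding=3, groups=8),
            nn.Conv2d(in_ch, branch_ch, 9, padding=4, groups=16),
        ])

    def forward(self, x):
        return torch.cat([lvl(x) for lvl in self.levels], dim=1)
```

For example, `PyConv(64, 64)(torch.randn(1, 64, 32, 32))` yields a tensor of shape (1, 64, 32, 32): same spatial size, with each quarter of the output channels produced by a different kernel size.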

4 Experimental Results and Analysis

In this section, comprehensive experiments are conducted on the EuroSAT and UC Merced benchmark datasets to validate the effectiveness of the proposed SSGAN method.

4.1 DataSet Description

EuroSAT dataset: EuroSAT [19] is a recently released remote sensing image dataset acquired by the Sentinel-2 satellite, which includes 10 different categories. In total, the dataset consists of 27,000 images of \(64 \times 64\) pixels, with ground sampling distances (GSD) ranging from 10 to 60 m.

UC Merced dataset: UC Merced [20] includes 21 different categories and has become a benchmark dataset for remote sensing image classification. Each category has 100 images, and all images are \(256 \times 256\) pixels in size.

To validate the semi-supervised classification performance of SSGAN, the EuroSAT dataset was split as suggested in [13, 19]: 80% of the samples form the training set and the rest are further divided into 90% for testing and 10% for validation, i.e., 21,600 samples for training, with the remaining 5400 samples divided into 4860 for testing and 540 for validation. The UC Merced dataset was divided in the same ratio. The number M of labeled training samples (\(X_l\)) is set in accordance with SAGGAN [4]. More specifically, M is set to 100, 1000, 2000, and 21,600 on the EuroSAT dataset and to 100, 200, 400, and 1680 on the UC Merced dataset, and the remaining samples are treated as unlabeled (\(X_u\)).

4.2 Experimental Setup and Evaluation Metrics

All experiments were performed with the PyTorch framework on a 64-bit Ubuntu 16.04 server with an 8-core Intel Gold 6048 CPU and four TITAN V GPUs. During training, the parameters follow SFGAN [13] and SRGAN [14]. SSGAN is trained with the adaptive moment estimation (Adam) optimizer using \(\beta _1=0.5\) and \(\beta _2=0.9\). The mini-batch size is set to 128, and the number of epochs is set to 200. The initial learning rate is 0.0003, with a decay rate of 0.9. All comparison methods follow their original settings to ensure impartiality and objectivity.

In the following experiments, the proposed SSGAN is evaluated using the overall accuracy (OA) and the confusion matrix (CM) on the EuroSAT and UC Merced datasets.

OA: OA is the number of correctly classified samples divided by the total number of samples:

$$\begin{aligned} {\text {OA}}=\frac{T}{T+F}, \end{aligned}$$
(10)

where T represents the number of correctly classified samples, and F the number of misclassified samples.

CM: The CM is a table representing the confusion among different categories. Each row denotes a real category, and each column a predicted category; the entry at a row-column intersection is the proportion of samples of the real category classified as the column category, from which one can easily observe which categories are confused and to what degree.
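
Both metrics are straightforward to compute; a small NumPy sketch (the row normalization assumes every true class appears at least once):

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Eq. (10): correctly classified samples over all samples."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def confusion_matrix(y_true, y_pred, num_classes):
    """Row = true class, column = predicted class, normalized per row."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm / cm.sum(axis=1, keepdims=True)
```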

In addition, to ensure reliability, all experimental results are the mean of 10 repeated experiments with randomly selected samples.

4.3 Experimental Analysis

In this section, the classification accuracies of the proposed SSGAN and several representative methods are compared on the EuroSAT and UC Merced datasets. CNNs (from scratch) is a supervised classification method, and Inception V3 is a transfer learning method. The rest are GAN-based semi-supervised classification methods: SAGGAN [4], tripleGAN [9], BADGAN [10], SFGAN [13], FMGAN [8], and REGGAN [11]. The overall accuracies of these methods are presented in Table 1.

Table 1 OA (%) of SSGAN and other compared approaches. Bold marks the highest value and underline the second highest

One can see the following results from Table 1.

  1. The proposed SSGAN achieves the highest OA on both datasets because the GU, PyConv, SN, Inception V3 branch and dense residual blocks further enhance feature discrimination. SAGGAN ranks second thanks to its self-attention gating module and gating unit. SFGAN [13] ranks third, probably owing to its use of a pre-trained Inception V3 network to strengthen the feature discrimination of the discriminative network. Inception V3 outperforms all methods other than SAGGAN, SFGAN and the proposed SSGAN because it is pre-trained on the large-scale ImageNet dataset and can extract discriminative features from remote sensing images through fine-tuning. SAGGAN, SFGAN and SSGAN outperform Inception V3 because these three methods introduce a pre-trained Inception V3 branch into the discriminative network. CNNs (from scratch) has the lowest OA, likely because a CNN trained from scratch can use only the labeled samples in a fully supervised manner.

  2. The four methods FMGAN, tripleGAN, BADGAN and REGGAN use the standard DCGAN structure, and their OA is significantly lower than that of SSGAN, SAGGAN and SFGAN. Among them, tripleGAN achieves the highest OA thanks to its separate classifier network. FMGAN effectively improves OA using the feature matching term, scoring slightly below tripleGAN. REGGAN has the lowest OA of the four, which indicates that manifold regularization is less effective for scene classification of remote sensing images with complex scenes.

  3. For all methods, the more labeled samples, the higher the OA. Furthermore, the proposed SSGAN at \(M = 1000\) exceeds CNNs (from scratch) at \(M = 21{,}600\) on the EuroSAT dataset. Interestingly, SSGAN still has higher accuracy than the pre-trained Inception V3 network even at \(M = 21{,}600\). This may be because SSGAN additionally trains on the samples generated by G, which are not available to CNNs (from scratch) or the pre-trained Inception V3. The same trend is found on the UC Merced dataset. These results demonstrate that SSGAN can achieve higher OA using fewer labeled samples.

  4. On the EuroSAT dataset, the OA of SSGAN reaches 78.56%, 89.02%, 91.53% and 95.50% at \(M = 100\), 1000, 2000 and 21,600, which is 1.77%, 0.30%, 0.87% and 1.18% higher than SAGGAN, respectively. This is probably mainly due to the introduction of dense residual blocks, PyConv, and SN in SSGAN. Similarly, the OA is 9.96%, 2.92%, 2.53% and 2.30% higher than SFGAN, respectively. These results show that SSGAN is indeed effective. In particular, the OA of SSGAN at \(M = 100\) is 9.96% and 1.77% higher than SFGAN and SAGGAN, respectively, which indicates that SSGAN can obtain higher performance with fewer labeled samples. Similar results can be seen on the UC Merced dataset.

  5. On the UC Merced dataset, the OA of SSGAN is only 59.52%, 76.13%, 83.86% and 91.02% at \(M = 100\), 200, 400 and 1680, because the total number of training samples is insufficient, which limits GAN-based semi-supervised classification performance. Nevertheless, SSGAN shows the highest OA among the compared methods.

To further evaluate the performance of the proposed SSGAN, confusion matrices were generated at \(M = 100\), 1000, 2000 and 21,600 on the EuroSAT dataset. The following observations can be made from the confusion matrices in Fig. 5.

  1. As the number M of labeled samples increases, the accuracy of each category increases accordingly, while the confusion ratio decreases. The accuracy of 8 out of 10 categories exceeds 80% at \(M = 100\), which indicates that the proposed SSGAN obtains high classification accuracy with few labeled samples.

  2. Comparing the confusion matrices at \(M = 100\) and \(M = 2000\), the accuracy of categories 1, 2, 6, 7, and 8 improves by 23%, 34%, 28%, 21%, and 21%, respectively, which indicates that the classification accuracy of each category increases significantly as the number of labeled samples grows.

  3. At \(M = 21{,}600\), the classification accuracy of 9 out of 10 categories exceeds 95%, which shows that the proposed SSGAN achieves good semi-supervised classification performance.

Fig. 5 The confusion matrices generated by the proposed SSGAN on the EuroSAT dataset at \(M = 100\), 1000, 2000 and 21,600

In short, the proposed SSGAN is effective.

4.4 Ablation Experiments

Compared with other semi-supervised image classification methods, the proposed SSGAN achieves the best performance by enhancing the discriminative network. In this section, the effectiveness of the pre-trained Inception V3 branch, GU, PyConv and SN is verified through four variants of the proposed SSGAN: (1) SSGAN-GU is the variant without GU, (2) SSGAN-I is the variant without the pre-trained Inception V3 branch, (3) SSGAN-P is the variant without PyConv, and (4) SSGAN-SN is the variant without SN.

For a fair comparison, extensive experiments were conducted under the same experimental setup on the same datasets. As the results in Table 2 show, the Inception V3 branch, SN, PyConv, and GU all contribute to SSGAN performance on both datasets. Among them, SSGAN-I has the lowest OA on both datasets, which indicates that the Inception V3 branch is the most effective component: it extracts high-level semantic information from scene images, and the extracted semantic features are then fed into the GU to further enhance feature discrimination. The second most effective component is PyConv, which captures different levels of detail through multiple groups of convolutions with different kernel sizes. The third is GU, and the least effective is SN, though the latter still improves accuracy noticeably on both datasets. Interestingly, SSGAN-SN has significantly lower accuracy than SSGAN when the number of labeled samples M is larger, which suggests that SN is more beneficial with more labeled training samples.

Table 2 The comparison results of ablation study

5 Conclusion

In this paper, we propose a new GAN-based semi-supervised method for remote sensing scene classification using dense residual blocks, GU, PyConv, a pre-trained Inception V3 network and SN. The proposed method achieves high semi-supervised classification accuracy using a few labeled and numerous unlabeled samples. Specifically, the pre-trained Inception V3 network is introduced into the discriminative network as a new branch to extract semantic features; the GU and PyConv are integrated into the dense residual blocks to strengthen the feature discrimination capability of the discriminative network; and SN is introduced into both the generative and discriminative networks to stabilize the training of GANs and improve semi-supervised classification accuracy. Comprehensive experimental results show that the proposed approach achieves higher overall accuracy than other comparison methods, especially when only a few labeled samples are available. In the future, we plan to investigate unsupervised remote sensing image scene classification based on GANs, which is a more difficult problem in computer vision.