1 Introduction

1.1 Background

The quality and quantity of training data are crucial to the effectiveness of models and algorithms in the dynamic field of artificial intelligence and machine learning. Convolutional neural networks (CNNs) have gained significant traction as the most effective method for image classification, yet existing CNN models still have substantial limitations. Datasets commonly lack sufficient training samples or exhibit an uneven class distribution [1], and creating an extensive image collection takes time and resources. Data augmentation has therefore become a potent strategy for improving the resilience, generalizability, and overall efficacy of machine learning models [2].

Data augmentation can be used to meet requirements on both the type and the amount of training data. In classification tasks, augmented data can also address the challenge of classes exhibiting excessive similarity or substantial disparity. Data augmentation is particularly significant when a model is used to analyse parts of an image. Suppose we want to extract the details of a ship from a satellite image: because the ship's location, shape, and size constantly change, a large amount of data is required, and the dataset needs to grow accordingly. To train a model for semantic segmentation, data must be supplied in pairs consisting of the original image and its semantically labelled counterpart, which means two corresponding images must be generated simultaneously [3,4,5]. A progressive remote sensing ship image data augmentation approach was developed using ship simulation samples and a neural style transfer (NST)-based network. The procedure has two steps. First, a visible-light imaging simulation system produces ship simulation samples from images taken in the actual environment, which increases the diversity of the training dataset. Second, a newly created NST-based network called Sim2RealNet transfers the simulated style to the real domain using a set of authentic images [6]. Several ship targets were used to assess this approach for classifying remote sensing images.

Conventional data augmentation methods, including geometric transformations such as flipping, translation, and rotation, are used to generate augmented data. The enhanced dataset is then used for training, allowing an improved deep model to be produced [7]. A novel data augmentation technique, random image cropping and patching (RICAP), randomly crops four images and patches them together to create a new training image. RICAP also mixes the class labels of the four images, which yields the additional benefit of soft labels. RICAP has been evaluated with contemporary convolutional neural networks, including the shake-shake regularization model, which is considered at the forefront of the field [1]. Due to global warming, forest fires have become a major cause of ecological harm. Because of its rapid updates and extensive coverage, remote sensing (RS) is essential for monitoring forest fires. A major factor limiting classifier performance in smoke scene recognition is the loss of significant image characteristics under existing basic mixed sample data augmentation (MSDA) algorithms. CAMMix is a new MSDA technique that chooses the mixing region and mixing intensity according to a significance map. Using an auxiliary (AUX) classifier, CAMMix produces a mixed mask that integrates class significance, so that the distribution of the mixed sample closely resembles that of the original data [8].

A data augmentation technique such as CutOut with iterative spatial–spectral training (ISST) [9,10,11] randomly masks a square region of the input before training begins; this change can improve both the robustness of a convolutional neural network to errors and its overall performance. Generative adversarial networks (GANs) have also been used for data augmentation in conventional RGB and satellite images, providing an unsupervised approach to data generation [12]. A generative model typically consists of a generator and a discriminator that compete like game players: the generator learns to produce visually authentic images that deceive the discriminator, while the discriminator learns to distinguish genuine images from artificially generated ones. Examples of GANs that have been applied to satellite imagery include DCGAN, CycleGAN, and SSSGAN, and the progressive growing GAN technique has generated high-resolution images [13]. Because these methods generate fresh samples that are rapidly modified and stitched together at the image level, it can be difficult to determine where one object ends and another begins. Since boundaries play a significant role in semantic segmentation tasks, the approaches described so far are not well suited to augmenting samples for such tasks. Owing to their capacity to produce accurate results, GANs have been among the most prominent unsupervised methods in recent years [14,15,16]. For instance, DCGAN and Marta GAN have been proposed to improve the quality of images acquired by remote sensing; in contrast to the deep convolutional generative adversarial network (DCGAN), the Marta generative adversarial network (Marta GAN) can generate images with greater detail and resolution. Because of the inherent ambiguity and complexity of remote sensing images, GAN-based augmentation algorithms struggle to learn the distribution properties of the target objects, ultimately leading to a low-quality augmentation effect. For instance, the generated images may be of poor quality and lack most of the components generally agreed to be essential; such images can instead be augmented via UNet and its counterparts [17,18,19]. In addition, GAN-based augmentation algorithms cannot create the matching semantic tag images required for semantic segmentation, which are typically annotated manually at significant expense. Because semantic segmentation is such an important endeavour [20,21,22], this is a considerable disadvantage. It would therefore be ideal to develop a technique for enhancing remote sensing images that efficiently addresses annotation complexity while minimizing cost. Recently, convolutional neural networks (CNNs) [23,24,25] have been shown to effectively address the image translation challenge. In this paper, we show how to use data augmentation as a pre-processing approach for training a deep CNN and empirically evaluate the efficacy of our data augmentation strategy for improving CNN representational power.

1.2 Motivation

As discussed in Sect. 1.1 above, a deep model's representational ability depends significantly on the diversity of its training data. However, the most advanced deep learning techniques in remote sensing focus mainly on building new multilayer representations; how the size and variety of the training dataset affect their performance has yet to be examined. Deep learning cannot be used to its full potential in remote sensing because there is not enough training data. From the discussion above, it is clear that researchers have produced a wide range of feature representation models, most of which are either computationally demanding or deliver lower classification performance.

1.3 Contribution

This paper addresses the fundamental data limitations that prevent deep learning from being used to its full potential for classifying remote sensing images. We describe a high-density feature representation model that efficiently augments satellite images, making remote sensing datasets larger and more varied, and then uses the augmented dataset to train a deep CNN. The proposed model initially collects a small set of available satellite images and represents them via a hybrid of long short-term memory (LSTM) and gated recurrent unit (GRU) features. These features are processed via an iterative genetic algorithm (IGA), which identifies optimal augmentation methods for the extracted feature sets. An iterative fitness function is modelled to analyse the efficiency of this optimization process, assisting in the incremental improvement of the classification process. The function uses an accuracy- and precision-based feedback mechanism that helps tune the hyperparameters of the proposed LSTM & GRU feature extraction process.

In Sect. 4, the proposed model's accuracy, precision, and recall are evaluated and compared with those of conventional augmentation methods. The paper concludes with a few insightful observations and suggestions for further enhancing the proposed augmentation model's performance in various use cases.

2 Brief review of image augmentation models

A wide variety of deep learning-based techniques have been proposed for image augmentation, and they vary in their quantitative performance measures and qualitative characteristics. Deep CNNs have shown promising results when processing images, but they are prone to fitting the training data too exactly. Data augmentation techniques can improve existing datasets without introducing unintended bias. However, for modern CNN designs with large numbers of parameters, traditional data augmentation techniques alone are insufficient. Table 1 presents additional findings about data augmentation, gathered from diverse research papers.

Table 1 Overview of different data augmentation techniques

3 Design of the proposed high-density feature representation model for effective augmentation of satellite images

As per the analysis of existing feature representation models for augmenting satellite images, most have higher computational complexity or lower classification efficiency. To address these problems, this section discusses the design of a high-density feature representation model for efficient augmentation of satellite images. As shown in Fig. 1, the proposed model initially collects a small set of available satellite images and represents them via a hybrid of long short-term memory (LSTM) and gated recurrent unit (GRU) features. These features are processed via an iterative genetic algorithm (IGA), which identifies optimal augmentation methods for the extracted feature sets. An iterative fitness function is modelled to analyse the efficiency of this optimization process, assisting in the incremental improvement of the classification process. The function uses an accuracy- and precision-based feedback mechanism that helps tune the hyperparameters of the proposed LSTM & GRU feature extraction process.

Fig. 1 Flow of the proposed feature representation process

At first, the proposed model extracts multiple different sets of features from each image. These feature sets are extracted via a novel combination of long short-term memory (LSTM) and gated recurrent unit (GRU)-based representation techniques, chosen for their complementary feature representation characteristics. The fused feature extraction model is depicted in Fig. 2, where different variance operations are combined with tangent operations to identify multimodal feature sets.

Fig. 2 Design of the LSTM & GRU-based feature extraction process

The model initially extracts initialization (i), temporal feature (f), and temporal output (o) features via Eqs. 1, 2, and 3 as follows,

$$i = {\text{var}} \left( {x_{in} *U^{i} + h_{t - 1} *W^{i} } \right)$$
(1)
$$f = {\text{var}} \left( {x_{in} *U^{f} + h_{t - 1} *W^{f} } \right)$$
(2)
$$o = {\text{var}} \left( {x_{in} *U^{o} + h_{t - 1} *W^{o} } \right)$$
(3)

where \(U\) and \(W\) represent variance constants for the LSTM & GRU processes, while \(h\) is a kernel matrix used for the activation of these features [38, 39]. These features are combined to form a temporal convolutional feature set (C) via Eq. 4,

$$C_{t}^{\prime } = \tanh \left( {x_{in} *U^{g} + h_{t - 1} *W^{g} } \right)$$
(4)

All these features are used to generate the output feature matrix via Eq. 5,

$$T_{{{\text{out}}}} = {\text{var}} \left( {f_{t} *x_{in} \left( {t - 1} \right) + i*C_{t}^{\prime } } \right)$$
(5)

Based on this output feature matrix, a new kernel matrix is generated via Eq. 6,

$$h_{{{\text{out}}}} = \tanh \left( {T_{{{\text{out}}}} } \right)*o$$
(6)

These temporal output features are further processed via GRU-based operations. To perform these operations, update (z) and reset (r) gate metrics are estimated via Eqs. 7 and 8 as follows,

$$z = {\text{var}} \left( {W_{z} *\left[ {h_{{{\text{out}}}} * T_{{{\text{out}}}} } \right]} \right)$$
(7)
$$r = {\text{var}} \left( {W_{r} *\left[ {h_{{{\text{out}}}} * T_{{{\text{out}}}} } \right]} \right)$$
(8)

These metrics are combined via Eqs. 9 and 10 to estimate the updated kernel matrix and the output feature matrix as follows,

$$h_{t}^{\prime } = \tanh \left( {W*\left[ {r*h_{{{\text{out}}}} * T_{{{\text{out}}}} } \right]} \right)$$
(9)
$$x_{{{\text{out}}}} = \left( {1 - z} \right)*h_{t}^{\prime } + z*h_{{{\text{out}}}}$$
(10)

These feature sets can represent input images as multimodal sets. However, the efficiency of this feature extraction technique must be validated in order to estimate efficient augmentation operations. To perform this task, an iterative genetic algorithm (IGA) is developed, which assists in evaluating high-variance constants for the fused feature extraction process. The IGA model works as per the following process. To start the optimizer, set the following constants:

  • Total iterations used for generation & configuration of solutions (\({N}_{i}\))

  • Total solutions that will be generated & reconfigured (\({N}_{s}\))

  • Rate at which the model will learn from other solutions (\({L}_{r}\))

  • Initially, generate \({N}_{s}\) solutions as per the following process,

  • For each satellite image, generate rotated, zoomed, width shifted, height shifted, and scaled images via augmentation operations.

  • Set up the values of \(U\) and \(W\) as per Eqs. 11 and 12,

    $$U = U\left( {{\text{Old}}} \right) \pm f*{\text{STOCH}}\left( {L_{r} , 1} \right)$$
    (11)
    $$W = W\left( {{\text{Old}}} \right) \pm f*{\text{STOCH}}\left( {L_{r} , 1} \right)$$
    (12)

    where \(W(Old)\) and \(U(Old)\) represent the old values of the LSTM & GRU constants, and \(STOCH\) represents the generation of number sets via a stochastic Markovian process.

  • Using an iterative convolutional neural network (CNN), which is described in the later sections of this text, classify the satellite images based on these values by evaluating the LSTM and GRU features for each of the augmented feature sets.

  • After classification, estimate solution fitness as per Eq. 13,

    $$f = \mathop \sum \limits_{i = 1}^{{N_{{{\text{images}}}} }} \frac{{t_{p} }}{{t_{p} + t_{n} }} + \frac{{t_{p} + t_{n} }}{{t_{p} + t_{n} + f_{p} + f_{n} }} + \frac{{t_{p} + f_{p} }}{{t_{p} + t_{n} + f_{p} }}$$
    (13)

    where \({t}_{p}, {t}_{n}, {f}_{p}\), and \({f}_{n}\) represent the true positive, true negative, false positive, and false negative values for the classification operations.

  • Repeat this process for each solution, and then use Eq. 14 to compute a solution fitness threshold.

    $$f_{th} = \mathop \sum \limits_{i = 1}^{{N_{s} }} f_{i} *\frac{{L_{r} }}{{N_{s} }}$$
    (14)
  • Once these solutions are generated, check whether \(f>{f}_{th}\); mark the solutions that satisfy this condition as 'not to be mutated', and mark all other solutions as 'to be mutated'.

  • Scan all solutions for \({N}_{i}\) iterations, and modify the solutions marked as 'to be mutated'.

  • At each iteration, update the fitness values and the solution fitness threshold. The proposed algorithm is depicted in Table 2, and an illustrative code sketch of this loop is given after Table 2.

Table 2 Algorithm of proposed methodology
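As a complement to Table 2, a minimal Python sketch of the IGA loop is shown below. It is a sketch under stated assumptions, not a definitive implementation: the evaluate() routine is a stub standing in for the LSTM & GRU feature extraction and CNN classification of this section, a uniform draw stands in for the stochastic Markovian \(STOCH(L_r, 1)\) process, and the AUG_OPS list, the solution encoding, and all constants are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Candidate augmentation operations, as listed in the generation step above
AUG_OPS = ["rotate", "zoom", "width_shift", "height_shift", "scale"]

def fitness(tp, tn, fp, fn):
    # Eq. 13, written here for a single evaluated image set
    return tp / (tp + tn) + (tp + tn) / (tp + tn + fp + fn) \
        + (tp + fp) / (tp + tn + fp)

def new_solution(U_old, W_old, Lr, f_prev=1.0):
    # Eqs. 11-12: perturb the variance constants U and W by a stochastic
    # step scaled by the learning rate Lr (a uniform draw as a stand-in)
    step = rng.choice([-1.0, 1.0]) * f_prev * rng.uniform(0.0, Lr)
    return U_old + step, W_old + step, rng.choice(AUG_OPS)

def evaluate(U, W, op):
    # Stub: a real run would augment the images with `op`, extract the
    # LSTM & GRU features under (U, W), classify them with the CNN, and
    # count the resulting confusion-matrix entries
    tp, tn = rng.integers(40, 60), rng.integers(40, 60)
    fp, fn = rng.integers(1, 10), rng.integers(1, 10)
    return fitness(tp, tn, fp, fn)

def iga(Ni=20, Ns=12, Lr=0.5):
    # Generate Ns initial solutions (U = W = 1.0 is an arbitrary start)
    sols = [new_solution(1.0, 1.0, Lr) for _ in range(Ns)]
    fits = [evaluate(*s) for s in sols]
    for _ in range(Ni):
        f_th = sum(fits) * Lr / Ns              # Eq. 14: fitness threshold
        for k in range(Ns):
            if fits[k] <= f_th:                 # marked 'to be mutated'
                sols[k] = new_solution(sols[k][0], sols[k][1], Lr, fits[k])
                fits[k] = evaluate(*sols[k])
    best = int(np.argmax(fits))                 # highest-fitness solution
    return sols[best], fits[best]
```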

When all possible solutions have been found, the one with the highest fitness level is selected, and its features are used to classify satellite images. This classification is done via a convolutional neural network (CNN), depicted in Fig. 3, wherein various convolutional, max pooling, and dropout layers are connected to estimate augmented feature sets. The CNN processes the LSTM & GRU features and classifies them into land-specific categories. The designed CNN model initially extracts convolutional feature sets from the LSTM & GRU feature sets via Eq. 15, which assists in extracting a large number of features.

$${\text{Conv}}_{{{\text{out}}_{i,j} }} = \mathop \sum \limits_{{a = - \frac{m}{2}}}^{\frac{m}{2}} \mathop \sum \limits_{{b = - \frac{n}{2}}}^{\frac{n}{2}} F_{{\text{LSTM, GRU}}} \left( {i - a, j - b} \right)*{\text{ReLU}}\left( {\frac{m}{2} + a,\frac{n}{2} + b} \right)$$
(15)

Fig. 3 Design of the CNN algorithm for augmented feature set classification

Here, m and n represent the window size for the convolutional operations, a and b represent the stride offsets, and ReLU represents a rectified linear unit model for the activation of the feature sets. The parameters are listed in Table 3, and the design is shown in Fig. 4.
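A direct, unoptimized reading of Eq. 15 can be sketched as follows; the feature map F and kernel K are hypothetical inputs, and applying the ReLU to the kernel responses follows the equation as written.

```python
import numpy as np

def conv_feature(F, K):
    """Literal sketch of Eq. 15: correlate the fused LSTM/GRU feature map F
    with an m x n kernel K, passing the kernel responses through ReLU."""
    m, n = K.shape
    H, W = F.shape
    Kr = np.maximum(K, 0.0)                    # ReLU on the kernel responses
    out = np.zeros_like(F)
    for i in range(m // 2, H - m // 2):
        for j in range(n // 2, W - n // 2):
            acc = 0.0
            for a in range(-(m // 2), m // 2 + 1):
                for b in range(-(n // 2), n // 2 + 1):
                    acc += F[i - a, j - b] * Kr[m // 2 + a, n // 2 + b]
            out[i, j] = acc
    return out

# Hypothetical usage on a small random feature map with a 3 x 3 kernel
rng = np.random.default_rng(1)
feature_map = conv_feature(rng.normal(size=(16, 16)), rng.normal(size=(3, 3)))
```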

Table 3 Parameters used in the training model

Fig. 4 Design of FCNN layers

The extracted features are passed to a threshold engine, which assists in estimating the variance threshold via Eq. 16,

$$f_{{{\text{th}}}} = \left( {\frac{1}{{X_{k} }}*\mathop \sum \limits_{{x \in X_{k} }} x^{{p_{k} }} } \right)^{{1/p_{k} }}$$
(16)

where \(X_{k}\) and \(p_{k}\) represent the features' intensity and probability levels tuned by the CNN process.
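As a brief worked instance of Eq. 16, assume hypothetical feature intensities \(X_{k} = \left\{ {1, 2, 3} \right\}\) and \(p_{k} = 2\), interpreting \(1/X_{k}\) as one over the number of features:

$$f_{{{\text{th}}}} = \left( {\frac{1}{3}\left( {1^{2} + 2^{2} + 3^{2} } \right)} \right)^{1/2} = \sqrt {\frac{14}{3}} \approx 2.16$$

so features with intensity below approximately 2.16 would be removed by the subsequent pooling filter.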

The max pooling layer removes all features with \(f<{f}_{th}\), while passing the others to the consecutive layers. A fully connected neural network (FCNN)-based model is used to classify the features collected at the final layer, aiding in the estimation of the various image classes. This FCNN layer combines different weights (w) and biases (b) with a SoftMax-based activation function as per Eq. 17,

$$c_{out} = {\text{SoftMax}}\left( {\mathop \sum \limits_{i = 1}^{{N_{f} }} f_{i} *w_{i} + b} \right)$$
(17)

where \({N}_{f}\) represents the number of features extracted by the fused layers. The suggested model can categorize images with high efficiency since it uses a CNN. The following section of this paper evaluates these efficiency levels and compares them with those of standard models.
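Before moving to the results, a short Python sketch shows how Eqs. 16 and 17 fit together: the variance threshold acts as a pooling filter, and a SoftMax-based FCNN head produces the class scores. The weights, bias, exponent p, and feature sizes are all assumed for illustration.

```python
import numpy as np

def variance_threshold(X, p):
    # Eq. 16: generalized power mean of the feature intensities X
    return np.mean(np.power(X, p)) ** (1.0 / p)

def softmax(v):
    e = np.exp(v - v.max())                 # numerically stable SoftMax
    return e / e.sum()

def classify(features, Wt, b, p=2.0):
    """Sketch of the threshold engine, pooling filter, and FCNN head
    (Eqs. 16-17); `features` are the fused LSTM & GRU feature values."""
    f_th = variance_threshold(features, p)
    kept = np.where(features >= f_th, features, 0.0)  # drop f < f_th
    return softmax(kept @ Wt + b)                     # Eq. 17: class scores

# Hypothetical usage: 64 non-negative features, five land-cover classes
rng = np.random.default_rng(0)
feats = np.abs(rng.normal(size=64))   # the power mean assumes non-negative X
probs = classify(feats, rng.normal(size=(64, 5)), np.zeros(5))
print(int(probs.argmax()))            # index of the predicted class
```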

4 Result analysis and comparison with standard augmentation techniques

The proposed model represents input images as multimodal feature sets by combining LSTM- and GRU-based feature extraction algorithms. An effective iterative genetic algorithm (IGA) is trained using the collected features to help identify high-density augmentation operations and feature constants. As a result of these operations, the proposed model can improve the accuracy, precision, and recall of different satellite image classification applications. This model was verified on the following datasets to evaluate its performance:

  • Copernicus image sets obtained from Kaggle

  • Sentinel image sets obtained from Google Earth Engine

  • IEEE DataPort sets for different areas

These sets were aggregated to form a total of 100,000 images, of which 60% were used to train the model, while 20% each were used for validation and testing purposes. Based on this evaluation, the classification accuracy (Ac) was compared with ISST [40], GAN [41], and UNet [17] with respect to the total number of validation and test images (TVTI) for different applications. The results of these augmentations can be observed in Fig. 5a–c, wherein different satellite images were used for the classification process.

Fig. 5 a Use of different augmentation operations. b Classification of the augmented image sets. c Use of the augmentation for different application sets

The accuracy of this model is tabulated in Table 4 as follows,

Table 4 Accuracy obtained during the classification process

Considering this evaluation and its visualization in Fig. 6, it can be seen that the proposed model can increase classification accuracy by 16.4% compared to ISST [40], 17.1% compared to GAN [41], and 13.6% compared to UNet [17], making it highly beneficial for a range of real-time classification applications. The reason for this enhancement is the use of accuracy during the optimization of fitness, which assists in estimating high-efficiency augmented feature sets. Table 5 shows the precision levels as follows:

Fig. 6 Accuracy obtained during the classification process

Table 5 Precision obtained during the classification process

Considering this evaluation and its visualization in Fig. 7, it can be seen that the proposed model can increase classification precision by 16.1% compared to ISST [40], 14.5% compared to GAN [41], and 12.2% compared to UNet [17], making it highly beneficial for a range of real-time classification applications. The reason for this precision enhancement is using LSTM & GRU during feature extraction, which assists in estimating high-efficiency augmented feature sets. Table 6 shows the recall levels as follows:

Fig. 7 Precision obtained during the classification process

Table 6 Recall obtained during the classification process

Considering this evaluation and its visualization in Fig. 8, it can be seen that the proposed model can increase classification recall by 38% compared to ISST [40], 34.9% compared to GAN [41], and 28.1% compared to UNet [17], making it highly beneficial for a range of real-time classification applications. This recall enhancement is due to the use of Iterative GA & LSTM with GRU during feature extraction, which assists in estimating high-efficiency augmented feature sets. These improvements allow the proposed model to identify classes in satellite images with high accuracy, precision, and recall. As a result, it applies to a wide range of real-time use cases.

Fig. 8 Recall obtained during the classification process

5 Conclusion

According to our research, data augmentation is a significant method for preventing a model from overfitting and for reducing the cost of labelling and cleansing the raw dataset. This study proposed a new model for improving the augmentation of satellite images that combines LSTM-based feature extraction with GRU-based feature extraction, representing input images as multimodal feature sets. The collected features train an efficient iterative genetic algorithm (IGA) that helps find high-density augmentation procedures and feature constants. These methods can improve the proposed model's accuracy, precision, and recall for various satellite image classification tasks. According to the evaluation, the suggested model improves classification accuracy by 16.4% compared to ISST, 17.1% compared to GAN, and 13.6% compared to UNet, making it very beneficial for a range of real-time classification applications. It also increases classification precision by 16.1% compared to ISST, 14.5% compared to GAN, and 12.2% compared to UNet; this improvement in precision is due to the use of LSTM and GRU during feature extraction, which helps estimate high-efficiency augmented feature sets. Estimates of the recall levels show that the suggested model improves classification recall by 38% compared to ISST, 34.9% compared to GAN, and 28.1% compared to UNet, making it very effective for a range of real-time classification scenarios.

As a future improvement, low-complexity, high-density feature extraction methods can be combined to improve the model. Classification results may also be improved using hybrid bioinspired models, autoencoders, Q-learning, or other deep learning methods.