1 Introduction

With the increasing number of satellite launches, advanced satellite products such as satellite images have significantly enhanced the quality of life. Satellite images, also known as remote sensing images, are widely applied across society. In agriculture, they are used for crop monitoring (Ali et al. 2022), yield prediction (Muruganantham et al. 2022), and irrigation planning (Foster and Mieno 2020). In forest management, they are adopted for mapping (Lewis et al. 2022), monitoring (Salaria et al. 2023), and other management tasks. In environmental monitoring, they are applied to tracking tasks (Dinh et al. 2023), water quality assessment (Singh et al. 2020), and natural disaster monitoring (Attia et al. 2022). They also serve critical functions in defense and intelligence missions. However, with the rapid development of image editing software and processing technology, manipulating satellite images by intentionally adding, removing, or modifying target objects is becoming easier. Recent satellite image forgeries have had severe negative effects and pose potential threats to nations, society, and individuals; examples include fake satellite images of the Malaysia Airlines Flight MH370 wreckage, Diwali celebrations in India, and the curved Hong Kong-Zhuhai-Macao Bridge, as shown in Fig. 1. Consequently, the forensics of satellite image forgery has attracted broad research interest in the security field, and with the growing use of satellite images, forensic technology that verifies their authenticity and integrity has become an urgent need.

Fig. 1 Examples of satellite image forgery. (a) The wreckage of Malaysia Airlines Flight MH370; (b) Diwali in India; (c) The curved Hong Kong-Zhuhai-Macao Bridge

Fig. 2 Categorization of satellite image forensics methods. Only representative works are listed in this diagram

Research in image forensics has witnessed significant progression and enrichment (Hao et al. 2021; Bhagtani et al. 2022; Kaur et al. 2023), including detection methods for the traces of basic image operations such as resampling, median filtering, copy-move-paste, splicing, image smoothing, sharpening, and other operations. It also encompasses detection methods for various image operation chains, including double JPEG compression, JPEG compression, combinations of resampling, continuous resampling, and combinations of diversified operations, as well as universal detection methods employing deep learning-based detectors. These methods examine the authenticity of digital images from multiple perspectives and enable effective detection. However, applying forensic methods to satellite image forgery remains a challenge due to the unique features of satellite images in terms of spectral bands and data representation, as well as the complexities involved in compression strategies, post-processing techniques, and sensor technology. Thus, the detection and localization of satellite image forgery, also known as satellite image forensics (Abady et al. 2024), has attracted increasing interest in the security field. Here, we focus solely on a systematic review of forensic methods for satellite image tampering.

Satellite image forgery can be conveniently divided into global tampering and local tampering. Global tampering (Abady et al. 2022; Yates et al. 2022; Mansourifar et al. 2023; Ghelichkhani et al. 2023; Alibani et al. 2024) usually generates fake satellite images directly using a Generative Adversarial Network (GAN). Local forgery generally involves three types: copy-move (copying and moving pixels from one area of the image to another), splicing (creating a new image by combining two or more image patches), and inpainting (filling a deleted region using an inpainting algorithm) (Chen et al. 2021). However, since satellite image tampering is an emerging research field, only the splicing operation has so far been adopted to construct forgery datasets (Yarlagadda et al. 2018; Horváth et al. 2020, 2021a, b; Niloy et al. 2023) in the research literature, and the tampering methods used for satellite images are similar to those used in general image forgery. Datasets for the other two tampering types are left to future studies. In terms of effect, global tampering of satellite images can produce images in different styles, whereas local splicing inserts objects taken from other satellite images, which is usually difficult to detect with the naked eye but changes key information.

In particular, multispectral satellite images are composed of multiple bands or channels. These channels capture the energy reflected or radiated by ground objects within specific wavelength ranges, indicating their unique properties. In remote sensing image analysis, the information of each channel is important, and the information of one channel may be used as a supplement to the information of another channel. Forgery can disrupt the interrelationships between the multiple channels of multispectral images, which may mislead the interpretation of surface features and result in unreliable analysis.

Since satellite images play an important role in remote sensing, research on detecting their tampering and verifying their authenticity has made some positive progress. Depending on the type of tampering, detection methods can be classified as global tampering detection and local tampering detection. Figure 2 lists the milestones in recent research on global and local tampering detection algorithms for satellite images. Our systematic investigation indicates that, with the improvement of computing power, detection technology has gradually transitioned from hand-crafted to deep learning methods, and that research in satellite image forensics has been increasing annually. Several overviews of general image forensic tasks have been published (Hao et al. 2021; Bhagtani et al. 2022; Kaur et al. 2023). Unlike these works, we address the gap by investigating recent developments in forensics research on satellite image tampering from the viewpoints of tampering means and forensic traces. We introduce two types of satellite image forgeries and their corresponding forensic methods, and then we highlight the underlying forensic clues and offer constructive suggestions for further research in this field. We aim to provide readers with a detailed survey of current state-of-the-art forensic techniques for satellite image tampering, together with potential future research directions. The following are the primary contributions of our work.

  • We provide a comprehensive survey of forensic methods for satellite image tampering, including definitions of forgery, public benchmark datasets, evaluation metrics, and a systematic comparison of existing forensic methods.

  • We summarize the detection and localization accuracy of representative methods in public benchmark datasets. Additionally, we list their characteristics, advantages, and key factors affecting detection efficiency.

  • We outline the challenges and future trends in satellite image forensics research, focusing on providing insightful guidance to this community.

The paper is laid out as follows. In Sect. 2, we expound on the basic content of satellite image forensics, including the concept of satellite images, the definition of forgery, and commonly used datasets. In Sect. 3, we introduce and analyze methods for global tampering detection. In Sect. 4, we categorize and discuss forensic methods for local tampering into three main groups, analyzing their advantages and disadvantages and comparing their performance across various datasets. Finally, in Sects. 5 and 6, we outline future research directions and provide a summary of the survey.

2 Base concepts

2.1 Concept of satellite images

Remote sensing is a comprehensive detection technology that uses sensors on platforms such as drones, airplanes, and artificial earth satellites to detect and reveal the characteristics and changes of objects by analyzing the electromagnetic wave signatures from a distance. From a macro perspective, remote sensing takes the Earth as the research object, based on the interaction between electromagnetic waves and Earth’s surface matter, to explore the spatial distribution characteristics and spatio-temporal changes of Earth’s resources and environment. A complete remote sensing system includes six parts: information characteristics of the measured target, information acquisition, information recording, transmission and reception, information processing, and information application.

Table 1 11 wave-bands of Landsat8 satellite images

Satellite images, also known as remote sensing images, carry information about remote sensing observation targets. Common sources of satellite images include various types of satellites, each with different purposes and characteristics. Satellites that commonly provide remote sensing images include the Landsat program, the Sentinel series, the National Oceanic and Atmospheric Administration (NOAA) series, and the Geostationary Operational Environmental Satellite (GOES) series. Some characteristics of satellite images are as follows.

  • High resolution and wide coverage. Remote sensing offers broad spatial coverage that allows the simultaneous observation of large areas. Currently, understanding the macroscopic spatial distribution of objects on the Earth’s surface often requires space remote sensing. For example, an American Landsat TM image covers an area of \(185 \times 185\) km, so slightly more than 500 images are enough to cover the entire territory of China.

  • Spectral range and rich information. Multi-spectral and hyper-spectral sensors can capture electromagnetic waves from visible light to infrared and even microwave bands. This allows satellite images to provide rich information about surface features, so the amount of information in remote sensing images far exceeds that in conventional images. For example, the spectral range of Landsat 8 remote sensing images covers 11 wave-bands, as detailed in Table 1.

  • Time-efficient. Geosynchronous meteorological satellites observe the Earth every half hour, while sun-synchronous orbit meteorological satellites observe the same area twice a day. Satellite images are used to monitor changes in nature, especially weather conditions and natural disasters, which demonstrates their timeliness.

  • Different distributions. The statistical properties of satellite images can be described by specific statistical models, such as negative exponential distribution, gamma distribution, etc., which help to understand the heterogeneity and non-uniformity of images. In contrast, the statistical properties of ordinary camera images typically exhibit Gaussian distribution (Zenghui and Wenxian 2016).

2.2 Definition of satellite image forgery

2.2.1 Global forgery

Global forgery usually occurs when fake images are generated using GAN methods (such as using CycleGAN to perform domain transfer on images) or diffusion models, and the generated images are defined as globally fake images. Forensic methods for this kind of forgery treat detection as a binary classification problem, that is, correctly classifying an image as real or fake:

$$\begin{aligned} \text{Detection}(\text{image}) = {\left\{ \begin{array}{ll} \text{fake} &{} \text{if the image is globally generated},\\ \text{real} &{} \text{if the image is original}. \end{array}\right. } \end{aligned}$$
(1)
Fig. 3 Examples of real images and fake images using global manipulation method CycleGAN

Fig. 4 Examples of locally spliced satellite images (left) and their forgery masks (right)

Table 2 Some satellite datasets for global manipulation detection
Table 3 Some satellite datasets for local manipulation detection

Examples of global forgery of satellite images generated by CycleGAN are shown in Fig. 3b. Comparing Fig. 3a with Fig. 3b reveals that the real and fake satellite images are nearly indistinguishable to the naked eye, which poses a significant challenge for satellite image forensics. However, this kind of global forgery relies on deep networks that down-sample the input and then apply bilinear up-sampling to synthesize whole high-resolution, high-quality satellite frames from the feature space. Although these techniques can produce satellite images that appear natural and smooth and whose manipulation cannot be observed with the human eye, they unavoidably leave subtle tampering traces between the various signal bands, because of the different compression schemes, post-processing strategies, and sensors involved.

2.2.2 Local forgery

Local forgery is mainly concerned with splicing operations in satellite images, that is, using an object spliced from another image to cover a certain area of the image, which can fabricate certain objects or mask their existence. Accordingly, satellite image forgery forensics can be approached from two dimensions: tampering detection and tampering localization. Tampering detection aims to verify the authenticity of satellite images, determining whether an image has been forged. In contrast, tampering localization seeks to identify the specific areas where splicing tampering occurred (Abady et al. 2022a).

Usually, a satellite image \(\textbf{f}\) has a resolution of \(X \times Y\), and \(\textbf{f}(x,y)\) represents the grayscale value at position \((x,y)\) in the satellite image. A forgery mask \(\textbf{M}\) with the same resolution as the satellite image indicates its integrity. \(\textbf{M}\) takes only two distinct values: 1 at points in the tampered area and 0 at points in the original area. Formally,

$$\begin{aligned} \textbf{M}(x,y)= {\left\{ \begin{array}{ll} 1 &{} \text{if } \textbf{f}(x,y) \text{ is forged},\\ 0 &{} \text{otherwise}. \end{array}\right. } \end{aligned}$$
(2)
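
As an illustration of Eq. (2), the following minimal Python sketch splices a small donor patch into a host image and builds the matching forgery mask. The function name `splice`, the toy pixel values, and the use of nested lists indexed as [row][column] are assumptions made for this example; they are not taken from any of the surveyed datasets.

```python
def splice(host, donor, top, left):
    """Paste `donor` into a copy of `host` at (top, left).

    Returns (forged, mask), where mask[y][x] is 1 for every pixel that was
    overwritten by the donor patch and 0 elsewhere, as in Eq. (2).
    """
    rows, cols = len(host), len(host[0])
    forged = [row[:] for row in host]            # leave the host untouched
    mask = [[0] * cols for _ in range(rows)]     # same resolution as the image
    for dy, donor_row in enumerate(donor):
        for dx, value in enumerate(donor_row):
            y, x = top + dy, left + dx
            if 0 <= y < rows and 0 <= x < cols:  # ignore parts outside the host
                forged[y][x] = value
                mask[y][x] = 1
    return forged, mask


# Toy example: a 4 x 6 "satellite image" and a 2 x 2 spliced object.
host = [[10] * 6 for _ in range(4)]
donor = [[200, 200], [200, 200]]
forged, mask = splice(host, donor, top=1, left=3)

for mask_row in mask:
    print(mask_row)
```

A localization method is then expected to output a map that is as close as possible to `mask`, while a detection method only has to decide whether `mask` contains any nonzero entry.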

Some examples of local forgery of satellite images are shown in Fig. 4a–f, including the tampered images and their corresponding forgery masks, in which spliced regions are marked in white. Figure 4a is a Landsat8 satellite image that has forged pixels from a normal camera. Figure 4b is a Sentinel-2 satellite image containing a spliced object. Figure 4c is a WorldView-3 satellite image containing a spliced object from a PlanetScope satellite image. Figure 4d, e are Sentinel-2 satellite images that include forged pixels generated by GANs. Figure 4f is a WorldView-3 satellite image containing a spliced object. All of these forged images look realistic, which presents a challenge for forensics. In fact, local tampering of satellite images can be regarded as a complex signal processing system: because the tampering operation adds objects to the satellite image, it may change the inter-pixel correlation between the object regions and the background regions to a certain extent. In addition, the correlation between pixels in different bands of satellite images may also change. These changes provide clues for further developing forensic methods for satellite image tampering.

2.3 Forgery dataset of satellite images

In this subsection, we introduce several commonly used datasets in research for the detection of satellite image forgeries. For unavailable datasets, we do not provide download links and only give a brief introduction based on their content in the corresponding papers. In Tables 2 and 3, we have listed the comparison of global and local tampering datasets, and detailed information about these datasets is described in the next two subsections.

2.3.1 Global forgery dataset of satellite images

UW (Zhao et al. 2021): the UW dataset was created using the image-to-image translation method based on CycleGAN, and is the first open-source dataset for forged satellite images. The real satellite images are downloaded from Google Earth, and the fake ones are synthesized images of the Seattle and Beijing urban landscape produced by CycleGAN on the CartoDB base map. In this dataset, the ratio of real satellite images to fake ones is 1:1.

LC, Scand and China (Abady et al. 2022b): the creation of these three datasets marks the first focus on multi-spectral satellite image forensics by researchers. These datasets use Sentinel-2 Level-1C images as the original images and synthesize fake images by applying GAN-based style transfer to the original images. For each dataset, two GANs are trained, one for generating 13-band images and the other for generating 4-band images. The Land Cover Dataset (LC) is created using CycleGAN to transform between barren and vegetated landscapes. The Scandinavian Dataset (Scand) and China Dataset (China) were created using Pix2pix to transfer from summer to winter and vice versa. Each dataset has 4000 original images and 4000 generated images.

2.3.2 Local forgery dataset of satellite images

Dataset1 (Yarlagadda et al. 2018): this is the first dataset used for satellite image forensics. Dataset1 is built from Landsat8 satellite images and comprises 130 real images of \(650 \times 650\) pixels. Of these, 30 images are sliced into \(64 \times 64\) patches for training and validation, with 20% of the patches held out for validation, so \(D_{train}\) contains 8664 patches while \(D_{val}\) has 2166. The remaining 100 images are used to create \(D_{test}\): 50 serve as original images and the other 50 are used to produce forged images. Splicing objects of three sizes (Small, Medium, and Large, with \(32 \times 32\), \(64 \times 64\), and \(128 \times 128\) pixels respectively), such as airplanes and clouds, are spliced onto the 50 selected images at random positions, creating a total of 150 forged images. Note that all the forged pixels come from a normal camera. Of the 150 forged images, 50 contain small objects (\(D_S\)), 50 contain medium objects (\(D_M\)), and 50 contain large objects (\(D_L\)). Therefore, \(D_{test}\) contains 200 images: 50 pristine images and 150 forged images.
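
The patch and image counts reported for Dataset1 can be checked with a short calculation. The sketch below assumes a sliding window with a stride of 32 pixels; this stride is not stated in the text and is used here only because it reproduces the reported totals (19 × 19 patches per image). All variable names are illustrative.

```python
def patches_per_side(image_size, patch_size, stride):
    """Number of sliding-window positions along one image axis."""
    return (image_size - patch_size) // stride + 1


IMAGE_SIZE = 650      # pixels per side, from the dataset description
PATCH_SIZE = 64       # pixels per side, from the dataset description
STRIDE = 32           # ASSUMPTION: not given in the text, inferred from the totals
NUM_IMAGES = 30       # images used for training and validation
VAL_FRACTION = 0.2    # share of patches held out for validation

per_side = patches_per_side(IMAGE_SIZE, PATCH_SIZE, STRIDE)
total_patches = NUM_IMAGES * per_side ** 2
n_val = round(total_patches * VAL_FRACTION)
n_train = total_patches - n_val

# Test set: 50 pristine images plus 50 forged images for each of 3 object sizes.
n_pristine = 50
n_forged = 50 * 3
n_test = n_pristine + n_forged

print(per_side, total_patches, n_train, n_val, n_test)
```

Under this stride assumption the split yields 8664 training patches and 2166 validation patches, which matches the sizes of \(D_{train}\) and \(D_{val}\) given above.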

Dataset2 (Horváth et al. 2020): it is constructed from orthorectified images from the Sentinel-2 satellite. All images were first cropped to a size of \(1000 \times 1000\) pixels, and then 293 images were selected. From these, 100 satellite images were chosen to generate 500 forged images through splicing operations. Note that the spliced objects are extracted from satellite images, which are common and less likely to cause suspicion. These spliced objects include clouds, airplanes, smoke, and drones, with sizes of \(16 \times 16\), \(32 \times 32\), \(64 \times 64\), \(128 \times 128\), and \(256 \times 256\) pixels. The training set includes 98 original images, while the test set contains 595 images, including 95 original images, 500 forged images, and their corresponding forgery masks. Compared to Dataset1, Dataset2 offers a greater variety of shapes for splicing objects, allowing for a more thorough verification of the effectiveness of forensic techniques.

Dataset3 (Horváth et al. 2021a): the base data of Dataset3 comes from WorldView-3 satellite. Splicing objects are extracted from images captured by the PlanetScope satellite and spliced into WorldView-3 satellite images to form forged images. Its training set includes 28 original images, while the test set includes 859 forged images and their corresponding forgery masks. The original pixels and forged pixels in Dataset3 come from different satellites, enriching the previous satellite image forgery data where forged pixels came from ordinary cameras.

Dataset4-6 (Horváth et al. 2021b): the original satellite images in these three datasets are from the Sentinel-2 satellite, with an image resolution of \(512 \times 512\) pixels. The proportion of original and forged images in each dataset is approximately 1:1. Dataset4 contains splicing objects generated by StyleGAN2 (Karras et al. 2020), Dataset5 contains splicing objects generated by CycleGAN (Zhu et al. 2017), and Dataset6 contains splicing objects generated by ProGAN (Karras et al. 2017). StyleGAN2, ProGAN, and CycleGAN are trained on Sentinel-2 images (Schmitt et al. 2019). Each GAN is trained on a portion of the original images until it can synthesize images similar to them. Subsequently, unsupervised Watershed segmentation (Roerdink and Meijster 2000) is applied to the original image to divide it into regions, and segments covering 10% or 50% of the image area are selected as the splicing region. The forged images are created by cropping the spliced objects from the GAN-generated images and splicing them into the original images. Dataset4, Dataset5, and Dataset6 contain 17,921, 17,438, and 17,640 images respectively. Dataset4-6, created using deep learning technology, demonstrate that as technology advances, so too do the methods of satellite image forgery, necessitating more sophisticated forensic techniques for detection.

Dataset7 (Niloy et al. 2023): the original images come from the benchmark dataset DeepGlobe (Demir et al. 2018), and the forged images are created using the method of Dataset2, with a resolution of \(1000 \times 1000\) pixels. Due to the scarcity of public datasets, the emergence of Dataset7 enriches the tampered satellite image dataset.

3 Forensic methods for global forgery of satellite images

According to the display format, the forensic methods for global forgery of satellite images can be classified as global tampering detection in RGB satellite images and multi-spectral satellite images.

3.1 Global tampering detection of RGB satellite images

Satellite images can be displayed in the traditional RGB format, hence most image forensics algorithms can be adopted directly or indirectly for satellite image forensics. However, due to different compression schemes, post-processing, and sensors, existing image forensic methods exhibit varying levels of detection accuracy. In this setting, the multi-band satellite image to be tested is first converted into RGB bands and then examined with advanced image forensic algorithms. The typical methods of this kind are summarized below as RGB image forensics methods for satellite images.

Fig. 5 CycleGAN’s application to generate fake satellite images

Zhao et al. (2021) observed that deepfakes primarily rely on GANs as the underlying algorithmic mechanism, and created the first publicly available deepfake satellite image dataset using CycleGAN (as shown in Fig. 5). The visual features that distinguish GAN-generated fake images from real ones include color and texture inconsistency and frequency domain abnormalities. In terms of these features, fake satellite images differ significantly from real ones, exhibiting more complex and uneven textures, more skewed gray-scale histograms, and clearer edges. Therefore, 26 hand-crafted features are extracted from the color histograms and the spatial and frequency domains, and then fed into a Support Vector Machine (SVM) to identify image authenticity.

Fig. 6 Geo-DefakeHop method (Chen et al. 2021) of geographic fake image detection

Inspired by DefakeHop (Chen et al. 2021), Chen et al. (2021) proposed a robust method to detect forged satellite images, namely Geo-DefakeHop, as shown in Fig. 6. The method assumes that GANs can generate low-frequency components well but struggle to generate high-frequency components effectively. Based on this, focusing on the differences in high-frequency components between original and forged satellite images can aid correct classification. The method applies multiple single-level Saab transforms in parallel to capture discriminative features, and then uses an XGBoost classifier to classify satellite images as real or fake. Experiments have shown that high-frequency channels are more discriminative for fake satellite image detection than low-frequency channels. During testing, the most discriminative features of several channels, selected on the basis of validation-set performance, are fused and passed to a classifier for binary classification.
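
The high-frequency assumption can be illustrated with a toy measurement. The sketch below is not the Saab transform used by Geo-DefakeHop; it swaps in a simple 3 × 3 box blur as a low-pass filter and treats the residual as the high-frequency part of a grayscale image. The function names and the two synthetic test images are assumptions made for this example only.

```python
def box_blur(img):
    """3 x 3 mean filter; windows are clipped at the image border."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(rows, y + 2))
                      for i in range(max(0, x - 1), min(cols, x + 2))]
            out[y][x] = sum(window) / len(window)
    return out


def high_frequency_energy(img):
    """Mean squared residual after removing the low-pass (blurred) component."""
    low = box_blur(img)
    residual_sq = [(img[y][x] - low[y][x]) ** 2
                   for y in range(len(img))
                   for x in range(len(img[0]))]
    return sum(residual_sq) / len(residual_sq)


# A flat image has no high-frequency content; a checkerboard has a lot.
flat = [[128] * 8 for _ in range(8)]
checker = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]

print(high_frequency_energy(flat))
print(high_frequency_energy(checker))
```

A scalar such as this could serve as one input feature for a classifier; in the method described above, the features come from learned Saab filters and are fed to XGBoost instead.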

The authors verified the performance of Geo-DefakeHop under several post-processing distortions, namely image resizing, additive white Gaussian noise, and JPEG compression. The results show that the model performs slightly worse under image resizing and JPEG compression than without post-processing. Under additive white Gaussian noise, the detection accuracy of Geo-DefakeHop is significantly reduced, but it remains better than that of other models due to its integrated nature. This suggests that Geo-DefakeHop is comparatively robust.

Fezza et al. (2022) then explored the applicability of four typical convolutional neural network (CNN) architectures, namely VGG16 (Simonyan 2014), ResNet-50 (He et al. 2016), Inception-V3 (Szegedy et al. 2016), and Xception (Chollet 2017), for fake satellite image detection. In this method, transfer-learning techniques are used to train and fine-tune these classical classification networks to classify real or fake satellite images, yielding satisfactory results on the UW dataset. The paper also concludes that deep-learning-based approaches achieve higher accuracy than the hand-crafted algorithm of Zhao et al. (2021) (i.e. spatial, histogram, and frequency features). The authors applied two post-processing operations to the test images, adding white Gaussian noise and JPEG compression, to verify the robustness of the four CNN models. Under JPEG compression, the detection performance of the Inception-V3 and Xception models decreased as the Quality Factor (QF) decreased, while the performance of VGG16 and ResNet-50 was almost unaffected. However, the detection performance of all models decreased significantly on images with added Gaussian noise, which indicates that existing CNN models struggle with Gaussian noise attacks. In addition, the authors conducted the first comprehensive evaluation of CNN performance on fake satellite image detection, promoting the development of geographic fake image detection, an underdeveloped field.

Fig. 7 The framework diagram of the hybrid network proposed in Liu et al. (2024)

A lightweight and efficient hybrid network is proposed by combining CNN and Transformers for deepfake satellite image detection tasks (Liu et al. 2024), shown in Fig. 7. Although the excellent performance of fake satellite image detection had been achieved using CNNs, these networks had limited ability to model global information over long distances due to the focus on local information modeling by convolutional layers. Subsequently, inspired by the tremendous success of Transformers in image classification, the authors introduced it to the detection of deepfake satellite images. They designed a hybrid network that includes convolutional-based local feature blocks (LFB) and Transformer-based global feature blocks (GFB), which can offer a powerful blend of local spatial information and global semantic information in satellite images. Furthermore, Channel Attention (CA) is also introduced in LFBs to make the model more focused on important spatial information. The proposed hybrid model exhibits almost perfect detection capability on the UW dataset.

The authors used three post-processing operations: JPEG compression, adding Gaussian noise, and applying Gaussian blur, to compare the robustness of the proposed method with other advanced methods. For JPEG compression, the evaluation metrics for the CNN network, Transformer network, and the proposed Hybrid network all decrease as the QF value decreases, while the performance of the manual feature-based method improves due to capturing more discriminative features. Adding noise and applying blur significantly reduce the detection ability of all methods, but the proposed method still performs better than others. The hybrid network performs well under all attacks, proving its good robustness.

Fig. 8 VQ-VAE2 with 3-layer structure in Abady et al. (2024)

3.2 Global tampering detection of multi-spectral satellite images

Besides the RGB format, satellite images are also available in multi-spectral format. Some researchers have therefore proposed forensic methods from a multi-spectral perspective, and representative methods are summarized below.

In this research direction, Abady et al. (2022b) conducted the first authenticity study on fake multi-spectral satellite images generated by GANs. They applied CycleGAN for land cover style transfer or Pix2pix for seasonal transfer on Sentinel-2 Level-1C satellite images and created three generated satellite image datasets with 13 or 4 bands, i.e. the Land Cover Dataset, the Scandinavian Dataset, and the China Dataset. The detection capability of EfficientNet-B4, in particular a variant without downsampling in the initial layer, was explored for multi-spectral fake satellite images, and good results were obtained. To classify multi-spectral satellite images with EfficientNet-B4, the authors modified the input channel dimension to 4 or 13 to match the images. In matched detection scenarios, where both training and test sets are produced by the same GAN, the detector achieves a detection accuracy of more than 0.98. However, in mismatched detection scenarios, especially for the Land Cover Dataset, the detection accuracy degrades seriously due to different spectral distributions.

To overcome this weak generalization capability, the same team proposed a one-class classification detection method based on VQ-VAE2 (Abady et al. 2024) to identify the authenticity of multi-spectral satellite images. The modified VQ-VAE2 model uses three levels of latent space, shown in Fig. 8, enabling it to learn more complex spectral distributions and to capture both global and local image information. The method uses VQ-VAE2 (Van Den Oord and Vinyals 2017; Razavi et al. 2019) to reconstruct the original image and distinguishes GAN-generated satellite images from original ones by analyzing the reconstruction differences between the input and output images. The authors conducted experiments on three datasets: LC, Scand, and China. Since the multi-spectral satellite images used have 4 bands or 13 bands, the authors trained a VQ-VAE2 model for each band. During the prediction phase, a reconstruction error is obtained with VQ-VAE2 for each band of a satellite image, and the error is thresholded to indicate whether that band of the image is real or generated. To determine the detection threshold, 100 original images are selected from the test set of each dataset and the false positive rate of the model on these 100 images is fixed at 0.1, which calibrates the VQ-VAE2 detector to a known false-alarm level.
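
The threshold selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the reconstruction error is taken to be one scalar per image (for example a mean squared error), the function names `threshold_at_fpr` and `classify` are hypothetical, and the pristine errors are synthetic placeholder values rather than measured results.

```python
import math


def threshold_at_fpr(pristine_errors, target_fpr):
    """Pick a threshold so that at most `target_fpr` of pristine errors exceed it."""
    if not pristine_errors or not 0.0 <= target_fpr < 1.0:
        raise ValueError("need at least one error and 0 <= target_fpr < 1")
    ordered = sorted(pristine_errors)
    n = len(ordered)
    # Number of pristine images that may be (wrongly) flagged as generated.
    allowed = math.floor(target_fpr * n + 1e-9)
    # An image is flagged when its error is strictly greater than the threshold.
    return ordered[n - allowed - 1]


def classify(error, threshold):
    """One-class decision: large reconstruction error means 'fake'."""
    return "fake" if error > threshold else "real"


# Placeholder errors for 100 pristine calibration images (synthetic values).
pristine_errors = [i / 100 for i in range(100)]
threshold = threshold_at_fpr(pristine_errors, target_fpr=0.1)
false_alarms = sum(classify(e, threshold) == "fake" for e in pristine_errors)

print(threshold, false_alarms)
```

Because the threshold depends only on pristine images, no generated samples are needed for calibration, which is what allows a one-class detector of this kind to be applied to images from unseen generators.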

VQ-VAE2 is trained on the pristine images and then directly tests fake satellite images from different GANs. Even when encountering new types of GAN-generated satellite images, it does not need to be retrained to fit the distribution of these new synthesized satellite images to detect them correctly. As a result, this detector has superior generalization capability.

3.3 Forensic survey for global tampering of satellite images

Based on the description and analysis mentioned above, the highlights of forensic methods for the global manipulation of satellite images are summarized in Table 4. In general, each forensic method for fake satellite images uses different forensic characteristics and hence has different superiority. In the field of satellite image forensics, more in-depth research into the analysis of generation traces and the characteristics of generative networks is needed to develop more efficient detection strategies aimed at improving accuracy.

Table 4 Forensic survey for global tampering of satellite images

4 Forensic methods for local tampering of satellite images

According to the backbone models adopted, forensic methods for local tampering of satellite images are classified into three categories: generative model-based approaches, segmentation model-based approaches, and mixture model-based approaches. The key idea of the generative model-based approach is to use a generative model to learn the distribution of pristine satellite images, distinguish pristine satellite images from manipulated ones, and then locate the spliced area in the manipulated satellite image. These methods are usually unsupervised and trained without annotated data. The central idea of the segmentation model-based approach is to transform the satellite image tampering detection task into a pixel-level binary classification problem. Through end-to-end segmentation algorithms, each pixel in the satellite image is classified as tampered or untouched, thereby localizing the spliced area in the tampered image. These methods are usually supervised and require annotated data during training, but often achieve better localization performance. The last category generally integrates generative model-based approaches with segmentation model-based approaches to combine the advantages of both for better detection performance and generalization ability. The information for all types of methods is listed in Table 5.

Table 5 The sources of all methods, the reasons for their selection, the categories they belong to, and the basis for their classification

4.1 Generative model-based methods

The generative model is a type of probabilistic model that can approximate the unknown probability distribution underlying the training dataset. After training, samples are drawn from the generative model to synthesize new observations that resemble the data in the training dataset. Currently, there are two families of generative models: machine learning-based and deep learning-based models. Among them, the Hidden Markov Model (HMM) (Eddy Sean 2004), the Gaussian Mixture Model (GMM) (Reynolds et al. 2009), and the Deep Belief Network (DBN) (Hinton and Salakhutdinov 2006), etc. belong to the former category, while the Autoregressive model (Bond-Taylor et al. 2021), the AutoEncoder (Goodfellow et al. 2016), the Generative Adversarial Networks (GAN), the Normalizing Flow (Flow) (Dinh et al. 2014), and the Denoising Diffusion Probabilistic Model (Diffusion) (Ho et al. 2020), etc. belong to the latter category. The rest of this subsection investigates the application of generative models to satellite image tampering detection and localization tasks.

As the forerunner, Yarlagadda et al. used the autoencoder (Goodfellow et al. 2016) to extract discriminative features for detecting forged satellite images (Yarlagadda et al. 2018). They first divide the original image into small overlapping patches of \(64 \times 64\) pixels, and these patches are then reconstructed by the autoencoder. In this way, the encoder of the autoencoder can capture the probability distribution of the pristine satellite image. Here, the autoencoder is used as a generator, and a discriminator is added to form a GAN structure, which can then be trained adversarially. The structure of the GAN is shown in Fig. 9a. Because of the strong classification capability of the One-Class SVM (Wang et al. 2004), this method uses it to learn the features of each original image patch and to detect, on a patch-by-patch basis, whether the test satellite image contains tampered parts, as shown in Fig. 9b. Localization results are obtained by stitching the SVM output for each patch of the image at the corresponding position and applying a threshold. The authors conducted experiments on Dataset1 and achieved good results.

Fig. 9

The proposed autoencoder and One-Class SVM method (Yarlagadda et al. 2018)

To perform integrity checks on high-resolution satellite images, manipulation detection is usually performed patch by patch. Although this improves inspection efficiency, individual image patches lack spatial contextual information, making it difficult to distinguish and accurately locate the spliced areas within satellite imagery at the pixel level. The same image-to-patch strategy is also used in the following methods (Horváth et al. 2019, 2020, 2021a).
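A minimal sketch of the image-to-patch strategy follows. The \(64 \times 64\) patch size comes from the text; the stride of 32 pixels and the function name `extract_patches` are assumptions made only for illustration.

```python
import numpy as np

def extract_patches(image, patch=64, stride=32):
    """Split an (H, W, C) image into overlapping square patches.

    Returns the stacked patches and the top-left (row, col) of each patch.
    Border pixels are skipped when H or W is not aligned with the stride.
    """
    h, w = image.shape[:2]
    coords = [(r, c)
              for r in range(0, h - patch + 1, stride)
              for c in range(0, w - patch + 1, stride)]
    patches = np.stack([image[r:r + patch, c:c + patch] for r, c in coords])
    return patches, coords

# Example: a blank 256 x 256 RGB image yields a 7 x 7 grid of patches.
image = np.zeros((256, 256, 3), dtype=np.float32)
patches, coords = extract_patches(image)
```

Each patch is then scored independently, which is why the contextual information outside the patch is not available to the detector.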

The same team used the GAN strategy and the One-Class classifier to identify tampered regions in satellite images with high localization accuracy (Yarlagadda et al. 2018). At the same time, the Conditional Generative Adversarial Network (Conditional GAN) (Mirza and Osindero 2014) is employed to discover the hidden correlation between satellite images and their corresponding forgery masks (Bartusiak et al. 2019).

Fig. 10

The pix2pix structure used in method (Bartusiak et al. 2019)

Subsequently, Bartusiak et al. extended pix2pix, a variant of the Conditional GAN, to perform tampering localization (Bartusiak et al. 2019), and its structure is provided in Fig. 10. Dataset1 is selected for the experiments because it provides both satellite images (I) and their forgery masks (M). I and M (used as conditions) are embedded into the generator. Through GAN training, the correspondence between I and M is learned so that the generator ultimately produces a soft forgery mask (\(\hat{\textbf{M}}\)) similar to the forgery mask (M), i.e. \(\hat{\textbf{M}} \approx\) M. Here, the loss of the generator is represented as.

$$\begin{aligned} \mathcal {L}=\mathcal {L}_{cGAN} + \lambda \mathcal {L}_R \end{aligned}$$
(3)

where \(\mathcal {L}_{cGAN}\) and \(\mathcal {L}_R\) are the cGAN loss and the binary cross entropy, respectively, and \(\lambda\) is a weighting coefficient between the two terms. As a result, the generator of the Conditional GAN can be regarded as a tampering detector for satellite images. Based on the number and position of pixels with a value of 1 in \(\hat{\textbf{M}}\), it can be determined which positions in the image have been tampered with. This method requires manipulated satellite images and the corresponding masks for training and thus belongs to supervised learning. Besides, threshold processing is a necessary step of this method to locate the forged regions in satellite images. In brief, this method achieves 100% detection performance for Dataset1.
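As a rough numerical illustration of Eq. (3), the sketch below evaluates the generator objective on NumPy arrays. It assumes the minimax form of the cGAN term, in which the generator minimizes \(\log (1 - D(\cdot ))\), and a hypothetical default weight of 10; the discriminator outputs are stand-in arrays, and none of these choices are confirmed by the source.

```python
import numpy as np

EPS = 1e-7  # keeps the logarithms finite

def binary_cross_entropy(mask_true, mask_pred):
    """L_R: mean binary cross entropy between M and the soft mask M-hat."""
    p = np.clip(np.asarray(mask_pred, dtype=float), EPS, 1.0 - EPS)
    t = np.asarray(mask_true, dtype=float)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))

def generator_loss(d_fake, mask_true, mask_pred, lam=10.0):
    """L = L_cGAN + lam * L_R, with the minimax cGAN term (an assumption).

    d_fake holds the discriminator's probabilities that the generated
    (image, mask) pairs are real; lam is a hypothetical weight.
    """
    d = np.clip(np.asarray(d_fake, dtype=float), EPS, 1.0 - EPS)
    l_cgan = float(np.mean(np.log(1.0 - d)))
    return l_cgan + lam * binary_cross_entropy(mask_true, mask_pred)

# Toy example: a perfect mask prediction and an undecided discriminator.
mask = np.array([[0.0, 1.0], [1.0, 0.0]])
loss = generator_loss(np.array([0.5]), mask, mask)
```

With a perfect mask the reconstruction term vanishes, so the loss reduces to the adversarial term alone.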

Evolving from Yarlagadda et al.’s method, Horváth et al. designed an improved Deep Support Vector Data Description (SVDD), i.e., Satellite Support Vector Data Description (SatSVDD) (Horváth et al. 2019), as a one-class classifier to detect and localize spliced regions using only information from pristine satellite images. In this method, an autoencoder obtains discriminative features from image patches, which are then fed into SVDD to construct a hypersphere for the final decision. Similarly to Yarlagadda et al. (2018), it uses a reconstruction error to force the autoencoder to reconstruct small patches of the original image. Because no forged images are seen during training, SatSVDD treats image patches that fall outside the hypersphere during prediction as forged and assigns them a large outlier score. The outlier scores are combined in order of position to generate a soft forgery mask \(\hat{\textbf{M}}\). Where image patches overlap, the value of \(\hat{\textbf{M}}\) is the average of all outlier scores at that position. To enhance the accuracy of tampering detection, an \(\hat{\textbf{M}}\)-based detection function is proposed as.

$$\begin{aligned} d(\hat{\textbf{M}})=\frac{max(\hat{\textbf{M}}) - \mu _{\hat{\textbf{M}}}}{\sqrt{\frac{\sum \nolimits _{x \in I} (x- \mu _{\hat{\textbf{M}}})^{2} }{max(|{\hat{\textbf{M}}|})}}} \end{aligned}$$
(4)

where \(\mu _{\hat{\textbf{M}}}\) and max(\(\hat{\textbf{M}}\)) are the average and maximum values of all pixels in \(\hat{\textbf{M}}\), respectively. Perhaps because the autoencoder and SVDD are trained simultaneously, this method has achieved better results. In addition, a comprehensive manipulation detection score was proposed, which raises the ROC_AUC and PR_AUC of tampering detection above 92%.
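The mask stitching and the detection function can be sketched as follows. This is an illustration under one stated assumption: the denominator of Eq. (4) is read as the population standard deviation of the values in \(\hat{\textbf{M}}\), which is an interpretation rather than a statement of the authors' implementation. The function names are hypothetical.

```python
import numpy as np

def stitch_soft_mask(scores, coords, shape, patch=64):
    """Place per-patch outlier scores into a soft mask, averaging overlaps."""
    acc = np.zeros(shape, dtype=float)   # sum of scores covering each pixel
    cnt = np.zeros(shape, dtype=float)   # number of patches covering each pixel
    for s, (r, c) in zip(scores, coords):
        acc[r:r + patch, c:c + patch] += s
        cnt[r:r + patch, c:c + patch] += 1.0
    return acc / np.maximum(cnt, 1.0)    # uncovered pixels stay at 0

def detection_score(soft_mask):
    """(max - mean) / std of the soft mask; 0 for a constant mask."""
    m = np.asarray(soft_mask, dtype=float)
    std = m.std()
    return float((m.max() - m.mean()) / std) if std > 0 else 0.0

# Toy example: two 4 x 4 patches overlapping by two columns.
soft_mask = stitch_soft_mask([1.0, 3.0], [(0, 0), (0, 2)], shape=(4, 6), patch=4)
score = detection_score(soft_mask)
```

Under this reading, a mask with one strongly outlying region gives a large score, while a flat mask gives a score near zero.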

A one-class classification approach (Horváth et al. 2020) based on the principle of the DBN was developed to locate manipulated regions in satellite images. The authors conducted experiments on Dataset2. Training requires only original satellite images, and after training, forged regions can be identified because they deviate from the distribution of the original satellite images.

Fig. 11

DBN structure used in method (Horváth et al. 2020)

Since RBM (Freund et al. 1991) is a symmetrically coupled, stochastic recurrent neural network, the DBN introduces two-level RBMs as an encoder-decoder structure to reconstruct images, shown in Fig. 11. The first-level RBM maps the image block to a hidden representation, while the second-level RBM uses the hidden representation to obtain the reconstructed image. Because it is trained only on original images, the DBN learns only the distribution of original images and cannot learn that of spliced objects. As a consequence, the DBN can accurately reconstruct the original patches, but it is unable to reconstruct the spliced regions. Therefore, image patches containing spliced objects obtain larger mean square error (MSE) values than the original ones during DBN reconstruction. Based on the calculated MSE values, a heatmap is generated. The positions with higher values in the heatmap are more likely to belong to the tampered area. A threshold is then applied to the heatmap to obtain the final forgery mask, indicating the potentially tampered area in the image. This model is highly competitive in detecting small spliced objects.
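The reconstruction-error heatmap can be illustrated with a small sketch. The reconstruction is passed in as an array, standing in for the output of a trained model; the non-overlapping 8-pixel patches, the threshold of 0.5, and the function name are hypothetical choices for the example only.

```python
import numpy as np

def patch_mse_heatmap(image, reconstruction, patch=8):
    """Fill each non-overlapping patch with its mean squared reconstruction error."""
    h, w = image.shape[:2]
    heat = np.zeros((h, w), dtype=float)
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            diff = (image[r:r + patch, c:c + patch].astype(float)
                    - reconstruction[r:r + patch, c:c + patch].astype(float))
            heat[r:r + patch, c:c + patch] = np.mean(diff ** 2)
    return heat

# Toy example: a bright "spliced" block that the model fails to reproduce.
image = np.zeros((16, 16))
image[8:16, 8:16] = 1.0
reconstruction = np.zeros((16, 16))   # stand-in for a model's output

heat = patch_mse_heatmap(image, reconstruction)
mask = heat > 0.5                      # thresholding yields the forgery mask
```

Pristine patches reconstruct well and stay below the threshold, so only the block that the model cannot reproduce is marked.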

Fig. 12

Vision transformer pipeline proposed in method (Horváth et al. 2021a)

When an autoencoder is trained to reconstruct images of a known target class, that class can afterwards be distinguished from other classes based on the reconstruction error. This idea of reconstruction error is therefore widely adopted in one-class classifiers, which are designed to identify a known target class and to treat other, unknown classes as abnormal data. This idea is also utilized in the detection of satellite image manipulation, such as in Van Den Oord et al. (2016) and Horváth et al. (2021a).

As an attempt, Montserrat et al. integrate autoregressive models, PixelCNNs (Van Den Oord et al. 2016), and Gated PixelCNNs (Van den Oord et al. 2016) to detect and localize satellite image tampering (Montserrat et al. 2020), using Dataset2 as the satellite image dataset. The adopted PixelCNNs obtain the global distribution of satellite images by modeling the distribution of all pixels. The distribution of the image f is computed by multiplying the conditional distributions of all pixels \(fp_i\) together, expressed as.

$$\begin{aligned} p(f)=\prod _{i=1}^N p(fp_i|fp_1,\dots ,fp_{i-1}) \end{aligned}$$
(5)

The model generates all pixels one by one, and the current pixel value \(fp_i\) is computed on the basis of all previously generated pixels, \(fp_{1},\dots , fp_{i-1}\). In RGB satellite images, the values of each pixel on the R, G, and B channels are computed sequentially, calculated as.

$$\begin{aligned} p(fp_i|f_{<i})&= p(fp_i|fp_1,\dots ,fp_{i-1}) \nonumber \\ &= p(fp_{i,R}|f_{<i}) p(fp_{i,G}|f_{<i},fp_{i,R}) p(fp_{i,B}|f_{<i},fp_{i,R},fp_{i,G}) \end{aligned}$$
(6)

Subsequently, this process is extended to multi-spectral satellite images, where the value at each position in the image is a multi-tuple, defined as.

$$\begin{aligned} p(fp_i|f_{<i}) = \prod _{j=1}^C p(fp_{i,j}|f_{<i},fp_{i,1},\dots ,fp_{i,j-1}) \end{aligned}$$
(7)

where C and \(fp_{i,j}\) are the channel number and the ith pixel of the jth channel of the satellite image, respectively.

Since the training phase is performed only on pristine images, PixelCNNs and Gated PixelCNNs learn the distribution of pristine satellite images and can detect regions that fall outside the learned distribution. The authors then use the negative log-likelihood to determine whether the image contains tampered pixels. A tampered pixel receives a low likelihood, while an untouched pixel receives a high one. Accordingly, if the negative log-likelihood of a pixel is greater than the set threshold, the pixel is considered to have been tampered with. This method is also unsupervised. Compared to previous forensic methods, it performs better on small spliced objects because it processes the entire image in a fully convolutional manner and provides pixel-level localization.
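The per-pixel decision can be sketched from Eqs. (5)-(7). The example assumes that a trained autoregressive model has already produced, for every pixel and channel, the conditional probability of the observed value; the array `cond_probs`, the threshold, and the function names are hypothetical.

```python
import numpy as np

def pixel_nll(cond_probs):
    """Per-pixel negative log-likelihood from an (H, W, C) array.

    cond_probs[y, x, j] is the model's conditional probability of the observed
    value of channel j, given earlier pixels and earlier channels (Eq. 7).
    """
    p = np.clip(np.asarray(cond_probs, dtype=float), 1e-12, 1.0)
    return -np.log(p).sum(axis=-1)   # product over channels becomes a sum of logs

def image_log_likelihood(cond_probs):
    """log p(f): the product over all pixels in Eq. (5), in the log domain."""
    return float(-pixel_nll(cond_probs).sum())

def tamper_mask(cond_probs, threshold):
    """Pixels whose negative log-likelihood exceeds the threshold are flagged."""
    return pixel_nll(cond_probs) > threshold

# Toy example: one pixel that the model finds very unlikely.
cond_probs = np.full((2, 2, 3), 0.5)
cond_probs[0, 0] = 0.01
mask = tamper_mask(cond_probs, threshold=5.0)
```

Working in the log domain avoids numerical underflow when many small conditional probabilities are multiplied.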

Guided by this new research direction, Horváth et al. proposed another unsupervised satellite image forgery detection method (Horváth et al. 2021a) based on the Vision Transformer (Dosovitskiy et al. 2020), as shown in Fig. 12. Two different datasets, Dataset2 and Dataset3, are used to evaluate performance. The core idea is similar to that of the DBN: the Vision Transformer serves as the autoencoder that captures the distribution of pristine satellite images and can correctly reconstruct them. During the prediction phase, the image f is used as input and the reconstructed image \(f_r\) is generated by the Vision Transformer. Then, \(3\times 3\) Laplace filters are convolved with f and \(f_r\) to obtain two further images, \(f_d\) and \(f_{rd}\). The Laplace filter is adopted as an edge detector to emphasize the high-frequency components of f and \(f_r\). After calculating the differences between f and \(f_r\), and between \(f_d\) and \(f_{rd}\), a heatmap is obtained by taking the mean of these two differences.
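The heatmap computation can be sketched for a single-channel image as follows. The 4-neighbour Laplace kernel, the use of absolute differences, and the edge-replicated borders are assumptions, since the text does not specify them; the reconstruction is again a stand-in array.

```python
import numpy as np

# Common 4-neighbour Laplace kernel (assumed; the exact kernel is not given).
LAPLACE = np.array([[0.0, 1.0, 0.0],
                    [1.0, -4.0, 1.0],
                    [0.0, 1.0, 0.0]])

def filter3x3(img, kernel):
    """Same-size 3x3 filtering of a 2-D array with edge-replicated borders."""
    h, w = img.shape
    padded = np.pad(img.astype(float), 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dr in range(3):
        for dc in range(3):
            out += kernel[dr, dc] * padded[dr:dr + h, dc:dc + w]
    return out

def laplace_heatmap(f, f_r):
    """Mean of |f - f_r| and |f_d - f_rd|, where f_d and f_rd are Laplace-filtered."""
    f = np.asarray(f, dtype=float)
    f_r = np.asarray(f_r, dtype=float)
    f_d, f_rd = filter3x3(f, LAPLACE), filter3x3(f_r, LAPLACE)
    return 0.5 * (np.abs(f - f_r) + np.abs(f_d - f_rd))

# Toy example: a single bright pixel that the reconstruction misses.
f = np.zeros((5, 5))
f[2, 2] = 1.0
heat = laplace_heatmap(f, np.zeros((5, 5)))
```

The filtered difference amplifies sharp, localized discrepancies, which is consistent with the stated purpose of emphasizing high-frequency components.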

Next, the heatmap is first thresholded and then passed through a specially designed post-processing stage to obtain a soft forgery mask. The post-processing is composed of several ErodeIsolated operations. ErodeIsolated is a new type of morphological filter that can remove smaller objects from an image without affecting the larger ones. A part of the post-processing is defined as:

$$\begin{aligned} p(f_{EI,a,b}|f_{<i}) = ErodeIsolated(a,b) \end{aligned}$$
(8)
Fig. 13

Two examples of ErodeIsolated proposed in method (Horváth et al. 2021a)

Two examples are depicted in Fig. 13, where \(\wedge\) is a logical AND operation and \(\vee\) is a logical OR operation. The structural element of ErodeIsolated is a square with sides of length \(2b + 1\), which contains an inner square with sides of length \(2a + 1\). In the structural element, the values in the inner square take 1, and the remaining values take 0. The post-processing composed of ErodeIsolated operations can reduce the false positives caused by noise, resulting in more accurate localization. This method is by far the best method for unsupervised detection.

4.2 Segmentation model-based methods

The semantic segmentation of images has undergone a long period of development and is gradually maturing (Mo et al. 2022). In general, an end-to-end segmentation network uses its internal structure to reduce the dimensions of the original image and capture features (Long et al. 2015). Subsequently, the smaller feature maps are gradually restored to a prediction image with the same resolution as the input image. Each pixel in the prediction image contains information about the classification of that pixel. Thus, when segmentation models are used for satellite image tampering localization, it is only necessary to simplify the multi-class task of semantic segmentation into a binary classification task, that is, to classify each pixel as tampered or not, so that the final predicted image is a forgery mask. Segmentation model-based methods achieve prediction at the pixel level using supervised training. Here, some typical segmentation model-based methods for satellite images are summarized.

Fig. 14

Nested attention U-Net architecture used in method (Horváth et al. 2021b)

Fig. 15

Network architecture of HRFNet proposed in method (Niloy et al. 2023)

Fig. 16

The mixture model of the method (Horváth et al. 2022)

In this subcategory of research, Horváth et al. designed the first segmentation model-based approach to detect counterfeit objects of satellite images by introducing a variant of the U-Net architecture, namely Nested Attention U-Net (NAU-N) (Horváth et al. 2021b), shown in Fig. 14. In this work, objects selected from GAN-generated satellite images are spliced into the original satellite images to create three evaluated datasets, Dataset4-6.

Several versions of U-Net, such as Attention U-Net (Oktay et al. 2018) and Nested U-Net (Zhou et al. 2018), have been developed for image segmentation. Inspired by them, the structure of NAU-N is designed by combining U-Net with attention gates and dense skip connections. These components reduce semantic differences between feature maps of different scales and highlight important features while suppressing irrelevant information. The NAU-N is trained on the three datasets in a supervised learning mode and outputs a heatmap with a real number at each pixel position. In this heatmap, a large value at a certain position suggests that a GAN-generated spliced object may be present there. Thresholding the heatmap then yields the locations of GAN-generated objects within the satellite image. The experimental results of this work suggest that using semantic segmentation methods for satellite image forgery localization is feasible. The generalization performance of NAU-N in mismatched scenarios is, however, highly variable. Nevertheless, NAU-N trained on StyleGAN2-synthesized images can accurately detect whether a satellite image contains ProGAN- or CycleGAN-generated objects.

Soon afterward, a new strategy (Niloy et al. 2023) was developed specifically for tampering localization in high-resolution satellite images with strong semantic segmentation characteristics. In this work, a new model, namely “HRFNet”, is designed, and its core idea is to transform satellite image tampering detection into a pixel-level binary segmentation problem. The HRFNet network consists of the RGB branch and the SRM branch, represented in Fig. 15. The former distinguishes between manipulated and real areas by capturing visual inconsistencies at tampered boundaries, while the latter uses SRM filters (Fridrich and Kodovsky 2012) to analyze local noise features in the image. Both branches are constructed with shallow and deep parts, where the shallow part extracts features globally with a large receptive field, better capturing spatial information, while the deep part effectively extracts advanced semantic information. Because the deep and shallow parts are complementary, feature fusion can improve segmentation performance. After the features from the RGB and SRM branches are merged, the ASPP module (Chen et al. 2017) captures features on multiple scales using different atrous convolutions, obtaining richer contextual information. Finally, the fused features enter the decoder, which generates the final segmentation mask indicating the manipulated area. The experiments are conducted on Dataset7, and the results indicate that the design of the RGB and SRM branches can better locate tampered regions, while the design of the deep and shallow parts achieves a balance between high AUC values, memory requirements, and processing speed. This strategy is the latest development in satellite image tampering localization tasks.

4.3 Mixture model-based methods

There is only one reported work (Horváth et al. 2022), which has adopted the mixture model for locating spliced regions in satellite images. This method utilizes supervised fusion to combine the results of several unsupervised tampering localization techniques to enhance its detection accuracy and generalization ability.

Table 6 Positive and negative aspects of existing local tampering detection and localization approaches

The architectures of this work are shown in Fig. 16a, b. The model receives a mixture of the satellite image, the output of the Gated PixelCNN Ensemble method, and the result of the Vision Transformer method. Note that the Gated PixelCNN Ensemble and Vision Transformer methods are two unsupervised detection networks. Besides, MultiTransNAUNet (Multi-Transformer Nested Attention U-Net) leverages the benefits of attention gates, the nested architecture, and the Transformers’ powerful decoding capabilities to learn all available information from the original satellite image. Finally, a heatmap is output in which tampered regions receive positive scores. This method is assessed on two large datasets, Dataset3 and Dataset4. After fusing the outputs of the two other methods with the input of MultiTransNAUNet, this method achieves excellent performance. The generated mask is almost identical or very similar to the ground-truth mask. The results also demonstrate that this mixture model-based method can detect spliced regions in satellite images from other datasets (not used during training) under mild retraining conditions.

Table 7 Image-level performance comparison of various detection methods on satellite image datasets
Table 8 Pixel-level performance comparison of various localization methods on satellite image datasets

4.4 Advantage and disadvantage analysis for forensic methods of local tampering of satellite images

Here, we analyze several existing forensic methods for satellite image tampering. There are three main types: generative model-based, segmentation model-based, and mixture model-based. Among them, unsupervised approaches are more suitable for detecting unknown tampering operations, while supervised approaches often perform better but require original and annotated fake satellite images. Each method offers a workable solution to a particular challenge in tampering detection of satellite images. The positive and negative aspects of existing local tampering detection and localization approaches are listed in Table 6. From it, we can conclude that the ViT-based generative model method is by far the best method for unsupervised detection, HRFNet is the latest development in satellite image tampering localization tasks, and the integration of multiple detection methods has been verified as an effective strategy to improve detection accuracy and generalization ability.

4.5 Evaluation metrics and comparative discussion

4.5.1 Evaluation metrics

The detection of satellite image forgery, that is, deciding whether an image has been tampered with, is an image-level binary classification problem. The tampering localization of satellite images, that is, deciding whether a certain block or pixel in the satellite image has been tampered with, is a pixel-level binary classification problem. For binary classification problems, examples can be divided according to the combination of their real categories and the categories predicted by the model. This gives four cases: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). Thus, for the forensic task of counterfeit satellite images, the commonly used evaluation metrics are ROC_AUC, P/R_AUC, F1-score, and Jaccard index.

ROC_AUC: True Positive Rate (TPR) and False Positive Rate (FPR) are first calculated as \(TPR=\frac{TP}{TP+FN}, \hspace{5.0pt}FPR=\frac{FP}{TN+FP}\). Then, the ROC curve is drawn with FPR and TPR as horizontal and vertical axes, respectively. Finally, ROC_AUC is computed as the area under the curve. The closer it gets to 1, the better (Zhou 2021).

$$\begin{aligned} ROC\_AUC = \int _{0}^{1}\textrm{TPR}\,\textrm{d}(\textrm{FPR}) \end{aligned}$$
(9)
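As an illustration, the area under the ROC curve can be computed by sweeping a decision threshold over the scores and applying the trapezoidal rule. The sketch uses the standard definitions \(TPR = TP/(TP+FN)\) and \(FPR = FP/(FP+TN)\); the function name and the toy data are hypothetical, and both classes must be present in `labels`.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve via a threshold sweep and the trapezoidal rule."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    # Start above every score (nothing predicted positive), then lower the threshold.
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    tpr, fpr = [], []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & labels)
        fn = np.sum(~pred & labels)
        fp = np.sum(pred & ~labels)
        tn = np.sum(~pred & ~labels)
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    tpr, fpr = np.array(tpr), np.array(fpr)
    # Trapezoidal integration of TPR over FPR.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example with four samples (1 = tampered, 0 = pristine).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The same routine applies at the image level (one score per image) and at the pixel level (one score per pixel, flattened).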

P/R_AUC: Precision and Recall are defined as \(Precision=\frac{TP}{TP+FP}, \hspace{5.0pt}Recall=\frac{TP}{TP+FN}\). A P/R curve is plotted with Recall as the horizontal axis and Precision as the vertical axis. P/R_AUC is the area under the P/R curve, which reflects the overall performance of the model in terms of precision and recall; the closer it is to 1, the better. Generally, ROC_AUC takes both positive and negative samples into account, while P/R_AUC focuses more on positive samples. If the numbers of positive and negative samples are relatively balanced, ROC_AUC can be used, while P/R_AUC is preferable when the classes are extremely imbalanced.

$$\begin{aligned} P/R\_AUC = \int _{0}^{1}\textrm{Precision}\,\textrm{d}(\textrm{Recall}) \end{aligned}$$
(10)
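A comparable sketch for the area under the P/R curve is given below. Several conventions exist for this quantity; the example assumes trapezoidal integration of Precision over Recall and anchors the curve at the point (Recall = 0, Precision = 1). Both choices, as well as the function name, are assumptions, and at least one positive sample is required.

```python
import numpy as np

def pr_auc(labels, scores):
    """Area under the precision-recall curve using the trapezoidal rule."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    precision, recall = [1.0], [0.0]          # assumed anchor point of the curve
    for t in np.unique(scores)[::-1]:         # thresholds from high to low
        pred = scores >= t
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        precision.append(tp / (tp + fp))      # tp + fp >= 1 since t is an observed score
        recall.append(tp / (tp + fn))
    precision, recall = np.array(precision), np.array(recall)
    return float(np.sum(np.diff(recall) * (precision[1:] + precision[:-1]) / 2.0))

# Toy example with four samples (1 = tampered, 0 = pristine).
area = pr_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because true negatives never enter the computation, this value is less affected by a large pristine background than the ROC-based score.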

F1-score: the definition of F1-score (Dice Lee 1945) is described in Eq. (11). The range of F1-score is from 0 to 1. A high F1-score indicates that the model has fewer false positives and fewer false negatives on the data.

$$\begin{aligned} F1-score=\frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$
(11)

Jaccard index (JI): the Jaccard index represents the similarity between the predicted mask generated for localization purposes and the ground-truth (Levandowsky and Winter 1971), expressed as.

$$\begin{aligned} Jaccard \hspace{5.0pt}index(JI)=\frac{TP}{TP+FP+FN} \end{aligned}$$
(12)
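For binary masks, both the F1-score and the Jaccard index follow directly from the confusion counts. The sketch below uses the standard form \(F1 = 2TP/(2TP+FP+FN)\), which is equivalent to the harmonic mean of Precision and Recall; the function names and the toy masks are hypothetical.

```python
import numpy as np

def confusion_counts(pred_mask, true_mask):
    """Return (TP, FP, TN, FN) for two binary masks of equal shape."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    tp = int(np.sum(pred & true))
    fp = int(np.sum(pred & ~true))
    tn = int(np.sum(~pred & ~true))
    fn = int(np.sum(~pred & true))
    return tp, fp, tn, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; defined as 0 when TP = 0."""
    return 2.0 * tp / (2.0 * tp + fp + fn) if tp > 0 else 0.0

def jaccard_index(tp, fp, fn):
    """Intersection over union of predicted and ground-truth tampered pixels."""
    return tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0

# Toy example: the prediction overlaps the ground truth in one of three marked pixels.
pred = np.array([[1, 1], [0, 0]])
true = np.array([[1, 0], [1, 0]])
tp, fp, tn, fn = confusion_counts(pred, true)
f1 = f1_score(tp, fp, fn)
ji = jaccard_index(tp, fp, fn)
```

The two metrics are monotonically related through \(F1 = 2\,JI/(1+JI)\), so they rank localization masks in the same order on a single image.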
Fig. 17

Localization results in Dataset2 among GAN+SVM (Yarlagadda et al. 2018), SatSVDD (Horváth et al. 2019) and UU-DBN (Horváth et al. 2020)

Fig. 18

Localization results in Dataset3 and Dataset4 among NAU-N (Horváth et al. 2021b) and Sat U-net (Horváth et al. 2022)

Fig. 19

Localization results in Dataset2 and Dataset3 among Gated PixelCNN Ensemble (Montserrat et al. 2020) and ViT (Horváth et al. 2021a)

4.5.2 Objective results

Then, we compare the performance of various forensic methods in Dataset1-Dataset7 with the evaluation metrics mentioned above, including image-level detection methods [GAN+SVM (Yarlagadda et al. 2018), cGANs (Bartusiak et al. 2019), SatSVDD (Horváth et al. 2019), UU-DBN (Horváth et al. 2020), NAU-N (Horváth et al. 2021b)], and pixel-level localization methods [GAN+SVM (Yarlagadda et al. 2018), cGANs (Bartusiak et al. 2019), SatSVDD (Horváth et al. 2019), UU-DBN (Horváth et al. 2020), Gated PixelCNN Ensemble (Montserrat et al. 2020), ViT (Horváth et al. 2021a), NAU-N (Horváth et al. 2021b), Sat U-net (Horváth et al. 2022), HRFNet (Niloy et al. 2023)].

The comparison results of the detection and localization performance of these methods on satellite image datasets are reported in Table 7 and Table 8. Among all satellite image datasets, only Dataset1 and Dataset2 contain spliced objects of different sizes. Therefore, in Tables 7 and 8, some metrics include subscripts, which indicate the performance of the method on images containing spliced objects of a certain size. For example, ROC_AUC\(_{32}\) represents the ROC_AUC value of the forensic method on images containing spliced objects of \(32 \times 32\) pixels.

In general, models with good localization performance also have better detection performance in local manipulation detection and localization tasks for satellite image forgery. Therefore, we pay more attention to the localization performance of these models. On Dataset1, cGANs achieved the best results due to their strong ability to capture satellite image distributions and the fact that supervised methods often achieve higher performance. Among unsupervised methods, SatSVDD achieved good results because the joint training of the autoencoder and SVDD can extract more discriminative features, and the proposed detection score, which considers the mask attributes of manipulated images, can better detect tampering. Although GAN+SVM performs the worst, it was the first method for satellite image forensics and pioneered the use of generative models to detect satellite image manipulation. Compared to Dataset1, Dataset2 has higher resolution and richer spliced objects, and many works have been evaluated on Dataset2. For Dataset2, the best-performing method is ViT, due to the excellent global modeling capability of the vision transformer and specially designed post-processing modules that can better identify “inconsistencies” in manipulated images. Gated PixelCNN Ensemble ranks second, and its fully convolutional, pixel-by-pixel processing can effectively recognize small spliced objects. UU-DBN performs the worst, but it is the first to describe the systematic application of the “reconstruction+error” method in satellite image forensics, so it still has a high reference value. Dataset3 contains a more diverse range of spliced objects, with each image containing different spliced objects. Currently, Sat U-net has achieved the best results on Dataset3; it integrates two localization methods and uses a carefully designed network structure that incorporates multi-scale global information to accurately locate tampered areas. NAU-N ranks second but still outperforms the unsupervised methods. For Dataset4-6, the spliced objects contained in the manipulated images are generated by GANs. On all three datasets, supervised methods are far superior to unsupervised methods. NAU-N achieved good performance because the attention gates in the network direct its focus to the tampered area. On Dataset7, HRFNet performed best owing to the ASPP module, which extracts multi-scale information, and to supervised training.

4.5.3 Subjective results

In addition to quantitative comparisons, we also compare some subjective localization results on Dataset2, Dataset3, and Dataset4. The results are displayed in Figs. 17, 18, and 19. From Fig. 17, we can see that UU-DBN achieved the best localization results because the DBN can fully capture the distribution of the original image, thus accurately identifying non-original regions. GAN+SVM and SatSVDD are prone to misidentifying original areas as tampered. In Fig. 18, the mask generated by Sat U-Net is very similar to the real forgery mask because the integrated method strengthens the attention paid to the tampered area. Lacking sufficient global information to learn from, NAU-N generates a mask with internal holes. In Fig. 19, ViT achieved better localization results, especially on Dataset2, indicating that the unsupervised method can already obtain results close to the ground truth when detecting spliced objects from cameras in satellite images. However, due to the limitations of full convolution, the mask generated by the Gated PixelCNN Ensemble covers only a portion of the ground truth.

From the above quantitative and subjective analysis, we can conclude that the “reconstruction+error” method based on generative models has achieved excellent performance in detecting local manipulation, making this unsupervised approach very promising. Global information in satellite images helps to fully locate spliced objects, often leading to better results.

5 Challenges and future research

So far, some forensic methods for satellite image forgery have appeared, but they mainly focus on patch-level detection or the detection of larger target objects. Whether in global or local tampering detection, supervised methods often achieve higher localization performance. However, they may struggle to generalize effectively when faced with tampering operations that did not appear in the training data. Therefore, unsupervised strategies that do not rely on tampered satellite images during training are preferred. It is projected that the fusion of supervised and unsupervised approaches will become a popular trend in future research. Besides, most existing approaches are tailored to RGB satellite images, so more work is needed on the forensic analysis of multi-spectral satellite images.

In addition, as the representation ability of deep learning advances, novel deep or large-model generation techniques can weaken the negative effects of high-frequency components and the boundary effects between object and background. They can also adopt end-to-end training to extract features and search for optimal parameters, and thus produce more convincing forgeries. In particular, the global and local generation of satellite images has achieved unprecedented visual quality in recent years. Practical and effective detection of such deeply generated satellite images is therefore a new direction for future research in detection and localization. The following four issues merit further research.

5.1 Deeply mining the subtle relationship between satellite image tampering and deep generation models

Although generation models for satellite images can produce good global and local tampering results, each model is a complex processing system of multi-level joint convolutional filtering, which inevitably leaves traces attributable to the source generator. In addition, satellite images are characterized by various band types, complex structures, and sensor information. The detection and localization of satellite image forgery must therefore mine the subtle relationship between satellite image manipulation and deep generation models. This makes the task more challenging, and research on satellite image manipulation detection is still scarce.

5.2 Advanced explainable AI applications

With the widespread application of machine learning and deep learning in image forensics, model explainability has become particularly important. Although machine learning and deep learning models have improved the efficiency and accuracy of tamper detection and localization, they have also brought about the problem of the "black box" model. That is, the decision-making process of the model is not transparent enough, and it is difficult for humans to understand how the model obtains detection results. The emergence of eXplainable AI (XAI) has alleviated this problem, making the decision-making process of the model more transparent and easy to understand.

Popular XAI tools include LRP (Bach et al. 2015), Grad-CAM (Selvaraju et al. 2017), RISE (Petsiuk et al. 2018), SHAP (Lundberg and Lee 2017), LIME (Ribeiro et al. 2016), and SOBOL (Fel et al. 2021). In particular, Ying et al. (2022) used the heat maps generated by Deletion (Samek et al. 2016) to visualize what the network attends to, and Uniform Manifold Approximation and Projection (UMAP) (McInnes et al. 2018) to examine the topology of the learned features, in order to design a model with complementary properties. In Silva et al. (2022) and Mathews et al. (2023), Grad-CAM was used to provide visual interpretations of deep learning models by highlighting the image regions that the model considers important when judging the authenticity of an image. Abir et al. (2023) used the LIME algorithm to explain how the model classifies real and fake images. Tsigos et al. (2024) critically analyzed the limitations of current XAI tools and devised an efficient, in-depth evaluation method specifically tailored to deepfake detection models.

XAI’s ability to highlight the pixels that influence a decision is also critical for designing and interpreting efficient models for satellite image forensics. General-purpose XAI tools may not meet the requirements of satellite image forensics tasks, so there is an urgent need to develop XAI methods dedicated to satellite image forensics.
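To make concrete what “highlighting the pixels that influence a decision” means, the following minimal sketch computes an occlusion-sensitivity map. This is a simple perturbation-based technique in the same spirit as the tools listed above, not an implementation of any of them. The function `occlusion_saliency`, the patch size, and the toy scoring function are illustrative assumptions.

```python
import numpy as np

def occlusion_saliency(score_fn, image, patch=8, fill=0.0):
    """Occlusion sensitivity: how much does the score drop when a patch is hidden?

    `score_fn` maps an image to a scalar (e.g. a detector's "fake" score).
    Larger saliency values mark regions the score depends on more strongly.
    """
    base = score_fn(image)
    h, w = image.shape[:2]
    saliency = np.zeros((h, w), dtype=np.float64)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill  # hide one patch
            saliency[y:y + patch, x:x + patch] = base - score_fn(occluded)
    return saliency

# Toy usage: a hypothetical stand-in "detector" whose score depends only on
# one region, so the saliency map should be non-zero only in that region.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
toy_score = lambda img: float(img[8:16, 8:16].mean())
saliency = occlusion_saliency(toy_score, image)
```

For multi-band satellite images, the same loop could in principle occlude individual spectral bands instead of spatial patches, which is one example of the task-specific adaptation that general-purpose tools do not provide out of the box.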

5.3 Robustness evaluation for various types of post-processing

The robustness of a forensic method measures whether it remains effective in the face of distortion attacks. Various post-processing operations, such as image resizing, JPEG compression, Gaussian noise addition, and Gaussian blurring, are often used to test the robustness of satellite image forensics methods. However, such robustness evaluation has not yet been widely adopted in the detection and localization of local tampering. In the future, more types of post-processing operations should be used to verify the robustness of satellite image tampering localization methods.
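A robustness evaluation of this kind can be sketched as follows. The sketch applies several post-processing operations to a tampered image and reports the pixel-level F1 score of a localizer under each one. All function names, the parameter values, and the threshold-based toy localizer are assumptions made for illustration. JPEG compression is omitted because it would require an image library, and the resize step uses plain nearest-neighbour resampling.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D image with edge padding."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def resize_down_up(img, factor):
    """Nearest-neighbour downsampling followed by upsampling to the original size."""
    small = img[::factor, ::factor]
    up = np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
    return up[:img.shape[0], :img.shape[1]]

def f1_score(pred, truth):
    """Pixel-level F1 between a predicted and a ground-truth boolean mask."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 1.0

def evaluate_robustness(localizer, image, truth, distortions):
    """Run the localizer on each distorted copy and collect F1 scores."""
    return {name: f1_score(localizer(distort(image)), truth)
            for name, distort in distortions.items()}

# Toy usage: the "tampered" region is a bright square and the stand-in
# localizer is a simple intensity threshold (both purely illustrative).
rng = np.random.default_rng(0)
truth = np.zeros((64, 64), dtype=bool)
truth[16:48, 16:48] = True
image = truth.astype(np.float64)
toy_localizer = lambda img: img > 0.5
distortions = {
    "none": lambda img: img,
    "noise": lambda img: add_gaussian_noise(img, 0.05, rng),
    "blur": lambda img: gaussian_blur(img, 1.0),
    "resize": lambda img: resize_down_up(img, 2),
}
scores = evaluate_robustness(toy_localizer, image, truth, distortions)
```

Reporting one score per distortion, rather than a single aggregate, makes it visible which post-processing operation a localization method is most sensitive to.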

5.4 The construction of diversified falsified datasets

Satellite image tampering techniques used in existing studies include global tampering and local tampering. Global tampering uses GANs to generate fake images, whereas local tampering is limited to splicing. The constructed datasets therefore lack diversity. Copy-move is another very important type of tampering that has been widely studied in the image forensics community, and many methods have achieved good results on it (Verma and Singh 2024). In particular, Lee et al. (2022) propose a copy-move tampering detection method based on rotation-invariant wavelet features and convolutional neural networks, which localizes the tampered area efficiently and accurately. Diwan et al. (2023) propose a copy-move forgery detection method based on the SuperPoint keypoint architecture, which can effectively identify and locate areas tampered with by various image processing techniques, with high accuracy and real-time detection capability.

Future research in satellite image forensics should consider constructing diversified falsified datasets generated by different tampering techniques, for example by using more types of GAN methods to generate globally fake satellite images and more types of local tampering, such as copy-move and inpainting, to create locally forged satellite images. Such datasets would help the satellite image forensics community flourish.
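As a minimal illustration of how a copy-move sample with its ground-truth mask could be generated for such a dataset, consider the sketch below. The function name and its parameters are hypothetical. A realistic pipeline would additionally blend the boundary, apply geometric transformations to the copied block, and vary its shape.

```python
import numpy as np

def make_copy_move_forgery(image, src_yx, dst_yx, size):
    """Copy a size x size block from `src_yx` to `dst_yx`.

    Works for (H, W) or (H, W, bands) arrays, so multi-spectral images are
    handled the same way as RGB. Returns the forged image and a boolean mask
    marking the pasted (tampered) region.
    """
    (sy, sx), (dy, dx) = src_yx, dst_yx
    h, w = image.shape[:2]
    for y, x in ((sy, sx), (dy, dx)):
        if not (0 <= y <= h - size and 0 <= x <= w - size):
            raise ValueError("block does not fit inside the image")
    forged = image.copy()
    forged[dy:dy + size, dx:dx + size] = image[sy:sy + size, sx:sx + size]
    mask = np.zeros((h, w), dtype=bool)
    mask[dy:dy + size, dx:dx + size] = True
    return forged, mask

# Toy usage on a random 4-band "satellite image".
rng = np.random.default_rng(0)
image = rng.random((64, 64, 4))
forged, mask = make_copy_move_forgery(image, src_yx=(4, 4), dst_yx=(40, 40), size=16)
```

Storing the mask alongside each forged image provides the pixel-level ground truth needed to evaluate localization methods.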

6 Conclusion

Due to the constant advancement of technology, the financial cost and technical difficulty of falsifying satellite images are continually declining, which makes it increasingly challenging to verify their authenticity and integrity. Satellite image forensics has emerged as a necessary tool for analyzing such imagery. This paper introduces the concept of satellite images, the definition of the satellite image forgery detection problem, and commonly used datasets. We also investigate two types of tampering patterns, global and local forgery, and discuss and analyze the detection and localization methods for each separately.

Regarding the detection of global forgery of satellite images, both hand-crafted features and deep learning methods can be used to distinguish real from fake satellite images. We believe that the combination of CNNs and Transformers is a promising approach. The one-class classifier method has a strong generalization capability, similar to the “reconstruction+error” method based on generative models. We summarize the highlights of all methods.

For the localization of local tampering of satellite images, we discuss the methods in three categories: generative model-based, segmentation model-based, and mixture model-based. We further analyze their principles, benchmark datasets, and evaluation metrics. Finally, we point out the strengths and weaknesses of these approaches and assess their performance on different datasets.