Images and videos are by now a dominant part of the information flowing on the Internet and the preferred communication means of younger generations. Besides conveying information, they elicit emotional responses, far stronger than text does. It is probably for these reasons that the advent of AI-powered deepfakes, realistic and relatively easy to generate, has raised great concern among governments and ordinary people alike. Yet, the art of image and video manipulation is much older than that. Armed with conventional media editing tools, a skilled attacker can create highly realistic fake images and videos, so-called “cheapfakes”. Despite the nickname, however, there is nothing cheap about cheapfakes: they require significant domain knowledge to craft, and their detection can be much more challenging than the detection of AI-generated material. In this chapter, we focus on the detection of cheapfakes, that is, image and video manipulations carried out with conventional means. However, we skip conventional detection methods, already thoroughly described in many reviews, and focus on the data-driven deep learning-based methods proposed in recent years. We look at manipulation detection methods from various perspectives. First, we analyze in detail the forensic traces they rely on, and then the main architectural solutions proposed for detection and localization, together with the associated training strategies. Finally, major challenges and future directions are highlighted.

1 Introduction

In the last few years, multimedia forensics has been drawing ever-increasing attention in the scientific community and beyond. Indeed, with the editing software tools available nowadays, modifying images or videos has become extremely easy. In the wrong hands, and used for the wrong purposes, this capability may pose major threats to both individuals and society as a whole. In this chapter, we focus on conventional manipulations, also known as cheapfakes, in contrast to deepfakes, which rely on deep learning tools for their generation (Paris and Donovan 2019). Generating a cheapfake does not require artificial-intelligence-based technology; nonetheless, a skilled attacker can craft highly realistic fakes, which can have disruptive consequences. In Fig. 11.1, we show some examples of very common manipulations: inserting an object copied from a different image (splicing), replicating an object coming from the same image (copy-move), or removing an object by extending the background (inpainting). Usually, to better fit the object into the scene, suitable post-processing operations are also applied, like resizing, rotation, boundary smoothing, or color adjustment.

Fig. 11.1

Examples of image manipulations carried out using conventional media editing tools. From left to right: splicing (alien material inserted in the image), copy-move (an object has been cloned), inpainting (objects removed using background patches)

A large number of methods have been proposed for forgery detection and localization, both for images and videos (Farid 2016; Korus 2017; Verdoliva 2020). Here, we focus on methods that verify the digital (as opposed to physical or semantic) integrity of the media asset, namely, discover the occurrence of a manipulation by detecting the pixel-level inconsistencies it caused. It is worth emphasizing that even well-crafted manipulations, which leave no visible artifacts on the image, always modify its statistics, leaving traces that can be exploited by pixel-level analysis tools. In fact, the image formation process inside a camera comprises a number of operations, both hardware and software, specific to each camera, which leave distinctive marks on each acquired image (Fig. 11.2). For example, camera models use widely different demosaicing algorithms, as well as different quantization tables for JPEG compression. Their traces allow one to identify the camera model that acquired a given image. So, when parts of different images are spliced together, a spatial anomaly in these traces can be detected, revealing the manipulation. Likewise, the post-camera editing process may introduce specific artifacts of its own, as well as disrupt camera-specific patterns. Therefore, by emphasizing possible deviations from the expected behavior, one can establish with good confidence whether the digital integrity has been violated.

Fig. 11.2

In-camera processing pipeline. An image is captured using a complex acquisition system. The optical filter reduces undesired light components and the lenses focus the light on the sensor, where the red-green-blue (RGB) components are extracted by means of a color filter array (CFA). Then, a sequence of internal processing steps follows, including lens distortion correction, enhancement, demosaicing, and finally compression

In the multimedia forensics literature, there has been intense work on detecting any possible type of manipulation, global or local, irrespective of its purpose. Several methods detect a large array of image manipulations, like resampling, median filtering, and contrast enhancement. However, these operations may also be carried out simply to improve the image appearance or for other legitimate purposes, with no malicious intent. For this reason, in this chapter, we will focus on detecting local, semantically relevant manipulations, like copy-moves, inpainting, and compositions, which can modify the meaning of the image.

Early digital integrity methods followed a model-based approach, relying on mathematical or statistical models of data and attacks. More recent methods, instead, are for the most part data-driven, based on deep learning: they exploit the availability of large amounts of data to learn how to detect specific or generic forensic clues. These can be further classified based on whether their output is an image-level detection score, indicating the probability that the image has been manipulated, or a pixel-level localization mask, providing a score for each pixel. In addition, we will distinguish methods based on the training strategy they apply:

  • two-class learning, where training is carried out on datasets of labeled pristine and manipulated data;

  • one-class learning, where only real data are used during training and manipulations are regarded as anomalies.

In the rest of the chapter, we first analyze in detail the forensics clues exploited in the various approaches, then review the main architectural solutions proposed for detection and localization, together with their training strategies and the most popular datasets proposed in this field. Finally, we draw conclusions and highlight major challenges and future directions.

2 Forensics Clues

Learning-based detection methods are typically designed around the specific forensic artifacts they look for. For example, some proposed convolutional neural networks (CNNs) base their decision on low-level features (e.g., camera-based), while others focus on higher-level features (e.g., edge boundaries in a composition), and still others on features of both types. Therefore, this section reviews the most interesting clues that can be used for digital integrity verification.

2.1 Camera-Based Artifacts

One powerful approach is to exploit the many artifacts that are introduced during the image acquisition process (Fig. 11.2). Such artifacts are specific to the individual camera (the device) or to the camera model, and therefore represent a form of signature superimposed on the image, which can be exploited for forensic purposes. As an example, if part of the image is replaced with content taken from another source, the added material will lack the original camera artifacts, allowing detection of the attack. Some of the most relevant artifacts are related to the sensor, the lens, the color filter array, the demosaicing algorithm, and the JPEG quantization tables. In all cases, since the artifacts of interest are weak, the image content (the scene) acts as a strong, undesired disturbance. Therefore, it is customary to work on so-called noise residuals, obtained by suppressing the high-level content. To this end, one can estimate the “true” image by means of a denoising algorithm and subtract it from the original image, or apply high-pass filters in the spatial or transform domain (Lyu and Farid 2005; He et al. 2012).
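As a minimal sketch of residual extraction, assuming a single-channel image, the two routes can be written as follows; the median filter and the fixed second-order kernel are simple stand-ins for the more sophisticated choices made in the cited works:

```python
import numpy as np
from scipy.ndimage import convolve, median_filter

def residual_denoise(img: np.ndarray) -> np.ndarray:
    """Noise residual as the difference between the image and a
    denoised estimate of the scene (here, a simple median filter)."""
    img = img.astype(np.float64)
    return img - median_filter(img, size=3)

def residual_highpass(img: np.ndarray) -> np.ndarray:
    """Noise residual via a fixed second-order high-pass kernel,
    similar in spirit to the linear filters of the rich models."""
    kernel = np.array([[-1,  2, -1],
                       [ 2, -4,  2],
                       [-1,  2, -1]], dtype=np.float64) / 4.0
    return convolve(img.astype(np.float64), kernel, mode="reflect")
```

In both cases, the scene content is largely removed and the weak acquisition artifacts become accessible to subsequent analysis.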

Fig. 11.3

Training a noiseprint extraction network. Pairs of patches feed the two branches of the Siamese network, which extract the corresponding noiseprint patches. The output distance must be minimized for patches coming from the same camera model and spatial location, and maximized for all other pairs

A powerful camera-related artifact is the photo-response non-uniformity noise (PRNU), successfully used not only for forgery detection but also for source identification (Chen et al. 2008). The PRNU pattern is due to unavoidable imperfections in the sensor manufacturing process. It is unique to each individual camera, stable over time, and present in all acquired images; it can therefore be regarded as a device fingerprint. Using such a fingerprint, manipulations of any type can be detected. In fact, whenever the image is modified, the PRNU pattern is corrupted or even fully removed, a strong clue of a possible manipulation (Lukáš et al. 2006). The main drawbacks of PRNU-based methods are

  (i) the need for images taken by the same camera under test, in order to obtain a good estimate of the reference pattern, and

  (ii) the need for a PRNU pattern spatially aligned with the image under test.
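The classic test can be sketched as a block-wise normalized correlation between the noise residual of the test image and the camera fingerprint modulated by the image itself (the PRNU being multiplicative). The sketch below is deliberately simplified, with illustrative names and block size; practical detectors rely on more robust statistics, such as the peak-to-correlation energy:

```python
import numpy as np

def prnu_anomaly_map(residual, fingerprint, img, block=64):
    """Block-wise correlation between the test residual and the expected
    PRNU contribution. Blocks with correlation near zero are candidate
    forgeries, since tampering removes the sensor pattern."""
    expected = fingerprint * img.astype(np.float64)  # PRNU is multiplicative
    h, w = residual.shape
    heat = np.zeros((h // block, w // block))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            r = residual[i*block:(i+1)*block, j*block:(j+1)*block].ravel()
            e = expected[i*block:(i+1)*block, j*block:(j+1)*block].ravel()
            r = r - r.mean()
            e = e - e.mean()
            heat[i, j] = (r @ e) / (np.linalg.norm(r) * np.linalg.norm(e) + 1e-12)
    return heat
```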

In Cozzolino and Verdoliva (2020), a learning-based strategy is proposed to extract a stronger camera fingerprint, the image noiseprint. Like the PRNU pattern, it behaves as a fingerprint embedded in all acquired images. Unlike the PRNU, however, it gathers the contributions of all camera-model artifacts in the digital history of an image. Noiseprints are extracted by means of a CNN, trained to capture and emphasize all camera-model artifacts while suppressing the high-level semantic content. To this end, the CNN is trained in a Siamese configuration, with two replicas of the network, identical in architecture and weights, on two parallel branches. The network is fed with pairs of patches drawn from a large number of pristine images of various camera models. When the patches on the two branches are aligned (same camera model and same spatial position, see Fig. 11.3), they can be expected to contain much the same artifacts. Therefore, the output of each branch can be used as a reference for the input of the other one, by-passing the need for clean examples. Eventually, the network learns to extract the desired noiseprint, and can be used without further supervision on images captured by any camera model, both inside and outside the training set.
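The pairwise objective can be sketched with a standard contrastive loss, used here as a stand-in for the distance-based loss of the original paper; out_a and out_b denote the outputs of the two branches, and same flags pairs with matching camera model and spatial position:

```python
import torch
import torch.nn.functional as F

def siamese_pair_loss(out_a, out_b, same, margin=1.0):
    """Contrastive loss for Siamese training: pull together the outputs
    of aligned patches (same model and position) and push all other
    pairs apart, up to a margin."""
    d = F.pairwise_distance(out_a.flatten(1), out_b.flatten(1))
    pos = same.float() * d.pow(2)
    neg = (1 - same.float()) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()
```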

Fig. 11.4

Examples of manipulated images (top) with their corresponding noiseprints (middle) and heat maps obtained by clustering the noiseprint features (bottom)

Of course, the extracted noiseprints will always contain traces of the imperfectly rejected high-level image content, which acts as noise for forensic purposes. However, by exploiting all sorts of camera-related artifacts, noiseprints enable more powerful and reliable forgery detection procedures than PRNU patterns. On the downside, the noiseprint does not allow one to distinguish between two devices of the same camera model. Therefore, it should be regarded not as a substitute for the PRNU, but rather as a complementary tool.

In Fig. 11.4, we show some examples of noiseprints extracted from different fake images, with the corresponding heat maps obtained by feature clustering. In the first case, the noiseprint clearly shows the 8 \(\times \) 8 grid of the JPEG format. Sometimes, the traces left by image manipulations on the extracted noiseprint are so strong as to allow easy localization even by direct inspection. The approach can also be easily extended to videos (Cozzolino et al. 2019), but larger datasets are necessary to make up for the reduced quality of the sources, due to heavy compression. Interestingly, noiseprints can also be used in a supervised setting (Cozzolino and Verdoliva 2018). Given a suitable training set of images, a reliable estimate of the noiseprint is built, where high-level scene leakages are mostly removed, and used in a PRNU-like fashion to discover anomalies in the image under analysis (see Fig. 11.5).

Fig. 11.5

Noiseprint in supervised modality. The localization procedure is the classic pipeline used with PRNU-based methods: noiseprints are extracted from a set of pristine images taken by the same camera as the image under analysis, and their average represents a clean reference, namely, a reliable estimate of the camera noiseprint that can be used for image forgery localization

As mentioned before, many different artifacts arise from the in-camera processing pipeline, related to the sensor, the lens, and especially the internal processing algorithms. Of course, one can also focus on just one of these artifacts to develop a manipulation detector. Indeed, this has been the dominant approach in the past, with many model-based methods proposed to exploit very specific features, leveraging deep domain knowledge. In particular, there has been intense research on the CFA-demosaicing combination. Most digital cameras use a periodic color filter array, so that each individual sensor element records light only in a certain range of wavelengths (i.e., red, green, or blue). The missing color information is then interpolated from surrounding pixels in the demosaicing process, which introduces a subtle periodic correlation pattern in all acquired images. Both the CFA configuration and the demosaicing algorithm are specific to each camera model, leading to different correlation patterns which can be exploited for forensic purposes. For example, when a region is spliced into a photo taken by another camera model, its periodic pattern will appear anomalous. Most of the methods proposed in this context are model-based (Popescu and Farid 2005; Cao and Kot 2009; Ferrara et al. 2012); recently, however, a convolutional neural network specifically tailored to the detection of mosaic artifacts has been proposed (Bammey et al. 2020). The network is trained only on authentic images and includes \(3\times 3\) convolutions with a dilation of 2, so as to examine pixels that all belong to the same color channel and detect possible inconsistencies. In fact, a change of the mosaic can lead to forged blocks being detected at incorrect positions modulo (2, 2).
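The key architectural ingredient is simply a dilated convolution, sketched below in PyTorch; the channel sizes are illustrative, not those of the cited network:

```python
import torch.nn as nn

# A 3x3 convolution with dilation 2 spans a 5x5 neighborhood but only
# touches pixels at even offsets, i.e., samples sharing the same position
# in a 2x2 CFA mosaic, which makes demosaicing inconsistencies visible.
same_mosaic_conv = nn.Conv2d(in_channels=3, out_channels=16,
                             kernel_size=3, dilation=2, padding=2)
```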

2.2 JPEG Artifacts

Many algorithms in image forensics exploit traces left by the compression process. Some methods look for anomalies in the statistical distribution of the original DCT coefficients, assumed to comply with Benford's law (Fu et al. 2007). However, the most popular approach is to detect double compression traces (Bianchi and Piva 2012). In fact, when a JPEG-compressed image undergoes a local manipulation and is then compressed again, double compression (or double quantization, DQ) artifacts appear all over the image except, with very high probability, in the forged area. This phenomenon is also exploited in the well-known Error Level Analysis (ELA), widespread among practitioners for its simplicity.

Fig. 11.6

Histograms of the DCT coefficients clearly show the statistical differences between images subjected to single or double JPEG compression

The effects of double compression are especially visible in the distributions of the DCT coefficients which, after the second compression, show characteristic periodic peaks and valleys (Fig. 11.6). Based on this observation, Wang and Zhang (2016) perform splicing detection by means of a CNN trained on histograms of DCT coefficients. The histograms are computed either from single-compressed patches (forged areas) or double-compressed ones (pristine areas), with quality factors in the range 60–95. To reduce the dimensionality of the feature vector, only a small interval near the peak of each histogram is kept to represent the whole histogram. Following this seminal paper, Amerini et al. (2017) and Park et al. (2018) also use histograms of DCT coefficients as input, while Kwon et al. (2021) use one-hot coded DCT coefficients, as already proposed in Yousfi and Fridrich (2020) for steganalysis. The quantization tables read from the JPEG header also carry useful information and are used as an additional input, together with the histogram features, in Park et al. (2018) and Kwon et al. (2021). Leveraging domain-specific prior knowledge, the approach proposed in Barni et al. (2017) works on noise residuals rather than on image pixels, and uses the first layers of the CNN to extract histogram-related features. It is worth observing that methods based on deep learning provide good results even in cases where conventional methods largely fail, such as when test images are compressed with a quality factor (QF) never seen in training, or when the second QF is larger than the first one. Double compression detectors have also been proposed for H.264 video analysis, with a two-stream neural network that analyzes intra-coded and predictive frames separately (Nam et al. 2019).
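A sketch of such histogram features is given below, assuming a grayscale patch aligned with the 8\(\times \)8 JPEG grid; the number of frequencies and the histogram range are illustrative, and coefficients are scanned in row-major rather than zig-zag order for simplicity:

```python
import numpy as np
from scipy.fftpack import dct

def dct_histograms(patch, n_freq=9, rng=20):
    """Per-frequency histograms of block-DCT coefficients, restricted
    to a small interval around zero, as input features for a single-
    vs double-compression classifier."""
    h, w = (s - s % 8 for s in patch.shape)
    blocks = patch[:h, :w].reshape(h // 8, 8, w // 8, 8).transpose(0, 2, 1, 3)
    coefs = dct(dct(blocks.astype(np.float64), axis=2, norm="ortho"),
                axis=3, norm="ortho").reshape(-1, 64)
    feats = []
    for k in range(1, n_freq + 1):           # skip the DC coefficient
        c = np.round(coefs[:, k])
        hist, _ = np.histogram(c, bins=np.arange(-rng, rng + 2) - 0.5)
        feats.append(hist / max(len(c), 1))  # normalized counts
    return np.concatenate(feats)
```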

2.3 Editing Artifacts

The editing process often generates a trail of precious traces, beyond the artifacts related to re-compression. Indeed, when a new object is inserted into an image, several post-processing steps are typically required to fit it smoothly into the new context. These include geometric transformations, like rotation and scaling, contrast adjustment, and blurring to smooth the object-background boundaries. Rotation and rescaling, for example, require some form of interpolation, which in turn generates periodic artifacts. Therefore, some methods (Kirchner 2008) look for traces of resampling as a proxy for possible forgeries. In this section, we describe the most widely used and discriminative editing traces exploited by forensic algorithms.

Copy-moves

A very common manipulation consists of replicating or hiding objects. Of course, the presence of identical regions is a strong hint of forgery, but clones are often modified to disguise traces, and near-identical natural objects also exist, which complicates the forensic analysis. Studies on copy-move detection date back to 2003, with the seminal work of Fridrich (2003). Since then, a large literature has grown on this topic, proposing solutions that allow for copy-move detection even in the presence of rotation, resizing, and other geometric distortions (Christlein et al. 2012). Methods based on keypoints are very efficient, while block-based dense methods (Cozzolino et al. 2015) are more accurate and also handle removals.

A first deep learning solution was proposed in Wu et al. (2018): an end-to-end CNN that jointly optimizes the three main steps of a classic copy-move pipeline, that is, feature extraction, matching, and post-processing. To improve performance across different scenarios, multiscale feature analysis and hierarchical feature mapping are employed in Zhong and Pun (2019). Overall, deep-learning-based methods appear to outperform conventional methods on low-resolution (i.e., small-size) images and can be designed to distinguish the copy from the original region. In fact, copy-move detection methods typically generate a map where both the original object and its clone are highlighted, but do not establish which is which. The problem of source-target disambiguation is considered in Wu et al. (2018); Barni et al. (2019); Chen et al. (2020). The main idea, originally proposed in Wu et al. (2018), is to use one CNN for manipulation detection and another for similarity computation. A suitable fusion of the two outputs then helps to distinguish the source region from its copies (Fig. 11.7).

Fig. 11.7

Examples of additive copy-moves. From left to right: manipulated images and corresponding ground truths, binary localization maps obtained using a dense approach (Cozzolino et al. 2015) and those obtained by means of a deep-learning-based one (Chen et al. 2020). The latter method allows telling apart the source and copied areas

Inpainting

Inpainting is a very effective image manipulation approach that can be used to hide objects or people. The classic approaches to its detection are inspired by copy-move detection, given that widespread exemplar-based inpainting copies multiple small regions from all over the image and combines them to cover the object. Patch-based inpainting is addressed in Zhu et al. (2018) with an encoder/decoder architecture that provides a localization map as output. In Li and Huang (2019), instead, a fully convolutional network is used to localize the inpainted area in the residual domain. The CNN does not look for any specific editing artifact caused by inpainting, but is simply trained on a large number of examples of real and manipulated images, which makes detection very effective for the inpainting methods seen in training.

Splicings

Other approaches focus on the anomalies that appear at the boundaries of objects when a composition is performed, due to the inconsistencies between regions drawn from different sources. For example, Salloum et al. (2018) use a multi-task fully convolutional network comprising a branch that detects the boundaries between inserted regions and background and another that analyzes the surface of the manipulation. Both in this work and in Chen et al. (2021), training also exploits the edge map extracted from the ground truth. In other papers (Liu et al. 2018; Shi et al. 2018; Zhang et al. 2018; Phan-Xuan et al. 2019), instead, training is performed on patches that are either authentic or composed of pixels drawn from two different sources. The CNN is thereby forced to look for the boundaries of the manipulation.

Face Warping

In Wang et al. (2019), a CNN-based solution is proposed to detect artifacts introduced by a specific Photoshop tool, Face-Aware Liquify, which performs image warping of human faces. The approach is very successful for this task because the network is trained on manipulated images automatically generated by the very same tool. To deal with more challenging situations and increase robustness, data augmentation is applied by including resizing, JPEG compression, and various types of histogram editing.

3 Localization Versus Detection

Integrity verification can be carried out at two different levels: image level (detection) or pixel level (localization). In the first case, a global integrity score is provided, which indicates the likelihood that a manipulation occurred anywhere in the image. The score can be a real number, typically an estimate of the probability of manipulation, or a binary decision (yes/no). Localization, instead, builds upon the hypothesis that a detector has already established the presence of a manipulation, and tries to determine, for each individual pixel, whether it has been manipulated. Therefore, the output has the same size as the image and, again, can be real-valued, corresponding to a probability map or more generally a heat map, or binary, corresponding to a decision map (see Fig. 11.8). It is worth noting that, in practice, most localization methods make no assumption on the fake/pristine nature of the image, and hence also attempt to detect the presence of a manipulation. Many methods focus on localization, others on detection, or on both tasks. A brief summary of the main characteristics of these approaches is presented in Table 11.1. In this section, we analyze the major directions adopted in the literature to address these problems with deep-learning-based solutions.

Fig. 11.8

The output of a method for digital integrity verification can be either a (binary or continuous) global score related to the probability that the image has been tampered with, or a localization map, where the score is computed at the pixel level

3.1 Patch-Based Localization

The images to analyze are usually much larger than the input size of neural networks. To avoid resizing the input image, which may destroy precious evidence, several approaches work on small patches, with sizes typically spanning from 64\(\times \)64 to 256\(\times \)256 pixels, and analyze the whole image patch by patch. Localization can then be performed in different ways, based on the learning strategy, as described in the following.

Table 11.1 An overview of the main approaches proposed in the literature together with their main features: the type of artifact they aim to detect, the training and test input, i.e., patches, resized images or original images. In addition, we also indicate if they have been designed for localization (L), detection (D), or both

Two-class Learning

Supervised learning relies on the availability of a large dataset of both real and manipulated patches. At test time, localization is carried out by classifying each patch in a sliding-window analysis of the whole image (a sketch follows this paragraph). Most of the early methods are designed in this way. For example, to detect traces of double compression, in Park et al. (2018) a dataset of single-compressed and double-compressed JPEG blocks was generated using a very large number of quantization tables. Clearly, the more diverse and varied the dataset, the more effective the training, with a strong impact on the final performance. Beyond double compression, one may be interested in detecting traces of other types of processing, like blurring or resampling. In this case, one only has to generate a large number of patches of that kind. The main issue is the sheer variety of processing parameters, which makes it hard to cover the whole space with training examples. Patch-based learning is also used to detect object insertion in an image, based on the presence of boundary artifacts. In this case, manipulated patches include pixels coming from two different sources (the host image and the alien object), with a simple edge separating the two regions (Zhang et al. 2018; Liu et al. 2018).
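A minimal sketch of the sliding-window localization is given below; classify_patch stands for any patch-level detector returning a manipulation probability, and patch size and stride are illustrative:

```python
import numpy as np

def sliding_window_map(img, classify_patch, patch=64, stride=32):
    """Patch-wise localization: score every window with a patch-level
    detector and average overlapping scores into a heat map."""
    h, w = img.shape[:2]
    heat = np.zeros((h, w))
    count = np.zeros((h, w))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            p = classify_patch(img[y:y + patch, x:x + patch])
            heat[y:y + patch, x:x + patch] += p
            count[y:y + patch, x:x + patch] += 1
    return heat / np.maximum(count, 1)
```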

One-class learning

Several forensic methods look at the problem from an anomaly detection perspective, developing a model of the pristine data and looking for anomalies with respect to this model, which may suggest the presence of a manipulation. The idea is to learn, during training, a specific type of forensic feature whose distribution depends strongly on the source image or class of images. Then, at test time, these features are extracted from all patches of the image (with suitable stride) and clustered based on their statistics. If two distinct clusters emerge, the smaller one is regarded as anomalous, likely due to a local manipulation, and the corresponding patches provide a map of the tampered area. The most successful features used for this task are low-level, like those related to the CFA pattern (Bammey et al. 2020), compression traces (Niu et al. 2021), or all the in-camera processing operations (Cozzolino and Verdoliva 2016; Bondi et al. 2017; Cozzolino and Verdoliva 2020). For example, some features allow one to classify different camera models with high reliability. In the presence of a splicing, the features observed in the pristine and forged regions will differ markedly, indicating that different image parts were acquired by different camera models (Bondi et al. 2017; Yao et al. 2020).
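The final clustering step can be sketched as follows, with one feature vector per patch; two-class k-means stands in for the more sophisticated clustering procedures used in the cited works:

```python
import numpy as np
from sklearn.cluster import KMeans

def anomaly_mask(feats):
    """One-class localization by clustering: split per-patch forensic
    features into two clusters and flag the smaller cluster as the
    candidate manipulated region."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    minority = np.argmin(np.bincount(labels, minlength=2))
    return labels == minority   # True where the patch looks anomalous
```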

Autoencoders can be very useful to implement a one-class approach. In fact, a network trained on a large number of patches extracted from untampered images eventually learns to reproduce pristine content with good accuracy. Anomalies, on the contrary, give rise to large reconstruction errors, pointing to a possible manipulation. Noise inconsistencies are used in Cozzolino and Verdoliva (2016) to localize splicings without requiring a training step. At test time, handcrafted features are extracted from the patches of the image and an iterative algorithm based on feature labeling is used to detect the anomalous region. The same approach has been extended to videos in D’Avino et al. (2017), using an LSTM recurrent network to account for temporal dependencies.

In Bammey et al. (2020), instead, the idea is to learn locally the possible configurations of a CFA mosaic. Whenever such a pattern is not found in the sliding-window analysis at test time, an anomaly is identified, and hence a possible manipulation. A similar approach can be followed with reference to the JPEG quality factor: in this case, if an inconsistency is found, the anomalous region was compressed at a different compression level than the rest of the image, strong evidence of local modification (Niu et al. 2021).

3.2 Image-Based Localization

The majority of localization approaches rely on fully convolutional networks and leverage architectures originally developed for image segmentation, regarding the problem itself as a special form of segmentation between pristine and manipulated regions. A few methods, instead, take inspiration from object detection and make decisions on regions extracted in a preliminary region proposal phase. In the following, we review the main ideas.

Segmentation-like approach

These methods take the whole image as input and produce a localization map of the same size. Therefore, they avoid the problem of post-processing a large number of patch-level results in some sensible way. On the downside, they need a large collection of fake and real images for training, as well as the associated pixel-level ground truth, two rather challenging requirements. Actually, many methods are still trained on patches and still operate on patches at test time, but all pieces of information are processed jointly by the network to provide an image-level output, just as happens with denoising networks. For these methods, training can be performed at the patch level by considering many different types of local manipulations (Wu et al. 2019; Liu et al. 2021) or by including boundaries between pristine and forged areas (Salloum et al. 2018; Rao et al. 2021; Zhang and Ni 2020). Many other solutions, instead, use information from the whole image by resizing it to a fixed dimension (Wu et al. 2018; Bappy et al. 2019; Mazaheri et al. 2019; Bi et al. 2019, 2020; Kniaz et al. 2019; Shi et al. 2020; Chen et al. 2020; Hu et al. 2020; Islam et al. 2020; Chen et al. 2021). This approach is conceptually simple but causes a significant loss of information: reducing the image to the size of the network input destroys the fine-grained pixel-level dependencies typical of in-camera and out-camera artifacts, and local manipulations may become extremely small after resizing and be easily neglected in the overall loss. To increase the number of training samples and generalize across a large variety of possible manipulations, an adversarial strategy is also adopted in Zhou et al. (2019). The loss functions used in these works are also typically inspired by semantic segmentation, such as pixel-wise cross-entropy (Bi et al. 2019, 2020; Hu et al. 2020), weighted cross-entropy (Bappy et al. 2019; Rao et al. 2021; Salloum et al. 2018; Zhang and Ni 2020), and the Dice loss (Chen et al. 2021). Since the manipulated region is often much smaller than the whole image, Li and Huang (2019) propose to adopt the focal loss, which applies a modulating factor to the cross-entropy term and thus addresses the class imbalance problem (a sketch follows).
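A minimal sketch of the focal loss for binary pixel-wise localization, with logits and target as per-pixel maps and the customary gamma = 2:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """Focal loss: the (1 - p_t)^gamma factor down-weights easy (mostly
    pristine) pixels, mitigating the imbalance caused by small forgeries."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = torch.exp(-bce)   # probability assigned to the true class
    return ((1 - p_t) ** gamma * bce).mean()
```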

Object Detection-Like Approach

A different strategy is to identify only the bounding box that contains the manipulation, adopting approaches devised for object detection. This path was first followed in Zhou et al. (2018), relying on the Region Proposal Network (RPN) (Ren et al. 2017), very popular in object detection, here adapted to propose regions with potential manipulations using features extracted from both the RGB channels and the noise residual. The approach proposed in Yang et al. (2020), instead, adopts Mask R-CNN (He et al. 2020) and includes an attention region proposal network to identify the manipulated region.

3.3 Detection

Despite the obvious importance of detection, which should take place before localization is attempted, this specific problem has received limited attention in the literature. Many localization methods are used to perform detection through suitable post-processing of the localization heat map, aimed at extracting a global score. In some cases, the average or the maximum value of the heat map is used as the decision statistic (Huh et al. 2018; Wu et al. 2019; Rao et al. 2021). However, localization methods often perform clustering or segmentation on the target image, and therefore tend to find a forged area even in pristine images, generating a large number of false alarms.
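These simple decision statistics can be sketched as follows; the top-fraction average is not taken from the cited works, but is a common compromise between the mean (diluted by large pristine areas) and the max (sensitive to single outliers):

```python
import numpy as np

def detection_score(heatmap, top_frac=0.01):
    """Global detection score from a localization heat map: average of
    the top fraction of pixel scores (top_frac=1.0 gives the plain mean,
    a single top pixel approximates the max)."""
    k = max(1, int(top_frac * heatmap.size))
    return float(np.sort(heatmap, axis=None)[-k:].mean())
```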

Other methods propose ad hoc information fusion strategies. Specifically, in Rao and Ni (2016); Boroumand and Fridrich (2018), training is carried out at the patch level; then, statistical information is aggregated over the whole image to form a feature vector, which feeds a classifier, either a support vector machine or a neural network. Unfortunately, a patch-level analysis does not allow taking into account both local information (through textural analyses) and global information over the whole image (through contextual analyses) at the same time. That is, by focusing on a local analysis, the big picture may be lost.

Based on these considerations, a different direction is followed in Marra et al. (2019). The network takes as input the whole image, without any resizing, so as not to lose precious traces of manipulation hidden in its fine-grained structure. All patches are analyzed jointly, through a suitable aggregation, to make the final image-level decision on the presence of a manipulation. More importantly, in the training phase too, only image-level labels are used to update the network weights, and only image-level information back-propagates down to the early patch-level feature extraction layers. To this end, a gradient checkpointing technique is used, which avoids storing all intermediate activations by recomputing them during back-propagation, at the cost of a limited increase in computation. This approach allows for the joint optimization of patch-level feature extraction and image-level decision: the first layers extract features that are instrumental to the global analysis, and the last layers exploit this information to highlight local anomalies that would not emerge at the patch level. In Marra et al. (2019), this end-to-end approach is also shown to allow a good interpretation of results by means of activation maps, and a good localization of forgeries (see Fig. 11.9).
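The overall scheme can be sketched as follows; patch_net and agg_net are hypothetical stand-ins for the feature extraction and decision stages, and the tiling is simplified with respect to the original method:

```python
import torch
from torch.utils.checkpoint import checkpoint

def image_level_logit(img, patch_net, agg_net, patch=128):
    """End-to-end detection on a full-resolution image: extract features
    from every patch, aggregate them, and decide at the image level.
    Checkpointing recomputes patch activations in the backward pass
    instead of storing them all, so image-level gradients can reach the
    patch-level layers within a limited memory budget."""
    feats = []
    for y in range(0, img.shape[-2] - patch + 1, patch):
        for x in range(0, img.shape[-1] - patch + 1, patch):
            tile = img[..., y:y + patch, x:x + patch]
            feats.append(checkpoint(patch_net, tile, use_reentrant=False))
    pooled = torch.stack(feats).mean(dim=0)  # average-pooling aggregation
    return agg_net(pooled)
```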

Fig. 11.9

Example of localization analysis based on the detection approach proposed in Marra et al. (2019). From left to right: test image, corresponding activation map, and ROI-based localization result with the ground-truth box (green) and the automatic ones (magenta). Detection scores are shown in the corner of each box

4 Architectural Solutions

In this section, we describe the most popular and interesting deep-learning-based architectural solutions proposed in digital forensics.

4.1 Constrained Networks

To exploit low-level features related to camera-based artifacts, one needs to suppress the scene content. This can be done by pre-processing the input image, through a preliminary denoising step, or by means of suitable high-pass filters, like the popular spatial rich models (SRM), initially proposed in the context of steganalysis (Fridrich and Kodovsky 2012) and applied successfully in image forensics (Cozzolino et al. 2014). Several CNN architectures include this pre-processing directly in the network, by means of a constrained first layer that performs the desired high-pass filtering. For example, in Rao and Ni (2016), a set of fixed high-pass filters inspired by the spatial rich models is used to compute residuals of the input patches, while in Bayar and Stamm (2016) the high-pass filters of the first convolutional layer are learnt under the following constraints:

$$\begin{aligned} w_k \left( 0, 0 \right)&= -1 \\ \sum _{i, j} w_k \left( i, j \right)&= 0 \end{aligned}$$
(11.1)

where \(w_k \left( i, j \right) \) is the weight of the k-th filter at position \(\left( i, j \right) \), and \(\left( 0, 0 \right) \) indicates the central position. This solution is also pursued in other works, such as Wu et al. (2019); Yang et al. (2020); Chen et al. (2021), while many others use fixed filtering (Li and Huang 2019; Barni et al. 2017; Zhou et al. 2018) or adopt a combination of fixed and trainable high-pass filters (Wu et al. 2019; Zhou et al. 2018; Bi et al. 2020; Zhang and Ni 2020).
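One possible way to enforce Eq. (11.1) is sketched below, following the common practice of projecting the weights back onto the constraint set after each optimizer step; the class and method names are illustrative:

```python
import torch
import torch.nn as nn

class ConstrainedConv2d(nn.Conv2d):
    """First-layer high-pass filters with the Bayar-Stamm constraint:
    the central weight of each kernel is fixed to -1 and the remaining
    weights are rescaled to sum to 1, so that the kernel sums to 0."""

    def constrain(self):
        with torch.no_grad():
            c = self.kernel_size[0] // 2
            w = self.weight.data
            w[:, :, c, c] = 0                              # exclude the center
            w /= w.sum(dim=(2, 3), keepdim=True) + 1e-12   # off-center sum = 1
            w[:, :, c, c] = -1                             # center fixed to -1

# typical usage: call layer.constrain() after every optimizer.step()
```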

A slightly different perspective is taken in Cozzolino et al. (2017), where a CNN architecture is built to replicate exactly the co-occurrence-based methods of Fridrich and Kodovsky (2012). All the phases of this procedure, i.e., extraction of noise residuals, scalar quantization, computation of co-occurrences, and histogram, are implemented using standard convolutional or pooling layers. Once the equivalence is established, the network is modified to enable fine-tuning and further improve performance. Since the resulting network has a lightweight structure, fine-tuning can be carried out on a small training set, limiting computation time.

4.2 Two-Branch Networks

In order to exploit different types of clues at the same time, it is possible to adopt a CNN architecture that accepts multiple inputs (Zhou et al. 2018; Shi et al. 2018; Bi et al. 2020; Kwon et al. 2021). For example, a two-branch approach is proposed in Zhou et al. (2018) to take into account both low-level features extracted from noise residuals and high-level features extracted from RGB images. In Amerini et al. (2017); Kwon et al. (2021), instead, information extracted from the spatial and DCT domains is analyzed, so as to explicitly take compression artifacts into account. Two-branch solutions can also be used to produce multiple outputs, as done in Salloum et al. (2018), where the fully convolutional network has two output branches: one provides the localization map of the splicing, while the other provides a map containing only the edges of the forged region. In the context of copy-move detection, the two-branch structure is used to address separately the problems of detecting duplicates and disambiguating between the original object and its clones (Wu et al. 2018; Barni et al. 2019).

4.3 Fully Convolutional Networks

Fully convolutional networks (FCNs) can be extremely useful, since they preserve spatial dependencies and generate an output of the same size as the input image. Therefore, the network can be trained to output a heat map that directly enables tampering localization. Various architectures have proven particularly useful for forensic applications.

Some methods (Bi et al. 2019; Shi et al. 2020; Zhang and Ni 2020) rely on a plain U-Net architecture (Ronneberger et al. 2015), one of the most successful networks for semantic segmentation, initially proposed for biomedical applications. A more innovative variation of FCNs is proposed in Bappy et al. (2019); Mazaheri et al. (2019), where the architecture includes a long short-term memory (LSTM) module. The blocks of the image are processed sequentially by the LSTM. In this way, the network can model the spatial relationships between neighboring blocks, which facilitates detecting manipulated blocks as those that break the natural statistics of authentic ones. Other papers include spatial pyramid mechanisms in the architecture, in order to capture relationships at different scales of the image (Bi et al. 2020; Wu et al. 2019; Hu et al. 2020; Liu et al. 2021; Chen et al. 2021).

Both LSTM modules and spatial pyramids aim at overcoming the intrinsic limits of purely convolutional networks, where, due to the limited receptive field, the decision on a pixel depends only on the information in a small region around it. With the same aim, it is also possible to use attention mechanisms (Chen et al. 2008, 2021; Liu et al. 2021; Rao et al. 2021), as in Shi et al. (2020), where channel attention based on the Gram matrix is used to measure the correlation between any two feature maps.
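As an illustration of these multi-scale context mechanisms, a minimal PSPNet-style pyramid module is sketched below; it is not a reproduction of any specific cited architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramid(nn.Module):
    """Pool the feature map at several scales, project each pooled map
    with a 1x1 convolution, upsample back, and concatenate with the
    input, so every position sees context beyond its receptive field."""

    def __init__(self, channels, scales=(1, 2, 4, 8)):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(channels, channels // len(scales), 1) for _ in scales)
        self.scales = scales

    def forward(self, x):
        h, w = x.shape[-2:]
        ctx = [F.interpolate(p(F.adaptive_avg_pool2d(x, s)), size=(h, w),
                             mode="bilinear", align_corners=False)
               for p, s in zip(self.proj, self.scales)]
        return torch.cat([x] + ctx, dim=1)
```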

4.4 Siamese Networks

Siamese networks can be extremely useful for several forensic purposes. In fact, by leveraging the Siamese configuration, with two identical networks working in parallel, one can focus on the similarity and differences among patches, making up for the absence of ground-truth data.

This approach is applied in Mayer and Stamm (2018), where a constrained network is used to extract camera-model features, while another network is trained to learn the similarity between pairs of such features. This strategy is further developed in Mayer and Stamm (2019), where a graph-based representation is introduced to better identify the forensic relationships among all patches of an image. A Siamese network is also used in Huh et al. (2018) to establish whether image patches were captured with different imaging pipelines. The proposed approach is self-supervised and localizes image splicings by predicting the consistency of EXIF attributes between pairs of patches, in order to establish whether they came from a single coherent image. Once trained on pristine images accompanied by their EXIF headers, the network can be used on any new image without further supervision. As already seen in Sect. 2.1, a Siamese approach can also help to extract a camera-model fingerprint, where artifacts related to the camera model are emphasized (Cozzolino and Verdoliva 2020, 2018; Cozzolino et al. 2019).

5 Datasets

For two-class learning-based approaches, having good data for training is of paramount importance. By “good”, we mean abundant (for data-hungry modern CNNs), representative (of the many possible manipulations encountered in the wild), and well curated (e.g., balanced, free from biases). Moreover, to assess the performance of new proposals, it is important to compare results on multiple datasets with different features. The research community has made considerable efforts through the years to release a number of datasets with a wide array of image manipulations. However, not all of them possess the right properties to support, by themselves, the development of learning-based methods. In fact, an all too popular approach is to split a single dataset into training, validation, and test sets, carrying out training and experiments on this single source, a practice that may easily induce some form of polarization or overfitting if the dataset is not built with great care.

In Fig. 11.10, we show the most widespread datasets proposed since 2011. During this short time span, the size of the datasets has rapidly increased, by roughly two orders of magnitude, together with the variety of manipulations (see Table 11.2), although the capacity to fool an observer has not grown at a similar pace. In this section, the most widespread datasets are described and their features briefly discussed.

Fig. 11.10

An overview of the most used datasets in the current literature for image manipulation detection and localization. We can observe that more recent datasets are much larger, while the level of realism has not really improved over time. Note that the size of each circle corresponds to the number of samples in the dataset

Table 11.2 List of datasets including generic image manipulations. Some of them target specific manipulations, while others are more varied and contain different types of forgeries

The Columbia dataset (Hsu and Chang 2006) is one of the first datasets proposed in the literature. It has been extensively used; however, it includes only unrealistic forgeries without any semantic meaning. A more realistic splicing dataset is DSO-1 (Carvalho et al. 2013), which in turn is limited in that all images have the same resolution, are stored in uncompressed format, and come with no information on how many cameras were used. A very large dataset of face swapping has been introduced in Zhou et al. (2017); it contains around 4,000 real and manipulated images, the latter created with two different algorithms. Several datasets have been designed for copy-move forgery detection (Christlein et al. 2012; Amerini et al. 2011; Tralic et al. 2013; Cozzolino et al. 2015; Wen et al. 2016), one of the most studied forms of manipulation. Some of them modify the copied object in multiple ways, including rotation, resizing, and change of illumination, to stress the capabilities of copy-move methods.

Many other datasets include various types of manipulations, so as to present a more realistic scenario. The CASIA dataset (Dong et al. 2013) includes both splicings and copy-moves, and inserted objects are post-processed to better fit the scene. Nonetheless, it exhibits a strong polarization, as highlighted in Cattaneo and Roscigno (2014): authentic images can be easily separated from tampered ones, since the two sets were compressed with different quality factors. Forgeries of various nature are also present in the Realistic Tampering Dataset proposed in Korus and Huang (2016), whose images are all uncompressed but feature very realistic manipulations. The Wild Web Dataset, instead, is a collection of real cases from the Internet (Zampoglou et al. 2017).

The U.S. National Institute of Standards and Technology (NIST) has released several large datasets (Guan et al. 2019) aimed at testing forensic algorithms in increasingly realistic and challenging conditions. The first one, NC2016, is quite limited and presents some redundancies, such as the same images repeated multiple times in different conditions. While this can be of interest to study the behavior of some algorithms, it prevents the dataset from being safely split into training, validation, and test sets without overfitting. Much larger and more challenging datasets have been proposed by NIST in subsequent years: NC2017, MFC2018, and MFC2019. They cover very different types of manipulations, resolutions, formats, compression levels, and acquisition devices.

Some very large datasets have been released recently. DEFACTO (Mahfoudi et al. 2019) comprises over 200,000 images with a wide variety of realistic manipulations, both conventional, like splicings, copy-moves, and removals, and more advanced, like face morphing. The PS-Battles Dataset (Heller et al. 2018), instead, provides a varying number of manipulated versions for each original image, for a grand total of 102,028 images. However, it does not provide ground truths, and the originals themselves are not always pristine. SMIFD-500 (Social Media Image Forgery Detection Database) (Rahman et al. 2019) is a set of 500 images collected from popular social media, while FantasticReality (Kniaz et al. 2019) is a dataset of splicing manipulations split into two parts: the “Rough” part contains 8k splicings with obvious artifacts, while the “Realistic” part provides 8k splicings that were retouched manually to obtain a realistic output.

A dataset of manipulated images has also been built in Novozámský et al. (2020). It features 35,000 images collected from 2,322 different camera models and manipulated in various ways, e.g., copy-paste, splicing, retouching, and inpainting. Moreover, it also includes a set of 2,000 realistic fake images downloaded from the Internet, along with the corresponding real images.

Since some methods rely on anomaly detection and are trained only on pristine data, well-curated datasets comprising a large number of real images are also of interest. The Dresden image database (Gloe and Böhme 2010) contains over 14,000 JPEG images from 73 digital cameras of 25 different models. To allow studying the peculiarities of different models and devices, the same scene is captured by a large number of cameras. Raw images can be found both in Dresden and in the RAISE dataset (Dang-Nguyen et al. 2015), composed of 8,156 images taken by 3 different camera models. A dataset comprising both SDR (standard dynamic range) and HDR (high dynamic range) images has been presented in Shaya et al. (2018): a total of 5,415 images captured by 23 mobile devices in different scenarios under controlled acquisition conditions. The VISION dataset (Shullani et al. 2017), instead, includes 34,427 images and 1,914 videos from 35 devices of 11 major brands. They appear both in their original format and after upload/download on various platforms (Facebook, YouTube, WhatsApp), so as to allow studies on data retrieved from social networks. Another mixed dataset, SOCRATES, is proposed in Galdi et al. (2017); it contains 6,200 images and 680 videos captured using 67 smartphones of 14 brands and 42 models.

6 Major Challenges

In this section, we want to highlight some major challenges of current deep-learning-based approaches.

  • Localization versus Detection. To limit computational complexity and memory usage, deep networks work on relatively small input patches. Therefore, to process large images, one must either resize them or analyze them patch-wise. The first solution does not fit forensic needs, as it destroys precious statistical evidence related to local textures. On the other hand, a patch-wise analysis may miss image-level phenomena: a feature extractor trained on patches can only learn features that are good for local decisions, not necessarily for image-level decisions. Most deep-learning-based methods sweep the detection problem under the rug and focus on localization. Detection is then addressed as an afterthought, through some form of processing of the localization map. Not surprisingly, the resulting detection performance is often very poor. Table 11.3 compares the localization and detection performances of some patch-wise methods. We present results in terms of Area Under the Curve (AUC) for detection and the Matthews Correlation Coefficient (MCC) for localization. MCC is robust to unbalanced classes (the forgery is often small compared to the whole image) and its absolute value belongs to the range [0, 1]. It represents the cross-correlation coefficient between the decision map and the ground truth, computed as

    $$\begin{aligned} \text{ MCC } = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \end{aligned}$$

    where

    • TP (true positive): # positive pixels declared positive;

    • TN (true negative): # negative pixels declared negative;

    • FP (false positive): # negative pixels declared positive;

    • FN (false negative): # positive pixels declared negative;

    We can note that, while the localization results are relatively good, detection results are much worse and often no better than coin flipping. A more intense effort on the detection problem seems necessary.

  • Robustness. Forensic analyses typically rely on the weak traces associated with the digital history of the target image or video. Such traces are further weakened, and tend to disappear altogether, in the presence of other quality-impairing processing steps. As an example, consider the image of Fig. 11.11, with a large splicing, localized with high accuracy. After compression or resizing, the localization map becomes fuzzier and the decision unreliable. Likewise, in the copy-move images of Fig. 11.12, the attribution of the two copies becomes incorrect after heavy resizing. On the other hand, most images and videos of interest are found on social networks, where they are routinely resized and recompressed for storage efficiency, with non-deterministic parameters. Therefore, robustness is a central problem that all multimedia forensic methods should confront from the outset.

  • Generalization. Carrying out a meaningful training is a hard task in image forensics; in fact, it is very easy to introduce some sort of bias. For example, one may train and test the network on images acquired by only a limited number of devices, so that it identifies the provenance source rather than the artifacts of the manipulation, or one may use different JPEG compression pipelines for real and fake images, giving rise to different digital histories. A network will no doubt learn all such features if they help discriminate between pristine and tampered images, and may neglect more general but weaker traces of manipulation. This leads to very good performance on a training-aligned test set, and very bad performance on independent tests. Therefore, it is important to avoid biases to the extent possible, and even more important to test the trained networks on multiple, independent test sets.

  • Semantic Detection. Developing the detection issue further, a major problem is telling innocent manipulations apart from malicious ones. An image may be manipulated for many reasons, very frequently just to improve its appearance or to better fit it into the context of interest. Such images should not be flagged, as they are not really of interest and their number would easily overwhelm any subsequent analysis system. A good forensic algorithm should be able to single out only malicious attacks, based on the type of manipulation, often concerning only a small part of the image, and especially based on the context, including all other related media and textual information.

Table 11.3 Localization performance (MCC) and detection performance (AUC) compared. With a few exceptions, the detection performance is very poor (AUC \(=\) 0.5 amounts to coin flipping)
Fig. 11.11

Results of splicing localization in the presence of different levels of compression and resizing. The MCC is 0.998 on the original image; it decreases to 0.985, 0.679, 0.412, and 0.419 for JPEG qualities of 90, 80, 70, and 60, respectively, and to 0.994, 0.991, 0.948, and 0.517 for resizing factors of 0.90, 0.75, 0.50, and 0.40, respectively

Fig. 11.12

Results of copy-move detection and attribution in the presence of resizing

7 Conclusions and Future Directions

Manipulating images and videos is becoming simpler and simpler, and fake media are becoming widespread, with the well-known associated risks. Therefore, the development of reliable multimedia forensic tools is more urgent than ever. In recent years, the main focus of research has been on deepfakes: images and videos (often of persons) fully generated by means of AI tools. Yet, detecting computer-generated material seems to be relatively simple, for the time being. These images and videos lack some characteristic features of real media assets (think of device and model fingerprints) and exhibit, instead, traces of their own generation process (e.g., GAN fingerprints). Moreover, they can be generated in large numbers, providing a virtually unlimited training set for deep learning methods to learn from.

Cheapfakes, instead, though crafted with conventional methods, still evade the scrutiny of forensic tools more easily. For them too, detectors based on deep learning can be expected to provide improved performance. However, creating large unbiased datasets of manipulated images, representative of the wide variety of attacks, is not simple. In addition, the manipulations themselves rely on real (as opposed to generated) data, which is hard to flag as anomalous. Finally, manipulated images are often resized and recompressed, further weakening the feeble forensic traces a detector can rely on. All this said, the current state of deep-learning-based methods for cheapfake detection can be considered promising, and further improvements will certainly come.

Fig. 11.13

Examples of conventional manipulations carried out using deep learning methods: splicing (Tan et al. 2018), copy-move (Thies et al. 2019), inpainting (Zhu et al. 2021)

However, new challenges loom on the horizon. Acquisition and processing technology evolves rapidly, deleting or weakening well-known forensic traces; just think of the effects of video stabilization and new shooting modes on PRNU-based methods (Mandelli et al. 2020; Baracchi et al. 2020). Of course, new traces will appear, but research may struggle to keep up with this fast pace. Likewise, manipulation methods also evolve; see Fig. 11.13. For example, the inpainting process has been significantly improved thanks to deep learning, with recent methods that fill image regions with newly generated, semantically consistent content. In addition, methods have already been proposed to conceal the traces left by various types of attacks. Along the same line, a peculiar problem of deep-learning-based methods is their vulnerability to adversarial attacks: the predicted class (pristine/fake) of an asset can easily be flipped by a number of simple methods, without significantly impairing its visual quality.

As a final remark, we underline the need for interpretable tools. Deep learning methods, despite their excellent performance, act as black boxes, providing little or no hint as to why certain decisions are made. Such decisions may be easily questioned, and can hardly stand in a court of law. Improving explainability, still largely unexplored, is therefore a major topic for current and future work in this field.