1 Introduction

Deep neural networks, most notably Convolutional Neural Networks (CNNs), are increasingly used in image forensic applications due to their superior accuracy in detecting a wide range of image manipulations, including double and multiple JPEG compression (Wang and Zhang 2016; Barni et al. 2017), median filtering (Chen et al. 2015), resizing (Bayar and Stamm 2018), contrast manipulation (Barni et al. 2018), and splicing (Ahmed and Gulliver 2020). Good performance of CNNs has also been reported for image source attribution, i.e., the identification of the camera model that acquired a certain image (Bondi et al. 2017; Freire-Obregon et al. 2018; Bayar and Stamm 2018), and for deepfake detection (Rossler et al. 2019). Despite the good performance they achieve, the use of CNNs in security-oriented applications, like image forensics, is put at risk by the ease with which adversarial examples can be built (Szegedy et al. 2014; Carlini and Wagner 2017; Papernot et al. 2016). As a matter of fact, an attacker who has access to the internal details of the CNN used for a certain image recognition task can easily build an attacked image which is visually indistinguishable from the original one but is misclassified by the CNN. Such a problem is currently the subject of intense research activity, yet no satisfactory solution has been found so far (see Akhtar and Ajmal (2018) for a survey on this topic). The problem is worsened by the observation that adversarial attacks are often transferable from the target network to other networks designed for the same task (Papernot et al. 2016). This means that even in a Limited Knowledge (LK) scenario, wherein the attacker has only partial information about the to-be-attacked network, he can attack a surrogate network mimicking the target one, and with good probability the attack will also be effective on the target network. Such a property opens the way toward very powerful attacks that can be used in real applications wherein the attacker does not have full access to the attacked system (Papernot et al. 2016).

While some recent works have shown that CNN-based image forensic tools are also susceptible to adversarial examples (Guera et al. 2017; Marra et al. 2018), some differences exist between the generation of adversarial examples targeting computer vision networks and networks trained to solve image forensic problems. For instance, in Barni et al. (2019), it is shown that adversarial examples aiming at deceiving image forensic networks are less transferable than those addressing computer vision tasks. A possible explanation for such a different behavior could be the different kinds of features image forensic networks rely on with respect to those used in computer vision applications; however, a definitive answer to this question is not available yet.

With the above ideas in mind, the goal of this chapter is to give a brief introduction to adversarial examples and illustrate their applicability to image forensic networks. Particular attention is given to the transferability of the examples, since it is mainly thanks to this property that adversarial examples can be exploited to develop practical attacks against image forensic detectors. We also describe some possible defenses against adversarial examples, even though, to date, a definitive strategy making the creation of adversarial examples impossible, or at least impractical, is not available.

2 Adversarial Examples in a Nutshell

In this section we briefly review the various approaches proposed so far to generate adversarial examples, without referring specifically to a multimedia forensics scenario. After formalizing the problem, we review the main attack algorithms, then we consider the generation of adversarial examples in the physical domain. Finally, we discuss the problems associated with the generation of adversarial attacks in a black-box setting. Throughout the rest of this chapter we will always refer to the case of deep neural networks aimed at classifying an input image into a number of possible classes, since this is by far the most relevant setting in a multimedia forensic context.

2.1 Problem Definition and Review of the Most Popular Attacks

Let \(x\in [0,1]^n\) denote a clean image whose true label is y and let \(\phi \) be a CNN-based image classifier providing a correct classification of x, namely, \(\phi (x)=y\). An adversarial example is a perturbed image \(x' = x + \delta \) for which \(\phi (x') \ne y\). More specifically, an untargeted attack aims at generating an adversarial example for which \(\phi (x')= y'\) for any \(y' \ne y\), while for a targeted attack the wrong label \(y'\) must correspond to a specific target class \(y_t\).

In practice, we require that \(\delta \) is a very small quantity, so that the perturbed image \(x'\) is perceptually identical to x. Such a goal is usually achieved by constraining the \(L_p\) norm of \(\delta \) to be lower than a small positive value \(\varepsilon \), that is, \(\left\| \delta \right\| _p \le \varepsilon \).

Starting from the seminal work by Szegedy et al. (2014), many approaches to generate the adversarial perturbation \(\delta \) have been proposed, leading to a wide variety of different algorithms an attacker can rely on. The proposed approaches include gradient-based attacks (Szegedy et al. 2014; Goodfellow et al. 2015; Kurakin et al. 2017; Madry et al. 2018; Carlini and Wagner 2017), transfer-based attacks (Dong et al. 2018, 2019), decision-based attacks (Chen et al. 2017; Ilyas et al. 2018; Li et al. 2019), and so on. Most methods assume that the attacker has full knowledge of the to-be-attacked model \(\phi \), a situation usually referred to as a white-box attack scenario. It goes without saying that this may not be the case in practical applications, for which more flexible black-box or gray-box attacks are needed. We will discuss the main difficulties associated with black-box attacks in Sect. 16.2.2.

In the following we review some of the most popular gradient-based white-box attacks.

  • L-BFGS Attack. In Szegedy et al. (2014), Szegedy et al. first proposed to generate adversarial examples fooling a CNN-based classifier while minimizing the perturbation of the image at the same time. The problem is formulated as

    $$\begin{aligned} \begin{array}{lll} \text{ minimize } &{} \left\| \delta \right\| _2^2 \\ \text {subject to} &{} \phi (x+\delta )=y_t \\ &{} x + \delta \in [0,1]^n \end{array}. \end{aligned}$$
    (16.1)

    Here, the perturbation introduced by the attack is limited by using the squared \(L_2\) norm. Solving the above minimization is not an easy task, so the adversarial examples are looked for by solving a more manageable problem:

    $$\begin{aligned} \begin{array}{lll} \mathop {\mathrm {minimize}}\limits _{\delta } &{} \lambda \cdot \left\| \delta \right\| _2^2 + \mathcal L({x+\delta }, y_t) \\ \text {subject to} &{} x + \delta \in [0,1]^n \end{array} \end{aligned}$$
    (16.2)

    where \(\mathcal L({x+\delta }, y_t)\) is a loss term forcing the CNN to assign the target label \(y_t\) to \(x+\delta \) and \(\lambda \) is a parameter balancing the importance of the two terms. In Szegedy et al. (2014), the usual cross-entropy loss is adopted as \(\mathcal L({x+\delta }, y_t)\). The latter optimization problem is solved by the box-constrained L-BFGS method. Moreover, the optimal value of \(\lambda \) is determined by conducting a binary search within a predefined range of values.

  • Fast Gradient Sign Method (FGSM). A drawback of the L-BFGS method is its computational complexity. For this reason, Goodfellow et al. (2015) proposed a faster attack. The new method starts from the observation that, for small values of the perturbation, the output of the CNN can be computed by using a linear approximation, and that, due to the large dimensionality of x, a small perturbation of the input in the direction of the gradient can result in a large modification of the output. Accordingly, the adversarial examples are generated by adding a small perturbation to the clean image as follows:

    $$\begin{aligned} x' = x + \varepsilon \cdot \text {sign}(\nabla _x \mathcal L(x,y)) \end{aligned}$$
    (16.3)

    where \(\nabla _x \mathcal L\) is the gradient of \(\mathcal L\) with respect to x, and the adversarial perturbation is constrained by \(\left\| \delta \right\| _\infty \le \varepsilon \). Using the sign of the gradient ensures that the perturbation stays within the allowed limits. Since the adversarial examples are generated based on the gradient of the loss function with respect to the input image x, the perturbation can be computed by a single run of the back-propagation algorithm, resulting in a very efficient attack. A minimal code sketch of FGSM and of its iterative extension (described next) is reported at the end of this list.

  • Iterative FGSM (I-FGSM). The FGSM attack is a one-step attack whose strength can be modulated only by increasing the value of \(\varepsilon \), with the risk that the linear approximation the method relies on is no longer valid. In Kurakin et al. (2017), the one-step FGSM is extended to an iterative, multi-step, method by applying FGSM multiple times with a small step size, recomputing the gradient at each step. The perturbed pixels are then clipped to an \(\varepsilon \)-neighborhood of the original input, to ensure that the distortion constraint is satisfied. The N-th step of the resulting method, referred to as I-FGSM (or sometimes BIM, Basic Iterative Method), is defined by

    $$\begin{aligned} x'_0 = x, \quad x'_{N} = \text {clip}_{x, \varepsilon }\{ x'_{N-1} + \alpha \cdot \text {sign}(\nabla _{x} \mathcal L( x'_{N-1}, y))\} \end{aligned}$$
    (16.4)

    where \(\alpha \) is a small step size used to update the attack at each iteration. Similarly to I-FGSM, a universal first-order adversary, called projected gradient descent (PGD), has been proposed in Madry et al. (2018). Compared with I-FGSM, in which the starting point of the iterations is exactly the input image, the starting point of PGD is a random point within the allowed perturbation ball around the input image, obtained by adding a small random perturbation to it. The following iterations are then conducted in the same way as in I-FGSM.

  • Jacobian-based Saliency Map Attack (JSMA). According to the JSMA algorithm proposed by Papernot et al. (2016), the adversarial examples are generated on the basis of an adversarial saliency map, which indicates the contribution of the image pixels to the classification result. In particular, the saliency map is computed from the forward derivative (Jacobian) of the network outputs with respect to the input pixels. According to the saliency map, at each iteration, a pair of pixels whose modification leads to a significant increase of the output values assigned to the target class and/or a decrease of the values assigned to the other classes is perturbed by an amount \(\theta \). The iterations stop when the perturbed image is classified as belonging to the target class or a maximum distortion level is reached. To keep the perturbation imperceptible, each pixel can be modified up to a maximum of T times.

  • Carlini & Wagner Attack (C&W). In Carlini and Wagner (2017), Carlini and Wagner proposed to solve the optimization problem in (16.2) in a more efficient way. Specifically, two modifications of the basic optimization algorithm are proposed. On one hand, the loss term is replaced by a properly designed function directly related to the maximum difference between the logit of the target class and those of the other classes. On the other hand, the box constraint used to keep the perturbed pixels in a valid range is automatically satisfied by letting \(\delta = \frac{1}{2}(\tanh (\omega )+1) - x\), where \(\omega \) is a new optimization variable replacing \(\delta \). In this way, the box-constrained optimization problem in (16.2) is transformed into the following unconstrained problem:

    $$\begin{aligned} \begin{array}{lll} \mathop {\mathrm {minimize}}\limits _{\omega }&\left\| \frac{1}{2}(\tanh (\omega ) + 1) - x \right\| _2^2 + \lambda \cdot g(\frac{1}{2}(\tanh (\omega ) + 1)) \end{array} \end{aligned}$$
    (16.5)

    where the function g is defined as

    $$\begin{aligned} g(x') = \max (\max _{i\ne t} z_i(x') - z_t(x'), -\kappa ), \end{aligned}$$
    (16.6)

    and where \(z_i\) is the logit corresponding to the \(i\)-th class, and the parameter \(\kappa > 0\) is used to encourage the attack to generate adversarial examples classified with high confidence. Finally, the adversarial examples are generated by solving the modified optimization problem using the Adam optimizer.
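
To make the gradient-based attacks described above more concrete, the following is a minimal PyTorch sketch of FGSM and I-FGSM, assuming a differentiable classifier `model` that returns logits, a batch `x` of images with pixel values in [0, 1], and a batch `y` of integer labels. Parameter names and values are illustrative and do not come from the original papers' implementations.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM: move each pixel by eps in the direction of the
    gradient sign, then clip back to the valid dynamic range."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

def i_fgsm(model, x, y, eps, alpha, n_iter):
    """Iterative FGSM (BIM): repeat small FGSM steps of size alpha and,
    after every step, clip the result to the L-infinity ball of radius
    eps around the original image (and to the valid pixel range)."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(n_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()
        x_adv = torch.min(torch.max(x_adv, x_orig - eps), x_orig + eps)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```

A PGD sketch would differ only in the initialization, which adds a small random perturbation to x before the first iteration.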

2.2 Adversarial Examples in the Physical Domain

The algorithms described in the previous sections work in the digital domain, since they directly modify the digital images fed to the CNN. In many applications, however, it is required that the adversarial examples are generated in the physical domain, producing real-world objects that, when sensed by the sensors of the attacked system and fed to the CNN, cause a misclassification error. In this setting, and focusing on the case of still images, which is the most relevant one for multimedia forensics applications, the adversarial images must first be printed or displayed on a screen and then shown to a camera, which digitizes them again before feeding them to the CNN classifier. This process, sometimes referred to as image rebroadcast, involves a digital-to-analog and an analog-to-digital conversion that may degrade the adversarial perturbation, thus making the attack ineffective. The rebroadcast process involved by a physical domain attack is exemplified in Fig. 16.1.

Fig. 16.1 Physical domain and digital domain attacks

As shown in the figure, during the rebroadcast process, the image is affected by several forms of degradation, including noise addition, light changes, and geometric transformations. If no countermeasure is taken, it is very likely that the invisible perturbations introduced to generate the adversarial examples are damaged to the point of making the attack ineffective.

Recognizing the above difficulties, several methods have been proposed to generate physical domain adversarial examples. The concept was first brought up by Kurakin et al. (2017). In that paper, the images taken after rebroadcasting are geometrically calibrated before being fed to the classifier. In this way, the adversarial perturbation is not affected by any geometric distortion, hence easing the attack. More realistic scenarios are considered in later works. In the seminal paper by Athalye et al. (2018), a method is proposed to generate physical adversarial examples that are robust to various distortions, including changes of the viewing angle and distance. Specifically, a procedure named Expectation Over Transformation (EOT) is applied to generate the adversarial examples, ensuring that they are effective across a number of possible transformations.

Let x be the to-be-attacked image, with ground truth label y, and let \(y_t\) be the target class of the attack. Without loss of generality, let the output of the classifier be obtained by computing the probability \(P(y_i|x)\) that the input image x belongs to class \(y_i\) and then choosing the class with the maximum probability, namely,

$$\begin{aligned} \phi (x) = \arg \max _i P(y_i|x). \end{aligned}$$
(16.7)

The generation of adversarial examples is formalized as an optimization problem that maximizes the likelihood of the target class \(y_t\) over an \(\epsilon \)-radius ball around the original image x, i.e.,

$$\begin{aligned} \arg \max _{\delta } \quad&\log P(y_t|x+\delta )\end{aligned}$$
(16.8)
$$\begin{aligned} \text {subject to} \quad&||\delta ||_p \le \epsilon . \end{aligned}$$
(16.9)

To ensure the robustness of the attack, EOT considers a set T of transformations (usually consisting of geometric transformations) and a probability distribution \(P_T\) over T, and maximizes the likelihood of the target class averaged over transformed versions of the input image, generated according to \(P_T\). A similar strategy is applied to the constraint, i.e., EOT constrains the expected distance between the transformed versions of the adversarial example and of the original image. As a result, with EOT the optimization problem is reformulated as

$$\begin{aligned} \arg \max _{\delta } \quad&{E}_{P_T}[\log P(y_t|t(x+\delta ))]\end{aligned}$$
(16.10)
$$\begin{aligned} \text {subject to} \quad&{E}_{P_T}[||t(x+\delta )-t(x)||_p] \le \epsilon \end{aligned}$$
(16.11)

where t is a transformation in T and \({E}_{P_T}\) denotes the expectation with respect to \(P_T\). The basic EOT approach has been extended in many ways, including the use of a wider variety of transformations and the optimization of the perturbation over more than a single image. For instance, Sharif et al. (2016) have proposed a way to generate adversarial examples in the physical domain for face recognition systems. They restrict the perturbation to a small area around the eye-glass frame and increase the robustness of the perturbation by optimizing it over a set of properly chosen images. Eykholt et al. (2018) have proposed an attack capable of generating adversarial examples under even more realistic conditions. The attack targets a road sign classification system possibly deployed on board autonomous vehicles. In addition to sampling the images from synthetic transformations, as done by the basic EOT attack, they also generated experimental data containing actual physical condition variability. Their experiments show that the generated adversarial examples are so effective that they can be printed on stickers, applied to road signs, and fool a vehicular traffic sign recognition system in drive-by tests.
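
The core of EOT can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration, assuming a classifier `model` returning logits, an integer target label `y_t`, and a single differentiable transformation (a random rotation) standing in for the whole set T; moreover, the expected-distance constraint in (16.11) is replaced, for simplicity, by an \(L_\infty \) projection in the digital domain. All parameter names and values are illustrative.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def random_transform(img):
    # Sample a transformation t ~ P_T; here a random rotation in [-15, 15] degrees.
    angle = float(torch.empty(1).uniform_(-15.0, 15.0))
    return TF.rotate(img, angle)

def eot_attack(model, x, y_t, eps=0.03, alpha=0.005, n_iter=100, n_samples=10):
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(n_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        # Monte Carlo estimate of E_{t~P_T}[log P(y_t | t(x + delta))]
        loss = 0.0
        for _ in range(n_samples):
            logits = model(random_transform(x_adv))
            loss = loss + F.log_softmax(logits, dim=1)[:, y_t].mean()
        loss = loss / n_samples
        loss.backward()
        # gradient ascent on the target-class likelihood
        x_adv = x_adv + alpha * x_adv.grad.sign()
        # simplified constraint: L-infinity projection around the original image
        x_adv = torch.min(torch.max(x_adv, x_orig - eps), x_orig + eps)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```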

Zhang et al. (2020) further enlarged the set of transformations to implement a physical domain attack against a face authentication system equipped with a spoofing-detection module, whose aim is to distinguish real faces from rebroadcast face photos. This situation presents an additional challenge, as explained in the following. To attack the face authentication system in the physical world, the attacker must present the perturbed images in front of the system by printing them or showing them on a screen. The rebroadcast procedure, then, introduces new physical traces into the image fed to the classifier, which, having been trained to recognize rebroadcast traces, can still classify the image correctly despite the presence of the adversarial perturbation. To exit this apparent deadlock, the authors propose to utilize the EOT framework to design an attack that pre-emptively takes into account the new traces introduced by the rebroadcast procedure. The experiments reported in Zhang et al. (2020) demonstrate that EOT makes it possible to carry out an adversarial attack even in such difficult conditions.

2.3 White Versus Black-Box Attacks

An adversary may have different levels of knowledge about the algorithm used by the forensic analyst. From this point of view, adversarial examples in DL are commonly classified as white-box and black-box attacks (Yuan et al. 2019). In a white-box scenario, the attacker knows everything about the target forensic classifier, that is, he knows all the details of \(\phi (\cdot )\). In a black-box scenario, instead, the attacker knows nothing about it, or has only a very limited knowledge of it. In some works, following the taxonomy in Zheng and Hong (2018), the class of gray-box attacks is also considered, referring to a situation wherein the attacker has a partial knowledge of the classifier; for instance, he knows the architecture of the classifier, but has no knowledge of its internal parameters (e.g., he does not know some of the hyper-parameters or, more in general, the training data \(\mathcal {D}_{tr}\) used to train the classifier).

In the white-box scenario, the attacker is also assumed to know the defence strategy adopted by the forensic analyst to protect \(\phi (\cdot )\), hence he can take the presence of such a defence into account during the attack. Having full access to \(\phi (\cdot )\), the most common white-box attacks rely on some form of gradient descent (Goodfellow et al. 2015; Kurakin et al. 2017; Dong et al. 2018), or directly solve the optimization problem underlying the attack as in Carlini and Wagner (2017); Chen et al. (2018).

When the attacker has only a partial knowledge of the algorithm (gray-box setting), he can build his own version of the classifier \(\phi '(\cdot )\), usually referred to as substitute or surrogate classifier, providing a hopefully good approximation of \(\phi (\cdot )\), and then use \(\phi '(\cdot )\) to carry out the attack. Hence, the attacker implements a white-box attack against \(\phi '(\cdot )\), hoping that the attack will also work against \(\phi (\cdot )\) (attack transferability, see Sect. 16.3.1). Mathematically speaking, the difference between white-box and surrogate-based attacks is the loss function that the adversary seeks to maximize, namely \(\mathcal {L}(\phi (\cdot ), \cdot )\) in the white-box case and \(\mathcal {L}(\phi '(\cdot ), \cdot )\) in the surrogate case. As long as the attack crafted against \(\phi '(\cdot )\) transfers to \(\phi (\cdot )\), adversarial samples crafted for \(\phi '(\cdot )\) will also be misclassified by \(\phi (\cdot )\) (Papernot et al. 2016). The substitute classifier \(\phi '(\cdot )\) is built by exploiting the available information and making an educated guess about the unknown parameters.

Black-box attacks often assume that the target network can be queried as an oracle, for a limited number of times, to get useful information to build the substitute classifier (Liu et al. 2016). When the score values or the logit values \(z_i\) can be obtained by querying the target model (score-based black-box attacks), the gradient can be estimated by drawing some random samples and acquiring the corresponding loss values (Ilyas et al. 2018). A more challenging scenario is when only hard-label predictions, i.e., the predicted classes, are returned by the queried model (Brendel et al. 2017).
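
As an illustration of the score-based setting, the following is a hedged sketch of query-based gradient estimation via random sampling, in the spirit of the estimators cited above; `query_loss` is a hypothetical oracle returning the loss value of the target model for a given input, and all parameter values are illustrative.

```python
import numpy as np

def estimate_gradient(query_loss, x, sigma=0.001, n_queries=50):
    """Approximate the gradient of the (unknown) loss at x by averaging
    finite differences along random Gaussian directions (NES-style)."""
    grad = np.zeros_like(x, dtype=np.float64)
    for _ in range(n_queries):
        u = np.random.randn(*x.shape)
        # antithetic sampling: query at x + sigma*u and x - sigma*u
        g = (query_loss(x + sigma * u) - query_loss(x - sigma * u)) / (2 * sigma)
        grad += g * u
    return grad / n_queries
```

The estimated gradient can then be plugged into any of the iterative attacks of Sect. 16.2.1 in place of the true gradient.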

In the DL literature, several approaches have been proposed to craft adversarial examples against substitute models in such a way as to maximize the probability that they can be transferred to the original model. One approach is the translation-invariant attack method (Dong et al. 2019), which, instead of optimizing the objective function on the perturbed image only, optimizes it over a set of shifted versions of the image. Another closely related approach improves the transferability of adversarial examples by creating diverse input patterns (Xie et al. 2019). Instead of only using the original images to generate adversarial examples, the method applies some image transformations to the inputs with a given probability at each iteration of the gradient-based attack. The transformations include resizing and padding.

A general and simple method to increase the transferability of adversarial examples is described in Sect. 16.3.2.

3 Adversarial Examples in Multimedia Forensics

Given the recent trend toward the use of deep learning architectures in Multimedia Forensics, several Counter-Forensic (CF) attacks against deep learning models have been developed in the last years. In this case, one of the main advantages of CNN-based classifiers and detectors, namely, their ability to learn forensic features directly from the input data (be they images, videos, or audio tracks), turns into a weakness. An informed attacker can in fact generate his attacks directly in the sample domain without the need to map them from the feature domain to the sample domain, as happens with conventional methods based on hand-crafted features.

The vulnerability of multimedia forensic tools based on deep learning has been addressed by many works, as summarized in Marra et al. (2018). The underlying motivations for the vulnerability of CNNs used in multimedia forensics are basically the same as those characterizing image recognition applications, with a prominent role played by the huge dimensionality of the space of possible inputs, which is substantially larger than the set of images used to train the model. As a result, in a white-box scenario, the attacker can easily create slightly modified images that fall into an ‘unseen’ region of the image space, thus forcing a misclassification error. Such a weakness of deep learning-based approaches represents a serious threat in multimedia forensics, where the tools are designed to work under intrinsically adversarial conditions.

Adversarial examples that have been developed against CNNs in multimedia forensics are reviewed in the following.

The first targeted attack based on adversarial examples was proposed in Guera et al. (2017) to fool a CNN-based camera model identification system. By relying on adversarial examples, the authors propose a counter-forensic method for slightly altering an image in such a way as to change the estimated camera model. A state-of-the-art CNN-based camera model detector based on the DenseNet architecture (Huang et al. 2017) is considered in the analysis. The counter-forensic method uses both the FGSM attack and the JSMA attack to craft the adversarial images, obtaining good performance in both cases.

Adversarial attacks against CNN-based image forensic methods have been derived in Gragnaniello et al. (2018) for several common manipulation detection tasks. The following common processing operations have been considered: image blurring, applied with different variances of the Gaussian filter; image resizing, both downscaling and upscaling; JPEG compression, with different qualities; and median filtering, with window sizes \(3\times 3\) and \(7\times 7\). Several CNN models have been successfully attacked by means of adversarial perturbations obtained via the FGSM algorithm applied with different strengths \(\varepsilon \).

Adversarial perturbations have been shown to offer poor robustness to image processing operations, e.g., image compression (Marra et al. 2018). In particular, rounding to integers is sometimes sufficient to wash out the perturbation and make an adversarial example ineffective. A gradient-inspired pixel domain attack generating adversarial examples against CNN-based forensic detectors directly in the integer domain has been proposed in Tondi (2018). In contrast to standard gradient-based attacks, which perturb the attacked image based on the gradient of the loss function w.r.t. the input, in the attack proposed in Tondi (2018) the gradient of the output score function is approximated with respect to the pixel values, incorporating in the optimization the constraint that the inputs are integers. The performance of the integer-based attack has been assessed against three image manipulation detectors, for median filtering, resizing, and contrast adjustment. Moreover, while common gradient-based adversarial attacks, like FGSM, I-FGSM, and C&W, work in a white-box setting, the attack proposed in Tondi (2018) is a black-box one, since it does not need any knowledge of the internal parameters of the model, requiring only that the network can be queried as an oracle and the output observed.
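
To give an idea of how such an attack can operate, the following is a rough, hypothetical sketch of a query-based, integer-domain pixel attack; it illustrates the general idea only and is not the exact algorithm of Tondi (2018). The function `score` is a hypothetical oracle returning the detector's soft output for the manipulated class of a uint8 grayscale image, and all parameters are illustrative.

```python
import numpy as np

def integer_pixel_attack(score, x, n_iter=100, n_probe=256, n_change=64):
    """x: uint8 grayscale image; at each iteration, estimate the impact of
    +/-1 integer changes on a random subset of pixels and apply the most
    effective ones until the detector's score drops below 0.5."""
    x_adv = x.astype(np.int16).copy()
    h, w = x_adv.shape[:2]
    for _ in range(n_iter):
        base = score(np.clip(x_adv, 0, 255).astype(np.uint8))
        if base < 0.5:
            break  # the detector now labels the image as non-manipulated
        probes = []
        idx = np.random.choice(h * w, size=n_probe, replace=False)
        for k in idx:
            i, j = divmod(k, w)
            for step in (-1, 1):
                trial = x_adv.copy()
                trial[i, j] += step
                delta = score(np.clip(trial, 0, 255).astype(np.uint8)) - base
                probes.append((delta, i, j, step))
        # apply the n_change single-pixel changes that decrease the score most
        probes.sort(key=lambda t: t[0])
        for delta, i, j, step in probes[:n_change]:
            if delta < 0:
                x_adv[i, j] += step
    return np.clip(x_adv, 0, 255).astype(np.uint8)
```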

All the above attacks are targeted to a specific CNN classifier. Carrying out an effective attack in a completely black-box or no-knowledge scenario, that is, generating an attack that can be effectively transferred to the (unknown) model used by the analyst, turns out to be a difficult task, which is exacerbated by the complexity of the decision functions learnt by neural networks. For this reason, and given that white-box attacks can hardly be applied in practical applications, studying the transferability of adversarial examples plays a major role in assessing the security of DL-based multimedia forensic tools.

3.1 Transferability of Adversarial Examples in Multimedia Forensics

It has been demonstrated in several studies concerning computer vision applications (Tramèr et al. 2017; Papernot et al. 2016) that, in a gray-box or black-box scenario, adversarial examples crafted to attack a given CNN model are also effective against other models designed for the same task. In this section, we focus on an image forensics context and review the main results reported so far regarding the transferability of adversarial examples in this setting.

Several works have investigated the transferability of adversarial examples in a network mismatch scenario (Marra et al. 2018; Gragnaniello et al. 2018). Focusing on camera model identification, in Marra et al. (2018), Marra et al. analyzed the transferability of adversarial examples generated by means of FGSM among different networks. Four different network architectures for camera model identification, trained on the VISION dataset (Shullani et al. 2017), are utilized for the experiments, namely, Shallow CNN (Bondi et al. 2017), DenseNet-40 (Huang et al. 2017), DenseNet-121 (Huang et al. 2017), and XceptionNet (Chollet 2017). The classification accuracy achieved by these models is 80.77% for Shallow CNN, 87.96% for DenseNet-40, 93.88% for DenseNet-121, and 95.15% for XceptionNet. The transferability performance for the case of cross-network mismatch is shown in Table 16.1. As can be seen, the classification accuracy on attacked images remains pretty high when the network used to build the adversarial examples is different from the one under attack, thus proving the poor transferability of the attacks.

Table 16.1 Classification accuracy (%) for cross-network mismatch in the context of camera model identification. The FGSM attack is used

A more comprehensive investigation of the transferability of adversarial examples in different mismatch scenarios has been carried out in Barni et al. (2019). As in Barni et al. (2019), we report below the transferability of adversarial examples in various settings, according to the kind of attack used to create the examples and the mismatch existing between the source and the target models.

Type of attacks

We consider the transferability of adversarial examples generated according to two of the most popular attacks proposed so far, namely I-FGSM and JSMA (see Sect. 16.2.1 for a detailed description of such attacks). The attacks were implemented by relying on the Foolbox package (Rauber et al. 2017).

Source/target model mismatch

To carry out a comprehensive evaluation of the transferability of adversarial examples, three different mismatch situations between the network used to create the adversarial examples (hereafter referred to as source network—SN) and the network the adversarial examples should be transferred to (hereafter named target network—TN) are considered. Specifically, we consider the following situations: (i) cross-network mismatch, wherein different network architectures are trained on the same dataset; (ii) cross-training-set mismatch, according to which the same network architecture is trained on different datasets; and (iii) cross-network-and-training-set mismatch, wherein different architectures are trained on different datasets.

Network architectures

For cross-network mismatch, we consider BSnet (Bayar and Stamm 2016), designed for the detection of a number of widely used image manipulations, and BC+net (Barni et al. 2018) proposed to detect generic contrast adjustment operations. In particular, BSnet consists of 3 convolutional layers, 3 max-pooling layers, and 3 fully-connected layers, and the filters used in the first convolutional layer are constrained to extract residual-based features from images. As for BC+net, it has 9 convolutional layers, and no constraint is applied to the filters used in the network. The results reported in the following refer to two common image manipulations, namely, median filtering (with window size \(5\times 5\)) and image resizing (by a factor of 0.8).

Datasets

For cross-training-set mismatch, we consider two datasets: the RAISE (R) dataset (Dang-Nguyen et al. 2021) and the VISION (V) dataset (Shullani et al. 2017). More in detail, RAISE consists of 8156 high-resolution uncompressed images of size \(4288 \times 2848\), 2000 of which were used for the experiments. The VISION dataset contains 11,732 JPEG images, with a minimum resolution of \(2336 \times 4160\) and a maximum one of \(3480 \times 4640\); 2000 of them were used for the experiments. Results refer to JPEG images compressed with a quality factor larger than 97. The images were split into training, validation, and test sets without overlap. All color images were first converted to gray-scale.

Experimental methodology

Before evaluating the transferability of adversarial examples, the detection models were first trained. The input patch size was set to \(128\times 128\). To train the BSnet models, 200,000 patches per class were considered, while 10,000 patches were used for testing. As to the BC+net models, \(10^6\) patches were used for training, \(10^5\) for validation, and \(5\times 10^4\) for testing. The Adam optimizer was used with a learning rate of \(10^{-4}\) for both networks, and the training batch size was set to 32 patches. BSnet was trained for 30 epochs and BC+net for 3 epochs. The detection accuracies of the models are given in Table 16.2. For the sake of clarity, the symbol \(\phi _{\text {net}}^{\text {DB}}\) is used to denote the trained models, where net \(\in \) {BSnet, BC+net} corresponds to the network architecture, and DB \(\in \) {R, V} is the dataset used for training.

Table 16.2 Detection accuracy (%) of the models in the absence of attacks

In counter-forensic applications, it is reasonable to assume that the attacker is interested only in passing off the manipulated images as original ones to hide the traces of manipulations. Therefore, for all the experiments, 500 manipulated images were attacked to generate the adversarial examples.

For I-FGSM, the number of iterations was set to 10, and the optimal attack strength used in each iteration was determined by spanning the range \([0:\varepsilon _s:0.1]\), where \(\varepsilon _s\) was set to 0.01 and 0.001 in different cases. As for JSMA, the perturbation strength \(\theta \) was set to 0.01 and 0.1. For each attack, two different attack strengths were considered. The average PSNR of the adversarial examples is always larger than 40 dB.

Transferability results

For each setting, we show the average PSNR of the adversarial examples, and the attack success rate on the target network, ASR\(_{\text {TN}}\), which corresponds to the transferability degree (the attack success rate on the source network—ASR\(_{\text {SN}}\)—is always close to 100%, hence it is not reported in the tables).

Table 16.3 Attack transferability for cross-network mismatch

Table 16.3 reports the results regarding cross-network transferability. All models were trained on the RAISE dataset. In most of the cases, the adversarial examples are not transferable, indicating a likely failure of the attacks in a black-box scenario. The only exception is median filtering when the attack is carried out by I-FGSM, for which we have ASR\(_{\text {TN}} = 82.5\%\) with an average PSNR of 40 dB.

Table 16.4 Transferability of adversarial examples for cross-training-set mismatch

With regard to cross-training-set transferability, the BSnet network was trained on the RAISE and the VISION datasets, obtaining the results shown in Table 16.4. For the case of median filtering, the values of ASR\(_{\text {TN}}\) are relatively high for I-FGSM with \(\varepsilon _s=0.01\), while the transferability is poor with the other attacks. With regard to image resizing, when the source network is trained on RAISE, the ASR\(_{\text {TN}}\) reaches a high level when the stronger attacks are employed (I-FGSM with \(\varepsilon _s=0.01\), and JSMA with \(\theta =0.1\)). However, the adversarial examples are never transferable when the SN is trained on VISION, showing that the transferability between models trained on different datasets is not symmetric.

The results for the case of strongest mismatch (cross-network-and-training-set) are illustrated in Table 16.5. Only the case of median filtering attacked by I-FGSM with \(\varepsilon _s=0.01\) achieves a large transferability, while for all the other cases the transferability is nearly null.

In summary, in contrast to computer vision applications, adversarial examples targeting image forensics networks are generally non-transferable. Accordingly, from the attacker’s side, it is important to investigate if the attack transferability can be improved by increasing the attack strength used to generate adversarial examples lying deeper inside the target region of the attack.

3.2 Increased-Confidence Adversarial Examples with Improved Transferability

The lack of transferability of adversarial examples in image forensic applications is partly due to the fact that, in most implementations, the attacks aim at generating adversarial examples with minimum distortion. Consequently, the resulting adversarial examples are close to the decision boundary and a small difference between the decision boundaries of the source and target networks can lead to the failure of the attack on the TN. To overcome this problem, the attacker may want to generate adversarial examples lying deeper into the target region of the attack. However, given the complexity of the decision boundary learned by CNNs, it is hard to control the distance of the adversarial examples from the boundary. In addition, increasing the attack strength by simply going on with the attack iterations until a limit value of PSNR is reached may not be effective, since a lower PSNR does not necessarily result in a stronger attack with higher transferability (see Table 16.3).

Table 16.5 Attacks transferability for cross-network-and-training-set mismatch

In order to generate adversarial examples with higher transferability that can be used for counter-forensic applications, a general strategy consists in increasing the confidence of the misclassification, where the confidence is defined as the maximum difference between the logit of the target class and those of the other classes (Li et al. 2020). Specifically, focusing on the binary case, for a clean image x with label \(y=i \ (i=0,1)\), a perturbed image \(x'\) is looked for such that \(z_{1-i}(x')-z_i(x')>c\), where \(c>0\) is the desired confidence value. By implementing the above stop condition, the most popular attacks (such as I-FGSM and PGD) can be modified to generate adversarial examples with increased confidence.
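
The following is a minimal sketch, under illustrative assumptions, of how the increased-confidence stop condition can be plugged into I-FGSM for a binary forensic task; it sketches the general strategy rather than the exact implementation of Li et al. (2020). The classifier `model` is assumed to return the two logits \((z_0, z_1)\), `x` is a batch of images in [0, 1], and `y` is the integer true label shared by the batch.

```python
import torch
import torch.nn.functional as F

def increased_confidence_ifgsm(model, x, y, c, eps, alpha, max_iter=200):
    """Run I-FGSM until the logit margin of the wrong class exceeds c."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    y_batch = torch.full((x.shape[0],), y, dtype=torch.long, device=x.device)
    for _ in range(max_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        logits = model(x_adv)
        margin = logits[:, 1 - y] - logits[:, y]   # confidence of the wrong class
        if bool((margin > c).all()):
            break                                  # increased-confidence stop condition
        loss = F.cross_entropy(logits, y_batch)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()  # move away from the true class
        x_adv = torch.min(torch.max(x_adv, x_orig - eps), x_orig + eps)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```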

In the following, we evaluate the transferability of increased-confidence adversarial examples by adopting a methodology similar to that used in Sect. 16.3.1.

Attacks

We report the results obtained by applying the increased-confidence strategy to four popular attacks, namely, I-FGSM, Momentum-based I-FGSM (MI-FGSM) (Dong et al. 2018), PGD, and the C&W attack. In particular, MI-FGSM is a method proposed explicitly to improve the transferability of the attacks by integrating a momentum term, controlled by a decay factor \(\mu \), into each iteration of I-FGSM.

Networks

Three network architectures are considered in the cross-network experiments, that is: BSnet (Bayar and Stamm 2016), BC+net (Barni et al. 2018), and the VGG-16 network (VGGnet), commonly used in computer vision applications (VGGnet consists of 13 convolutional layers and 3 fully connected layers; more information can be found in Simonyan and Zisserman (2015)). With regard to the image manipulation tasks, we report results referring to median filtering, image resizing, and addition of white Gaussian noise (AWGN) with zero mean and unitary standard deviation.

Datasets

The experiments on cross-training-set mismatch are based on the RAISE and VISION datasets.

Experimental setup

BSnet and BC+net were trained by using the same setting described in Sect. 16.3.1. With regard to VGGnet, \(10^5\) patches were used for training and validation, and \(10^4\) patches were used for testing. The models were trained for 50 epochs by using the Adam optimizer with a learning rate of \(10^{-5}\). The detection accuracies in the absence of attacks range from 98.1 to 99.5% for median filtering detection, from 96.6 to 99.0% for image resizing, and from 98.3 to 99.9% for AWGN detection.

The attacks were applied to 500 manipulated images for each task. The Foolbox toolbox (Rauber et al. 2017) was employed to generate increased-confidence adversarial examples with the new stop condition. Specifically, for the C&W attack, all the parameters were set to their default values. For PGD, the binary search was conducted with initial values \(\varepsilon =0.3\) and \(\alpha =0.005\), for 100 iterations. For I-FGSM, we also applied 100 steps, with the optimal stepsize determined in the range [0.001 : 0.001 : 0.1]. Finally, for MI-FGSM, all the parameters were set to the same values as for I-FGSM, except the decay factor, for which we let \(\mu =0.2\).

Transferability results

In the following we show some results demonstrating the improved transferability of adversarial examples with increased confidence. For each attack, we report the ASR\(_{\text {TN}}\) and the average PSNR. The ASR\(_{\text {SN}}\) is always close to 100% and hence it is omitted in the tables. Since the confidence value c depends on the logits of the SN, different values were chosen for different networks.

Table 16.6 Attacks transferability for cross-network mismatch (SN = \(\phi _{\text {VGGnet}}^{\text {R}}\), TN = \(\phi _{\text {BSnet}}^{\text {R}}\)). The ASR\({_\text {TN}}\) (%) and average PSNR (dB) of adversarial examples is reported

Table 16.6 reports the results for cross-network mismatch, showing the transferability when (SN,TN) = \((\phi _{\text {VGGnet}}^{\text {R}}, \phi _{\text {BSnet}}^{\text {R}})\) for three different image manipulations. As expected, by using an increased confidence, the transferability of the adversarial examples improves significantly in all these cases. Specifically, the ASR\(_{\text {TN}}\) reaches a maximum of 96.6, 64.4, and 92.2% for median filtering, image resizing, and AWGN, respectively. Moreover, a larger confidence always results in adversarial examples with higher transferability and larger image distortion, and different degrees of transferability are obtained with different attacks. Among the four attacks, the C&W attack always achieves the highest PSNR for similar values of ASR\(_{\text {TN}}\). A noticeable observation regards the MI-FGSM attack. While Dong et al. (2018) report that MI-FGSM improves the transferability of attacks against computer vision networks, a similar improvement of ASR\(_{\text {TN}}\) (w.r.t. I-FGSM) is not observed in Table 16.6. The reason could be that the gradients between subsequent iterations are highly correlated in image forensic models, and thus the advantage of the gradient stabilization sought by MI-FGSM is reduced.

Table 16.7 Attacks transferability for cross-training-set mismatch (SN: \(\phi _{\text {BSnet}}^{\text {R}}\), TN: \(\phi _{\text {BSnet}}^{\text {V}}\)). The ASR\({_\text {TN}}\) (%) and the average PSNR (dB) of adversarial examples is reported

Considering cross-training-set transferability, the results for the cases of median filtering, image resizing, and AWGN addition are shown in Table 16.7. The transferability of adversarial examples from BSnet trained on RAISE to the same architecture trained on VISION is reported. According to the table, increasing the confidence always helps to improve the transferability of the adversarial examples. Moreover, although the transferability is improved in all the cases, transferring adversarial examples for the case of image resizing detection still tends to be more difficult, as a larger image distortion is needed to achieve a transferability comparable to that obtained for median filtering and AWGN addition.

Table 16.8 Attacks transferability for cross-network-and-training-set mismatch (SN = \(\phi _{\text {BSnet}}^{\text {V}}\), TN = \(\phi _{\text {BC+net}}^{\text {R}}\)), including ASR\({_\text {TN}}\) (%) and average PSNR (dB) of adversarial samples

Finally, we consider the strongest mismatch case, i.e., cross-network-and-training-set mismatch, in the case of median filtering and image resizing detection. Only the attack transferability with \(c=0\) and that with the largest confidences are reported in Table 16.8. A higher transferability is achieved (with a good image quality) for the case of median filtering, while adversarial examples targeting image resizing detection are less transferable.

To summarize, in contrast to computer vision applications, using adversarial examples for counter-forensics in a black-box or gray-box scenario requires that proper measures are taken to ensure the transferability of the attacks (Li et al. 2020). As we have shown by referring to the method proposed in Li et al. (2020), however, attack transferability can be ensured in a wide variety of settings, the price to pay being a slight increase of the distortion introduced by the attack, thus calling for the development of suitable defences.

4 Defenses

As a response to the threats posed by the existence of adversarial examples and by the ease with which they can be crafted, many defence mechanisms have been proposed to develop CNN-based forensic methods that can work in adversarial conditions. Generally speaking, defences can work in a reactive or proactive manner. In the first case, the defence mechanisms are designed to work in conjunction with the original CNN algorithm \(\phi \), e.g., by revealing whether an adversarial example is being fed to the CNN by means of a dedicated detector, or by mitigating the (possibly present) adversarial perturbation through the application of some input transformations before feeding the image to the CNN (this latter approach has been tested in the general DL literature, e.g., Xu et al. (2017), but has not been adopted as a defence strategy in forensic applications). The second branch of defences aims at building more secure CNN models, either from scratch or by properly modifying the original models. The large majority of the methods developed in the ML and DL-based forensic literature belongs to this category. Among the most popular approaches, we mention adversarial training, multiple classification, and detector randomization. With regard to detector randomization, we refer to methods wherein the detector is built by including within it some random elements, thus qualifying this approach as a proactive one. This contrasts with randomization techniques commonly adopted in the DL literature, like, for instance, stochastic activation pruning (Dhillon et al. 2018), that randomly drops some neurons of the CNN during the forward pass.

In the following, we review some of the methods developed in the forensic literature to defend against adversarial examples.

4.1 Detect Then Defend

The most common defence approach in early anti-counter-forensic works consisted in performing adversarial example detection to rule out adversarial images, followed by the application of standard (unaware) detectors for the analysis of the samples deemed to be benign.

The analyst, aware of the counter-forensic method the system may be subject to, develops a new detector \(\phi _A\) capable of exposing the attacked images by looking for the traces left by the counter-forensic algorithm. This goal is achieved by resorting to new, tailored, features. The new algorithm \(\phi _A\) is explicitly designed to reveal whether the image underwent a certain attack or not. If this is the case, the analyst may refuse the image or try to clean it to remove the effect of the attack. In this kind of approach, the algorithm \(\phi _A\) is used in conjunction with the original, unaware, algorithm \(\phi \). Such a view is adopted in Valenzise et al. (2013); Zeng et al. (2014) to address the adversarial detection of JPEG compression and median filtering, when the attacker tries to hinder the detection by removing the corresponding traces. Among the examples for the case of model-based analysis, we mention the algorithm in Costanzo et al. (2014) countering keypoint removal and injection attacks against copy-move detectors. The “detect then defend” approach has also been adopted recently in the case of CNN-based forensic detectors, to defend against adversarial examples. For instance, Carrara et al. (2018) propose to tell apart adversarial examples from benign images by looking at their behavior in the feature space. The method focuses on the analysis of the trajectory of the internal representations of the network (i.e., the activations of the neurons of the hidden layers), arguing that, for adversarial inputs, the representations follow a different evolution with respect to the case of genuine (non-adversarial) inputs. Detection is achieved by defining a feature distance to encode these trajectories into a fixed-length sequence, used to train a Long Short-Term Memory (LSTM) neural network to discern adversarial inputs from genuine ones.

Modeling normal data distributions for the activations has also been considered in the general DL literature to reveal abnormal behavior of adversarial examples (Tao et al. 2018).

It goes without saying that this kind of defence can be easily circumvented if the attacker has some information about the method used by the analyst to expose the attack, thus entering a never-ending loop wherein attacks and forensic algorithms are iteratively developed.

4.2 Adversarial Training

A simple, yet effective, approach to improve the robustness of machine learning classifiers against adversarial attacks is adversarial training. Adversarial training consists in augmenting the training dataset with examples of adversarial images. Such an approach implicitly assumes that the attack algorithm is known to the analyst and that the attack is not carried out on the retrained version of the detector. This identifies adversarial training as a white-box defense carried out against a gray-box attack.

Let \(\mathcal {D}_A\) be the set of attacked images used for training. The adversarially trained classifier is \(\phi (\cdot , \mathcal {T}; \mathcal {D}_{tr} \cup \mathcal {D}_A)\), where \(\mathcal {T} = \{t_1, t_2,\ldots \}\) denotes the set of all the hyper-parameters (i.e., the non-trainable parameters) of the network, including, for instance, the type of algorithm, its structure, the loss function, the internal parameters, the training procedure, and so on. Adversarial training has been widely adopted in the general DL literature to improve the robustness of DL classifiers against adversarial examples (Goodfellow et al. 2015; Madry et al. 2018). DL-based forensics is no exception, with several approaches resorting to adversarial training to defend against adversarial examples (see, for instance, Schöttle et al. 2018; Zhao et al. 2018). In the machine-learning-based forensic literature, JPEG-aware training is often considered to achieve robustness against JPEG compression. The interest in JPEG compression is motivated by the fact that JPEG compression is a common post-processing operation applied to images, either innocently (e.g., when uploading images on a social network) or maliciously (in some cases, in fact, JPEG compression can be considered as a simple and effective laundering attack (Barni et al. 2019)). The algorithms, based on Support Vector Machines (SVMs), proposed in Barni et al. (2017, 2016) for double JPEG detection and in Boroumand and Fridrich (2017) for a variety of manipulation detection problems are examples of this approach. Several examples have also been proposed more recently in CNN-based forensics, e.g., for camera model identification (Marra et al. 2018), contrast adjustment detection (Barni et al. 2018), and GAN detection (Barni et al. 2020; Nataraj et al. 2019).
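
The basic adversarial training loop can be sketched as follows; this is a minimal illustration, assuming a PyTorch classifier `model`, a data `loader` yielding (x, y) batches, and the fgsm() routine sketched in Sect. 16.2.1 as the attack used for augmentation (any of the attacks described earlier could be used instead). Parameter values are illustrative.

```python
import torch
import torch.nn.functional as F

def adversarial_training(model, loader, fgsm, epochs=10, eps=0.02, lr=1e-4):
    """Augment each mini-batch with its adversarially perturbed copy and
    train on the union of clean and attacked samples."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            model.eval()
            x_adv = fgsm(model, x, y, eps)      # attack the current model
            model.train()
            x_aug = torch.cat([x, x_adv], dim=0)
            y_aug = torch.cat([y, y], dim=0)
            loss = F.cross_entropy(model(x_aug), y_aug)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```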

Resizing is another common processing operation applied to images, either innocently or maliciously. In a splicing scenario, for instance, image resizing can be applied to delete the traces of the splicing operation. Accordingly, a resizing-aware detector can be trained to detect splicing in the presence of resizing. From a different perspective, resizing can also be used as a defence against adversarial perturbations. Like other geometric transformations, resizing can be applied to the input images before testing, so as to disable the effectiveness of the adversarial perturbations (He et al. 2017).

4.3 Detector Randomization

Detector randomization is another defense strategy that has been proposed to make life difficult for the attacker. With regard to computer vision applications, many randomization-based approaches have been proposed to hinder the transferability of adversarial examples in gray-box or black-box scenarios, and hence improve the security of the applications (Xie et al. 2018; Dhillon et al. 2018; Taran et al. 2019; Liu et al. 2018).

A similar approach can also be employed for securing image forensics detectors. The method proposed in Chen et al. (2019) applies randomization in the feature space by randomly selecting a subset of features to be used by the subsequent forensic detector. Randomization is based on a secret key, unknown to the attacker, who, then, cannot gain full knowledge of the to-be-attacked system (thus being forced to operate in a gray-box attack scenario). In Chen et al. (2019), the effectiveness of random feature selection (RFS) is proven theoretically, under some simplifying assumptions, and verified experimentally for two detectors based on support vector machines.

The approach proposed in Chen et al. (2019) has been extended to detectors based on deep learning by means of a technique called Random Deep Feature Selection (RDFS) (Barni et al. 2020). In detail, a CNN-based forensic detector is first divided into two parts, i.e., the convolutional layers, playing the role of feature extractor, and the fully connected layers, to be regarded as the final detector. Let N denote the number of features extracted by the convolutional layers. To train a secure detector, a subset of K features is randomly selected according to a secret key. The fully connected layers are then retrained based on the selected K features. To maintain a good classification accuracy, the same model with the same secret key is applied during the training and the testing phases. In this case, the convolutional layers are only used for feature extraction and are frozen in the retraining phase to keep the feature space unchanged. A minimal sketch of the RDFS construction is given below.
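
The sketch below illustrates the RDFS construction under simplifying assumptions: `conv_part` stands for the frozen convolutional portion of an already trained detector, `n_features` is the length N of the flattened deep feature vector, and the size of the fully connected head is illustrative rather than taken from the networks used in this chapter.

```python
import torch
import torch.nn as nn

def make_rdfs_detector(conv_part, n_features, K, secret_key, n_classes=2):
    # Key-dependent random selection of K out of N deep features.
    g = torch.Generator().manual_seed(secret_key)
    idx = torch.randperm(n_features, generator=g)[:K]

    class RDFSDetector(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = conv_part            # frozen feature extractor
            for p in self.features.parameters():
                p.requires_grad = False
            self.register_buffer("idx", idx)     # secret-key-dependent indices
            self.head = nn.Sequential(           # FC layers to be retrained
                nn.Linear(K, 64), nn.ReLU(), nn.Linear(64, n_classes))

        def forward(self, x):
            with torch.no_grad():                # keep the feature space unchanged
                f = torch.flatten(self.features(x), 1)
            return self.head(f[:, self.idx])

    return RDFSDetector()
```

Only the parameters of the fully connected head are updated during retraining; the same secret key must be used at training and test time so that the selected feature subset is the same.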

Assuming that the attacker has no knowledge of the RDFS strategy, he will target the original CNN-based model during his attack and the effectiveness of the randomized network will ultimately depend on the transferability of the attack.

The effectiveness of RDFS has been evaluated in Barni et al. (2020). Specifically, the experiments were conducted based on the RAISE dataset (Dang-Nguyen et al. 2021). Two different network architectures (BSnet (Bayar and Stamm 2016) and BC+net (Barni et al. 2018)) are utilized to detect three different image manipulations, namely, median filtering, image resizing, and adaptive histogram equalization (CL-AHE).

To build the original models based on BSnet, for each class, 100,000 patches were considered for training, and 3000 and 10,000 patches were considered for validation and testing, respectively. As to the models based on BC+net, 500,000 patches per class were used for training, 5000 for validation, and 10,000 for testing. For both networks, the input patch size was set to \(64\times 64\), and the batch size was set to 32. The Adam optimizer with a learning rate of \(10^{-4}\) was used for training. The BSnet models were trained for 40 epochs and the BC+net models for 4 epochs. For BSnet, the classification accuracies of the models are 98.83, 91.30, and 90.45% for median filtering, image resizing, and CL-AHE, respectively. The corresponding accuracy values are 99.73, 95.05, and 98.30% for BC+net.

To build the RDFS-based detectors, first, a subset of the original training set with 20,000 patches (per class) was randomly selected. Then, for each model, the FC layers were retrained based on K features randomly selected from the full feature set. The Adam optimizer with a learning rate of \(10^{-5}\) was used. An early stopping condition was adopted with a maximum number of 50 epochs, the training process being stopped when the validation loss changed by less than \(10^{-3}\) over 5 epochs. The values of K considered in the experiments are \(K \in \{5,10,30,50,200,400\}\), as well as the full feature set with \(K=N\). For each K, 50 models trained on K randomly selected features were utilized in the experiments.

Three popular attacks were applied to generate the adversarial examples, namely L-BFGS, I-FGSM, and PGD, all implemented with Foolbox (Rauber et al. 2017). The detailed parameters of the attacks are given below. For L-BFGS, the default parameters were used. I-FGSM was run with 10 iterations, and the best attack strength was found by scanning the range from 0 to 0.1 with a fine step. For PGD, a binary search starting from \(\varepsilon =0.3\) and \(\alpha =0.05\) was carried out to find the optimal parameters; the only exception is the detection of CL-AHE, for which the parameters were fixed to \(\varepsilon =0.01\) and \(\alpha =0.025\) without binary search.
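As an example, the snippet below sketches how I-FGSM and PGD attacks against a PyTorch detector could be launched with the Foolbox 3 API (L-BFGS is omitted for brevity); `model`, `images`, `labels`, and the epsilon grid are placeholders, and the exact settings used in Barni et al. (2020) may differ.

```python
import foolbox as fb

# model: a trained PyTorch detector in eval mode; images, labels: a batch of test
# patches in [0, 1] and their ground-truth classes (placeholders, not the actual data).
fmodel = fb.PyTorchModel(model, bounds=(0, 1))

# I-FGSM (basic iterative method) with 10 iterations, scanning the attack strength
ifgsm = fb.attacks.LinfBasicIterativeAttack(steps=10)
epsilons = [i / 1000 for i in range(1, 101)]          # illustrative grid up to 0.1
_, adv_ifgsm, success_ifgsm = ifgsm(fmodel, images, labels, epsilons=epsilons)

# PGD with a fixed budget; a binary search over epsilon would simply wrap this call
pgd = fb.attacks.LinfPGD(abs_stepsize=0.05, steps=40)
_, adv_pgd, success_pgd = pgd(fmodel, images, labels, epsilons=[0.3])
```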

In the absence of attacks, the classification performance of the models trained on K features was evaluated on 4,000 patches per class, while the performance under attack was evaluated on 500 manipulated images.

Table 16.9 Accuracy (%) of the RDFS detector based on BSnet. ‘No’ corresponds to the absence of attacks
Table 16.10 Accuracy (%) of the RDFS detector based on BC+net. ‘No’ corresponds to the absence of attacks

Tables 16.9 and 16.10 report the classification accuracy of BSnet and BC+net on the three manipulations. The RDFS strategy helps hinder the transferability of adversarial examples on both networks, at the expense of a slight decrease of the classification accuracy on clean images (in the absence of attacks). For BSnet, as K decreases, the detection accuracy on adversarial examples increases: the gain reaches 20–30% for \(K=30\) and 30–50% for \(K=10\), while the accuracy on clean images decreases by only 2–4% in these two cases. A similar behavior can be observed for BC+net in Table 16.10. In some cases, the detection accuracy for \(K=N\) is already rather high, meaning that the adversarial examples built on the original CNN do not transfer to a new detector trained on the full feature set. This confirms the observation that adversarial examples transfer poorly in the context of image forensics.

Further research has been carried out to evaluate the effectiveness of RDFS against increased-confidence adversarial examples (Li et al. 2020). Both the original CNNs and the RDFS detectors of Barni et al. (2020) were tested. Two attacks, namely the C&W attack and PGD, were used to generate adversarial examples with confidence value c. Similarly to Li et al. (2020), different confidence values were chosen for the different source networks (SN). The detection accuracies of the RDFS detectors for the median filtering and image resizing tasks in the presence of increased-confidence adversarial examples are shown in Tables 16.11 and 16.12. It can be observed that decreasing K also helps improve the defense performance of the RDFS detectors against increased-confidence adversarial examples. For example, in the case of BSnet for median filtering detection, for adversarial examples with \(c=10\) generated with the C&W attack, decreasing K from N to 10 improves the detection accuracy by 34.5%. However, as the confidence increases, the attack becomes stronger and the detection accuracy drops to very low values, even for small values of K.

Table 16.11 Detection accuracy (%) of the RDFS detector (based on BSnet) on increased-confidence adversarial examples
Table 16.12 Detection accuracy (%) of the RDFS detector (based on the BC+net) on increased-confidence adversarial examples
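To clarify how the confidence value c enters such attacks, the following is a minimal PyTorch sketch of a C&W-style margin loss: the attack is considered successful only when the logit of the target class exceeds the best competing logit by at least c. The function name and tensor conventions are our own, not the exact formulation used in the cited works.

```python
import torch

def cw_confidence_loss(logits: torch.Tensor, target: torch.Tensor, c: float) -> torch.Tensor:
    """Margin loss: zero only when the target logit beats every other logit by >= c."""
    target_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, target.unsqueeze(1), float("-inf"))   # mask out the target class
    best_other = others.max(dim=1).values
    return torch.clamp(best_other - target_logit + c, min=0).sum()
```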

4.4 Multiple-Classifier Architectures

Another possibility to combat counter-forensic attacks is to resort to intrinsically more secure architectures. This approach has been widely considered in the general ML security literature and in ML-based forensics; however, only a few methods follow this path in the case of CNNs.

Among the approaches developed for general ML applications, we mention Biggio et al. (2015), where a multiple-classifier architecture referred to as a one-and-a-half-class classifier is proposed. The architecture consists of a two-class classifier and two one-class classifiers run in parallel, followed by a final one-class classifier. It has been shown that, when properly trained, this architecture can effectively limit the damage caused by an attacker with perfect knowledge. The one-and-a-half-class architecture has also been considered in forensic applications to improve the security of SVM-based detectors (Barni et al. 2020). In particular, considering several manipulation detection tasks, the authors of Barni et al. (2020) showed that the one-and-a-half-class architecture can outperform two-class architectures in terms of security against white-box attacks, for a fixed level of robustness. The effectiveness of such an approach for CNN-based classifiers has not been investigated yet.
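As a rough illustration of the one-and-a-half-class idea (not the exact construction of Biggio et al. 2015), a two-class SVM and two one-class SVMs can be run in parallel and their scores fed to a final one-class combiner; `X`, `y`, and `X_test` are placeholder training and test features.

```python
import numpy as np
from sklearn.svm import SVC, OneClassSVM

two_class = SVC().fit(X, y)                               # standard two-class detector
occ_pristine = OneClassSVM(gamma="scale").fit(X[y == 0])  # models the pristine class
occ_forged = OneClassSVM(gamma="scale").fit(X[y == 1])    # models the manipulated class

def scores(Z):
    """Stack the three parallel decision scores into a 3-D meta-feature vector."""
    return np.column_stack([two_class.decision_function(Z),
                            occ_pristine.decision_function(Z),
                            occ_forged.decision_function(Z)])

# Final one-class classifier operating on the score vectors of one (here, the forged) class
combiner = OneClassSVM(gamma="scale").fit(scores(X[y == 1]))
y_pred = combiner.predict(scores(X_test))   # +1: accepted as manipulated, -1: rejected
```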

Another simple way to design a more secure classifier consists in building an ensemble of individual algorithms by means of suitable combination strategies, as proposed in Strauss et al. (2017) in the general DL literature. A similar approach has been considered in Fontani et al. (2014) for ML-based forensics, where the authors propose to improve the robustness of the decision by fusing the outputs of several forensic algorithms looking for different traces. The method has been assessed on the representative forensic task of splicing detection in JPEG images. Also in this case, the effectiveness of such an approach for CNN-based forensic applications still has to be investigated.
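As a toy example of output fusion (a deliberately simpler rule than the framework of Fontani et al. 2014), the soft scores of detectors looking for different traces can be combined by a weighted average; all names below are illustrative.

```python
import numpy as np

def fuse_detectors(scores, weights=None, threshold=0.5):
    """Fuse soft outputs in [0, 1] from several forensic detectors, each targeting a
    different trace, via a weighted average; returns the fused score and the decision."""
    scores = np.asarray(scores, dtype=float)
    w = np.ones_like(scores) if weights is None else np.asarray(weights, dtype=float)
    fused = float(np.dot(w, scores) / w.sum())
    return fused, fused > threshold
```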

Other approaches proposed in security-oriented applications of general ML-based classification consist in using multiple classifiers in conjunction with randomization. Randomness can pertain to the selection of the training samples of the individual classifiers, as in Breiman (1996), or to the selection of the features used by the classifiers, as in Ho (1998). Another strategy resorting to multiple classifiers and randomization has been proposed in Biggio et al. (2008) for spam-filtering applications, where the source of randomness lies in the choice of the weights assigned to the filtering modules of the individual classifiers.
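A sketch of the feature-randomization idea in the spirit of Ho (1998), combined with bagging as in Breiman (1996), can be obtained with off-the-shelf tools; the training data and hyper-parameters below are placeholders.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Each base SVM is trained on a bootstrap sample of the data (Breiman 1996) and on a
# random 30% subset of the features (random subspace method, Ho 1998).
ensemble = BaggingClassifier(
    SVC(),
    n_estimators=25,
    max_features=0.3,
    bootstrap=True,
    random_state=0,   # in a security setting, the seed would play the role of a secret key
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```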

Regarding the DL literature, an approach that goes in this direction is model switching (Wang et al. 2020), whereby the sub-models used at test time are randomly selected from a pool of trained sub-models.
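A minimal sketch of the model-switching idea (not the specific scheme of Wang et al. 2020) is shown below: at each query, the answer is produced by a sub-model drawn at random from a pool of independently trained networks.

```python
import random
import torch

class ModelSwitcher:
    """At each query, pick one sub-model at random from a pool of trained networks."""
    def __init__(self, models, seed=None):
        self.models = list(models)
        self.rng = random.Random(seed)   # the seed can act as a secret key

    @torch.no_grad()
    def predict(self, x: torch.Tensor) -> torch.Tensor:
        model = self.rng.choice(self.models)
        return model(x).argmax(dim=1)
```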

All considered, combining multiple classifiers with randomization in an attempt to improve the security of DL-based forensic classifiers has not been studied much and is worthy of further investigation.

5 Final Remarks

Deep learning architectures provide new powerful tools, enriching the toolbox available to image forensics designers. At the same time, they introduce new vulnerabilities due to some inherent security weaknesses of DL techniques. In this chapter, we have reviewed the threats posed by adversarial examples and their possible use for counter-forensic purposes. Even if adversarial examples targeting image forensics networks tend to be less transferable than those created for computer vision applications, we have shown that, by properly increasing the strength of the attacks, a transferability level sufficient for practical applications can be reached. We have also presented some possible remedies against attacks based on adversarial examples, even if a definitive solution capable of preventing such attacks in the most challenging white-box scenario has not been found yet. We are convinced that, together with the need for image forensic techniques that can be applied outside controlled laboratory settings, robustness against intentional attacks is one of the most pressing requirements if image forensics is to be deployed in real-life applications, contributing to restoring the credibility of digital media.