Since the advent of smartphones, photography is increasingly being done with small, portable, multi-function devices. Relative to the purpose-built cameras that dominated previous eras, smartphone cameras must overcome challenges related to their small form factor. Smartphone cameras have small apertures that produce a wide depth of field, small sensors with rolling shutters that lead to motion artifacts, and small form factors which lead to more camera shake during exposure. Along with these challenges, smartphone cameras have the advantage of tight integration with additional sensors and the availability of significant computational resources. For these reasons, the field of computational imaging has advanced significantly in recent years, with academic groups and researchers from smartphone manufacturers helping these devices become more capable replacements for purpose-built cameras.

1 Introduction to Computational Imaging

Computational imaging (or computational photography) approaches are characterized by the co-optimization of what light the sensor captures, how it captures it, and how that signal is processed in software. Computational imaging approaches now commonly available on most smartphones include panoramic stitching, multi-frame high dynamic range (HDR) imaging, ‘portrait mode’ for shallow depth of field, and multi-frame low-light imaging (‘night sight’ or ‘night mode’).

Because image quality is viewed as a differentiating feature between competing smartphone brands, there has been tremendous progress in improving subjective quality, accompanied by a lack of transparency due to the proprietary nature of the work. Like the smartphone market more generally, the computational imaging approaches used therein change very quickly from one generation of a phone to the next. This combination of different modes, and the proprietary and rapid nature of changes, poses challenges for forensics practitioners.

This chapter investigates some of the forensics challenges that arise from the increased adoption of computational imaging and assesses our ability to detect automated focus manipulations performed by computational cameras. We then look more generally at a cue to help distinguish optical blur from synthetic blur. The chapter concludes with a look at some early computational imaging research that may impact forensics in the near future.

This chapter will not focus on the definitional and policy issues related to computational imagery; we believe these issues are best addressed within the context of a specific forensic application. The use of an automated, aesthetically driven computational imaging mode does not necessarily imply nefarious intent on the part of the photographer, but it may in some cases. This suggests that broad-scale screening for manipulated imagery (e.g., on social media platforms) might not target portrait mode images for detection and further scrutiny, whereas in the context of an insurance claim or a court proceeding, higher scrutiny may be warranted.

As with the increase in the number, complexity, and ease of use of software packages for image manipulation, a key forensic challenge presented by computational imaging is the degree to which it democratizes the creation of manipulated media. Prior to these developments, creating a convincing manipulation required a knowledgeable user dedicating significant time and effort with only partially automated tools. Much like Kodak cameras greatly simplified consumer photography with the motto “You Press the Button, We Do the Rest,” computational cameras allow users to create—with the push of a button—imagery that is inconsistent with the classically understood physical limitations of the camera. Realizing that most people won’t carry around a heavy camera with a large lens, a common goal of computational photography is to replicate the aesthetics of Digital Single Lens Reflex (DSLR) imagery using the much smaller and lighter sensors and optics of mobile phones. Accordingly, one of the significant achievements of computational imaging is the ability to replicate the shallow depth of field of a large-aperture DSLR lens on a smartphone with a tiny aperture.

2 Automation of Geometrically Correct Synthetic Blur

Optical blur is a perceptual cue to depth (Pentland 1987), a limiting factor in the performance of computer vision systems (Bourlai 2016), and an aesthetic tool that photographers use to separate foreground and background parts of a scene. Because smartphone sensors and apertures are so small, images taken by mobile phones often appear to be in focus everywhere, including background elements that distract a viewer’s attention. To draw viewers’ perceptual attention to a particular object and avoid distractions from background objects, smartphones now include a ‘portrait mode’ which automatically manipulates the sensor image via the application of spatially varying blur. Figure 3.1 shows an example of the all-in-focus image natively captured by the sensor and the resulting portrait mode image. Because there is a geometric relationship between optical blur and depth in the scene, the creation of geometrically correct synthetic blur requires knowledge of the 3D scene structure. Consistent with the spirit of computational cameras, portrait modes achieve this with a combination of hardware and software. Hardware acquires the scene in 3D using either stereo cameras or sensors with the angular sensitivity of a light field camera (Ng 2006). Image processing software uses this 3D information to infer and apply the correct amount of blur for each part of the scene. The result is automated, geometrically correct optical blur that looks very convincing. A natural question, then, is whether we can accurately differentiate ‘portrait mode’ imagery from genuine optical blur.

Fig. 3.1
figure 1

(Left) For a close-in scene captured by a smartphone with a small aperture, all parts of the image will appear in focus. (Right) In order to mimic the aesthetically pleasing properties of a DSLR with a wide aperture, computational cameras now include a ‘portrait mode’ that blurs the background so the foreground stands out better

2.1 Primary Cue: Image Noise

Regardless of how the local blurring is implemented, the key difference between optical blur and portrait mode-type processing can be found in image noise. When blur happens optically, before photons reach the sensor, only small signal-dependent noise impacts are observed. When blur is applied algorithmically to an already digitized image, however, the smoothing or filtering operation also implicitly de-noises the image. Since the amount of de-noising is proportional to the amount of local smoothing or blurring, differences in the amount of algorithmic local blur can be detected via inconsistencies between the local intensity and noise level. Two regions of the image having approximately the same intensity should also have approximately the same level of noise. If one region is blurred more than the other, or one is blurred while the other is not, an inconsistency is introduced between the intensities and local noise levels.

For our noise analysis, we extend the combined noise models of Tsin et al. (2001) and Liu et al. (2006). Ideally, a pixel produces a number of electrons \(E_{num}\) proportional to the average irradiance from the object being imaged. However, shot noise \(N_S\), a result of the quantum nature of light, captures the uncertainty in the number of electrons stored at a collection site; \(N_S\) can be modeled as Poisson noise. Additionally, site-to-site non-uniformities called fixed pattern noise K are a multiplicative factor impacting the number of electrons; K can be characterized as having mean 1 and a small spatial variance \(\sigma ^2_K\) over all of the collection sites. Thermal energy in silicon generates free electrons which contribute dark current to the image; this is modeled as an additive factor \(N_{DC}\), treated as Gaussian noise. The on-chip output amplifier sequentially transforms the charge collected at each site into a measurable voltage with a scale A, and the amplifier generates zero-mean read-out noise \(N_R\) with variance \(\sigma ^2_R\). Demosaicing is applied in color cameras to interpolate two of the three colors at each pixel, and it introduces an error that is sometimes modeled as noise. After this, the camera response function (CRF) \(f(\cdot )\) maps the voltage via a non-linear transform to improve perceptual image quality. Lastly, the analog-to-digital converter (ADC) approximates the analog voltage as an integer multiple of a quantization step q; the quantization error can be modeled as an additive noise source \(N_Q\).

With these noise sources in mind, we can describe a digitized 2D image as follows:

$$\begin{aligned} D(x,y)&= f \Big ( (K(x,y)E_{num}(x,y) + N_{DC}(x,y)+ N_S(x,y) + N_R(x,y))A \Big ) \nonumber \\&+ N_Q(x,y) \end{aligned}$$
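As a concrete illustration, the formation model above can be simulated directly. The sketch below (Python/NumPy) draws each noise source and passes the result through an assumed gamma-curve CRF; all numeric constants (gain, dark current mean, read noise, quantization step) are illustrative assumptions, not measured camera parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def capture(E_num, A=1e-4, sigma_K=0.02, dc_mean=5.0, sigma_R=2.0,
            q=1/255, gamma=1/2.2):
    """Simulate D = f((K*E_num + N_DC + N_S + N_R) * A) + N_Q.
    All constants are illustrative assumptions, not measured camera values."""
    K = 1.0 + sigma_K * rng.standard_normal(E_num.shape)   # fixed pattern noise (mean 1)
    N_S = rng.poisson(E_num) - E_num                       # zero-mean shot noise
    N_DC = dc_mean + rng.standard_normal(E_num.shape)      # dark current
    N_R = sigma_R * rng.standard_normal(E_num.shape)       # read-out noise
    voltage = (K * E_num + N_S + N_DC + N_R) * A           # amplifier gain A
    response = np.clip(voltage, 0.0, 1.0) ** gamma         # assumed gamma-curve CRF f(.)
    return np.round(response / q) * q                      # ADC quantization (N_Q)

img = capture(np.full((64, 64), 4000.0))                   # flat field, 4000 electrons
```

Running this on a flat field produces the signal-dependent, CRF-shaped noise analyzed below.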

The variance of the noise is given by

$$\begin{aligned} \sigma ^2_N(x,y) = \big ( f' \big )^2 A^2 \big ( K(x,y)E_{num}(x,y) + \text {E}[N_{DC}(x,y)] + \sigma ^2_R \big ) + \frac{q^2}{12} \end{aligned}$$

where \(\text {E}[\cdot ]\) is the expectation function. This equation tells us two things which are typically overlooked in the more simplistic model of noise as an additive Gaussian source:

  1. The noise variance’s relationship with intensity reveals the shape of the CRF’s derivative \(f'\).

  2. Noise has a signal-dependent aspect to it, as evidenced by the \(E_{num}\) term.
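Both observations can be checked numerically. The sketch below, an illustration under an assumed gamma CRF with shot noise only, compares the empirical noise standard deviation of flat fields at several exposure levels against the first-order prediction \(f'(AE)\,A\sqrt{E}\).

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 1 / 2.2        # assumed CRF: f(v) = v**gamma
A = 1e-4               # assumed amplifier gain

def measured_std(E_num, n=200_000):
    """Empirical intensity std of a flat field with E_num expected electrons
    (shot noise only, for clarity)."""
    electrons = rng.poisson(E_num, size=n).astype(float)
    return (np.clip(electrons * A, 0, 1) ** gamma).std()

def predicted_std(E_num):
    """First-order propagation: f'(A*E) * A * sqrt(E), with f'(v) = gamma*v**(gamma-1)."""
    v = A * E_num
    return gamma * v ** (gamma - 1) * A * np.sqrt(E_num)

levels = [500, 2000, 8000]
emp = [measured_std(E) for E in levels]
pred = [predicted_std(E) for E in levels]
```

Under this CRF the measured noise std is larger in the shadows than in the highlights, even though the shot noise itself grows with exposure, exactly the intensity dependence the two points above describe.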

An important corollary to this is that different levels of noise in regions of an image having different intensities is not per se an indicator of manipulation, though it has been taken as one in past work (Mahdian and Saic 2009). We show in our experiments that, while the noise inconsistency cue from Mahdian and Saic (2009) has some predictive power in detecting manipulations, a proper accounting for signal-dependent noise via its relationship with image intensity significantly improves accuracy.

Measuring noise in an image is, of course, ill-posed and is equivalent to the long-standing image de-noising problem. For this reason, we leverage three different approximations of local noise, measured over approximately uniform image regions: intensity variance, intensity gradient magnitude, and the noise feature of Mahdian and Saic (2009) (abbreviated NOI). Each of these is related to the image intensity of the corresponding region via a 2D histogram. This step translates subtle statistical relationships in the image to shape features in the 2D histograms which can be classified by a neural network. As we show in the experiments, our detection performance on histogram features significantly improves on that of popular approaches applied directly to the pixels of the image.
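A minimal sketch of two of these noise proxies, and of the fact that smoothing implicitly de-noises, might look as follows; the block size and the 3×3 mean filter are illustrative choices, not the exact measurements used in our pipeline.

```python
import numpy as np

def box3(img):
    """3x3 mean filter (a stand-in for any local smoothing/blur)."""
    P = np.pad(img, 1, mode='edge')
    return sum(P[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def noise_proxies(img, block=8):
    """Per-block (mean intensity, intensity variance, mean gradient magnitude).
    The NOI feature of Mahdian and Saic would be the third proxy in practice."""
    gy, gx = np.gradient(img)
    grad = np.hypot(gx, gy)
    h, w = img.shape
    out = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            p = img[y:y + block, x:x + block]
            g = grad[y:y + block, x:x + block]
            out.append((p.mean(), p.var(), g.mean()))
    return np.array(out)

rng = np.random.default_rng(2)
flat = np.clip(0.5 + 0.02 * rng.standard_normal((128, 128)), 0, 1)
stats_orig = noise_proxies(flat)
stats_blur = noise_proxies(box3(flat))   # blurring implicitly de-noises
```

Comparing the proxy columns before and after smoothing shows the drop in local noise level that the 2D intensity-vs-noise histograms are designed to expose.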

2.2 Additional Photo Forensic Cues

One of the key challenges in forensic analysis of images ‘in the wild’ is that compression and other post-processing may overwhelm subtle forgery cues. Indeed, noise features are inherently sensitive to compression which, like blur, smooths the image. In order to improve detection performance in such challenging cases, we incorporate additional forensic cues which improve our method’s robustness. Some portrait mode implementations appear to operate on a JPEG image as input, meaning that the outputs exhibit cues related to double JPEG compression. As such, there is a range of different cues that can reveal manipulations in a subset of the data.

2.2.1 Demosaicing Artifacts

Forensic researchers have shown that differences among demosaicing algorithms and among the physical color filter arrays bonded to sensors can be detected from the image. Since focus manipulations are applied to the demosaiced image, the local smoothing operations will alter these subtle Color Filter Array (CFA) demosaicing artifacts. In particular, the absence of CFA artifacts or the detection of weak, spatially varying CFA artifacts indicates the presence of global or local tampering, respectively.

Following the method of Dirik and Memon (2009), we take the demosaicing scheme \(f_d\) to be bilinear interpolation. We divide the image into \(W \times W\) sub-blocks and compute the demosaicing feature only at non-smooth blocks of pixels. Denote each non-smooth block as \(B_i\), where \(i = 1,\dots ,m_B\) and \(m_B\) is the number of non-smooth blocks in the image. The re-interpolated version of the i-th sub-block under the k-th CFA pattern \(\theta _k\), \(k = 1,\dots ,4\), is \(\hat{B}_{i,k} = f_d(B_i,\theta _k)\). The MSE \(E_i(k,c)\), \(c \in \{R,G,B\}\), between \(B_i\) and \(\hat{B}_{i,k}\) is computed over the non-smooth regions of the image. We then define a metric estimating the uniformity of the normalized green-channel errors as

$$\begin{aligned} F&= \mathop {\text {median}}_i \bigg ( \sum _{k=1}^4 \Big | E_i^{(2)}(k) - 25 \Big | \bigg ) \nonumber \\ E_i^{(2)}(k)&= 100 \times \frac{E_i(k,2)}{\sum _{l=1}^4 E_i(l,2)} \nonumber \\ E_i(k,c)&= \frac{1}{W \times W} \sum _{x=1}^W \sum _{y=1}^W \Big ( B_i(x,y,c)-\hat{B}_{i,k}(x,y,c) \Big )^2 \end{aligned}$$
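To make the re-interpolation idea concrete, the following simplified sketch works on the green channel only and on the two possible green checkerboard phases; the full method re-interpolates all three channels under the four Bayer patterns.

```python
import numpy as np

def bilinear_green(G):
    """4-neighbour average used to fill non-green sites (bilinear demosaicing)."""
    P = np.pad(G, 1, mode='edge')
    return (P[:-2, 1:-1] + P[2:, 1:-1] + P[1:-1, :-2] + P[1:-1, 2:]) / 4.0

def green_reinterp_errors(G):
    """Re-interpolation MSE of a green-channel block under the two green
    checkerboard phases -- a green-only simplification of the CFA feature."""
    yy, xx = np.mgrid[0:G.shape[0], 0:G.shape[1]]
    interp = bilinear_green(G)
    errs = []
    for parity in (0, 1):
        green_site = ((yy + xx) % 2) == parity
        recon = np.where(green_site, G, interp)   # keep measured sites, re-interpolate others
        errs.append(((G - recon) ** 2).mean())
    return np.array(errs)

rng = np.random.default_rng(3)
raw = rng.random((16, 16))
yy, xx = np.mgrid[0:16, 0:16]
# Build a green channel demosaiced from parity-0 samples: the re-interpolation
# error should be far lower for the matching phase than the mismatched one.
G = np.where((yy + xx) % 2 == 0, raw, bilinear_green(raw))
E = green_reinterp_errors(G)
```

The strong imbalance between the two errors is the CFA signature; local blurring weakens or destroys it, which is what the CFA feature map measures.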

2.2.2 JPEG Artifacts

In some portrait mode implementations, such as the iPhone, the option to save both an original and a portrait mode image of the same scene suggests that post-processing is applied after JPEG compression. Importantly, both the original JPEG image and the processed version are saved in the JPEG format without resizing. Hence, Discrete Cosine Transform (DCT) coefficients representing un-modified areas will undergo two consecutive JPEG compressions and exhibit double quantization (DQ) artifacts, used extensively in the forensics literature. DCT coefficients of locally blurred areas, on the other hand, will result from non-consecutive compressions and will present weaker artifacts.

We follow the work of Bianchi et al. (2011) and use Bayesian inference to assign to each DCT coefficient a probability of being doubly quantized. Accumulated over each \(8 \times 8\) block of pixels, the DQ probability map allows us to distinguish original areas (having high DQ probability) from tampered areas (having low DQ probability). The probability of a block being tampered with can be estimated as

$$\begin{aligned} p&= 1 / \bigg ( \prod _{i|m_i \ne 0} \big ( R(m_i) - L(m_i) \big )*k_g(m_i) +1 \bigg ) \nonumber \\ R(m)&= Q_1 \Bigg ( \bigg \lceil \frac{Q_2}{Q_1} \bigg ( m - \frac{b}{Q_2} - \frac{1}{2} \bigg ) \bigg \rceil - \frac{1}{2} \bigg ) \nonumber \\ L(m)&= Q_1 \Bigg ( \bigg \lfloor \frac{Q_2}{Q_1} \bigg ( m - \frac{b}{Q_2} + \frac{1}{2} \bigg ) \bigg \rfloor + \frac{1}{2} \bigg ) \end{aligned}$$

where m is the value of the DCT coefficient; \(k_g(\cdot )\) is a Gaussian kernel with standard deviation \(\sigma _e/Q_2\); \(Q_1\) and \(Q_2\) are the quantization steps used in the first and second compression, respectively; and b is a bias term.
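A direct transcription of these formulas might look as follows; evaluating \(k_g\) pointwise at each coefficient is a simplification of the full Bianchi et al. model, and all parameter values are illustrative.

```python
import math

def R(m, Q1, Q2, b=0.0):
    """One interval bound from the text's R(m) formula."""
    return Q1 * (math.ceil((Q2 / Q1) * (m - b / Q2 - 0.5)) - 0.5)

def L(m, Q1, Q2, b=0.0):
    """The other interval bound, from the text's L(m) formula."""
    return Q1 * (math.floor((Q2 / Q1) * (m - b / Q2 + 0.5)) + 0.5)

def tamper_probability(coeffs, Q1, Q2, sigma_e=1.0, b=0.0):
    """p = 1 / (prod over nonzero m_i of (R(m_i) - L(m_i)) k_g(m_i) + 1).
    k_g is evaluated pointwise here -- a simplification of the full model."""
    s = sigma_e / Q2
    prod = 1.0
    for m in coeffs:
        if m != 0:
            kg = math.exp(-(m * m) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
            prod *= (R(m, Q1, Q2, b) - L(m, Q1, Q2, b)) * kg
    return 1.0 / (prod + 1.0)
```

In practice the per-coefficient probabilities are accumulated over each 8 × 8 block to form the DQ probability map described above.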

2.3 Focus Manipulation Detection

To summarize the analysis above, we adopt five types of features: color variance (VAR), image gradient (GRAD), double quantization (ADQ) (Bianchi et al. 2011), color filter artifacts (CFA) (Dirik and Memon 2009), and noise inconsistencies (NOI) (Mahdian and Saic 2009) for refocusing detection. Each of these features is computed densely at each location in the image, and Fig. 3.2 illustrates the magnitude of these features in a feature map for an authentic image (top row) and a portrait mode image (middle row). Though there are notable differences between the feature maps in these two rows, there is no clear indication of a manipulation except, perhaps, the ADQ feature. And, as mentioned above, the ADQ cue is fragile because it depends on whether blurring is applied after an initial compression.

Fig. 3.2
figure 2

Feature maps and histogram for authentic and manipulated images. On the first row are the authentic image feature maps; the second row shows the corresponding maps for the manipulated image. We show scatter plots relating the features to intensity in the third row, where blue sample points correspond to the authentic image, and red corresponds to a manipulated DoF image (which was taken with an iPhone)

As mentioned in Sect. 3.2.1, the noise cues are signal-dependent in the sense that blurring introduces an inconsistency between intensity and noise levels. To illustrate this, Fig. 3.2’s third row shows scatter plots of the relationship between intensity (on the horizontal axis) and the various features (on the vertical axis). In these plots, particularly the columns related to noise (Variance, Gradient, and NOI), the distinction between the statistics of the authentic image (blue symbols) and the manipulated image (red symbols) becomes quite clear. Noise is reduced in most of the manipulated image, though the un-modified foreground region (the red bowl) maintains relatively higher noise because it is not blurred. Note also that the noise levels across the manipulated image are actually more consistent than in the authentic image, showing that previous noise-based forensics (Mahdian and Saic 2009) are ineffective in this setting.

Fig. 3.3
figure 3

Manipulated refocusing image detection pipeline. The example shown is an iPhone7plus portrait mode image

Figure 3.3 shows our portrait mode detection pipeline, which incorporates these five features. In order to capture the relationship between individual features and the underlying image intensity, we employ an intensity versus feature bivariate histogram—which we call the focus manipulation inconsistency histogram (FMIH). We use FMIH for all five features for defocus forgery image detection, each of which is analyzed by a neural network called FMIHNet. These five classification results are combined by a majority voting scheme to determine a final classification label.

After densely computing the VAR, GRAD, ADQ, CFA, and NOI features for each input image (shown in the first five columns of the second row of Fig. 3.3), we partition the input image into superpixels and, for each superpixel \(i_{sp}\), we compute the mean \(F(i_{sp})\) of each feature measure and its mean intensity. Finally, we generate the FMIH for each of the five features, shown in the five columns of the third row of Fig. 3.3. Note that the FMIHs are flipped vertically with respect to the scatter plots shown in Fig. 3.2. A comparison of the FMIHs extracted from the same scene captured with different cameras is shown in Fig. 3.4.
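A minimal sketch of the FMIH construction and of the final vote is below; the random stand-in data, value ranges, and superpixel count are illustrative, and a real pipeline would feed in actual per-superpixel means for each of the five features.

```python
import numpy as np

def fmih(intensity_means, feature_means, bins=(101, 202),
         i_range=(0.0, 1.0), f_range=(0.0, 1.0)):
    """Bivariate intensity-vs-feature histogram over superpixel means.
    The 101 x 202 bin grid matches the FMIHNet input size; the value
    ranges are illustrative and would be fixed per feature in practice."""
    H, _, _ = np.histogram2d(intensity_means, feature_means, bins=bins,
                             range=[i_range, f_range])
    return H

def majority_vote(labels):
    """Fuse the five per-feature FMIHNet decisions (1 = manipulated)."""
    return int(sum(labels) > len(labels) / 2)

rng = np.random.default_rng(4)
n_sp = 500                                 # pretend the image has 500 superpixels
i_means = rng.random(n_sp)                 # mean intensity per superpixel
f_means = rng.random(n_sp)                 # mean feature value per superpixel
H = fmih(i_means, f_means)
label = majority_vote([1, 0, 1, 1, 0])     # hypothetical VAR/GRAD/ADQ/CFA/NOI votes
```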

Fig. 3.4
figure 4

Extracted FMIHs for the five feature measures with images captured using Canon60D, iPhone7Plus, and HuaweiMate9 cameras

Fig. 3.5
figure 5

Network architecture: FMIHNet1 for Var and CFA features; FMIHNet2 for Grad, ADQ, and NOI features

2.3.1 Network Architectures

We have designed FMIHNet, illustrated in Fig. 3.5, to classify the five histogram features. Our network is a VGG-style network (Simonyan and Zisserman 2014) consisting of convolutional (CONV) layers with small receptive fields (\(3 \times 3\)). During training, the input to FMIHNet is a fixed-size \(101 \times 202\) FMIH. FMIHNet is a fusion of two relatively deep sub-networks: FMIHNet1 with 20 CONV layers for the VAR and CFA features, and FMIHNet2 with 30 CONV layers for the GRAD, ADQ, and NOI features. The CONV stride is fixed to 1 pixel; the spatial padding of the input features is set to 24 pixels to preserve the spatial resolution. Spatial pooling is carried out by five max-pooling layers, performed over a \(2 \times 2\) pixel window with stride 2. A stack of CONV layers followed by one Fully Connected (FC) layer performs two-way classification. The final layer is a soft-max layer. All hidden layers have rectification (ReLU) non-linearity.

There are two reasons for the small \(3 \times 3\) receptive fields: first, incorporating multiple non-linear rectification layers instead of a single one makes the decision function more discriminative; secondly, this reduces the number of parameters. This can be seen as imposing a regularization on a later CONV layer by forcing it to have a decomposition through the \(3 \times 3\) filters.
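The parameter saving is easy to verify: a stack of three \(3 \times 3\) layers has the same \(7 \times 7\) receptive field as a single \(7 \times 7\) layer but roughly half the weights. The channel width below is an illustrative choice.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

C = 64                                  # illustrative channel width
three_3x3 = 3 * conv_params(3, C, C)    # three stacked 3x3 layers: 7x7 receptive field
one_7x7 = conv_params(7, C, C)          # single 7x7 layer: same receptive field
```

With 64 channels the stack uses \(27C^2\) weights versus \(49C^2\), while also interleaving three ReLU non-linearities instead of one.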

Because most of the values in our FMIH are zeros (i.e., most cells in the 2D histogram are empty) and because we only have two output classes (authentic and portrait mode), more FC layers seem to degrade the training performance.

2.4 Portrait Mode Detection Experiments

Having introduced a new method to detect focus manipulations, we first demonstrate that our method can accurately identify manipulated images even if they are geometrically correct. Here, we also show that our method is more accurate than both past forensic methods (Bianchi et al. 2011; Dirik and Memon 2009; Mahdian and Saic 2009) and the modern vision baseline of CNN classification applied directly to the image pixels. Second, having claimed that the photometric relationship of noise cues with the image intensity is important, we will show that our FMIH histograms are a more useful representation of these cues.

To demonstrate our performance on the hard cases of geometrically correct focus manipulations, we have built a focus manipulation dataset (FMD) of images captured with a Canon 60D DSLR and two smartphones having dual-lens, portrait-mode-enabled cameras: the iPhone7Plus and the Huawei Mate9. Images from the DSLR represent real shallow-DoF images, having been taken with focal lengths in the range 17–70 mm and f-numbers in the range f/2.8–f/5.6. The iPhone was used to capture aligned pairs of authentic and manipulated images using portrait mode. The Mate9 was also used to capture authentic/manipulated image pairs, but these are only approximately aligned due to its inability to save the image both before and after portrait mode editing.

We use 1320 such images for training and 840 images for testing. The training set consists of 660 authentic images (220 from each of the three cameras) and 660 manipulated images (330 from each of iPhone7Plus and Huawei Mate9). The test set consists of 420 authentic images (140 from each of the three cameras) and 420 manipulated images (140 from each of iPhone7Plus and HuaweiMate9).

Figure 3.4 shows five sample images from FMD and illustrates the perceptual realism of the manipulated images. The first row of Table 3.1 quantifies this performance and shows that a range of CNN models (AlexNet, CaffeNet, VGG16, and VGG19) have classification accuracies in the range of 76–78%. Since our method uses five different feature maps which can easily be interpreted as images, the remaining rows of Table 3.1 show the classification accuracies of the same CNN models applied to these feature maps. The accuracies are slightly lower than for the image-based classification.

In Sect. 3.2.1, we claimed that a proper accounting for signal-dependent noise via our FMIH histograms improves upon the performance of the underlying features. This is demonstrated by comparing the image- and feature map-based classification performance of Table 3.1 with the FMIH-based classification performance shown in Table 3.2. Using FMIH, even the relatively simple SVMs and LeNet CNNs deliver classification accuracies in the 80–90% range. Our FMIHNet architecture produces significantly better results than these, with our method’s voting output having a classification accuracy of 98%.

Table 3.1 Image classification accuracy on FMD
Table 3.2 FMIH classification accuracy on FMD

2.5 Conclusions on Detecting Geometrically Correct Synthetic Blur

We have presented a novel framework to detect focus manipulations, which represent an increasingly difficult and important forensics challenge in light of the availability of new camera hardware. Our approach exploits photometric histogram features, with a particular emphasis on noise, whose shapes are altered by the manipulation process. We have adopted a deep learning approach that classifies these 2D histograms separately and then votes for a final classification. To evaluate this, we have produced a new focus manipulation dataset with images from a Canon60D DSLR, iPhone7Plus, and HuaweiMate9. This dataset includes manipulations, particularly from the iPhone portrait mode, that are geometrically correct due to the use of dual lens capture devices. Despite the challenge of detecting manipulations that are geometrically correct, our method’s accuracy is 98%, significantly better than image-based detection with a range of CNNs, and better than prior forensics methods.

While these experiments make it clear that the detection of imagery generated by first-generation portrait modes is reliable, it’s less clear that this detection capability will hold up as algorithmic and hardware improvements are made to mobile phones. This raises the question of how detectable digital blurring (i.e., blurring performed in software) is compared to optical blur.

3 Differences Between Optical and Digital Blur

As mentioned in the noise analysis earlier in this chapter, the shape of the camera response function (CRF) shows up in image statistics. The non-linearity of the CRF impacts more than just noise; in particular, its effect on motion deblurring is now well understood (Chen et al. 2012; Tai et al. 2013). While that work showed that an unknown CRF is a noise source in blur estimation, we will demonstrate that it is a key signal for distinguishing authentic edges from software-blurred gradients, particularly at splicing boundaries. This allows us to identify forgeries that involve artificially blurred edges, in addition to those with artificially sharp transitions.

We first consider blurred edges. For images captured by a real camera, blur is applied to scene irradiance as the sensor is exposed, and the result is then mapped to image intensity via the CRF: two operations, blur convolution followed by CRF mapping. By contrast, in a forged image, the irradiance of a sharp edge is mapped to intensity first and then blurred by the manual blur point spread function (PSF). The key to distinguishing these is that the CRF is a non-linear operator and therefore does not commute with convolution: applying the PSF before the CRF leads to a different result than applying the CRF first, even when the underlying signal is the same.

The key difference is that the profile of an authentically blurred edge (PSF before CRF) is asymmetric w.r.t. the center location of the edge, whereas an artificially blurred edge (CRF before PSF) is symmetric, as illustrated in Fig. 3.6. This is because, due to its non-linearity, the slope of the CRF differs across the range of intensities: CRFs typically have a larger gradient at low intensities, a smaller gradient at high intensities, and an approximately constant slope in the mid-tones. To capture this, we use the bivariate histogram of pixel intensity versus gradient magnitude, which we call the intensity gradient histogram (IGH), as the feature for detecting synthetic blur around edges in an image. The IGH captures key information about the CRF without explicitly estimating it and, as we show later, its shape differs between natural and artificial edge profiles in a way that can be detected with existing CNN architectures.
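The asymmetry can be verified numerically. The sketch below applies an assumed gamma-curve CRF before and after Gaussian blurring of the same irradiance step, then measures how much the gradient profile deviates from symmetry about the edge center; the step heights, blur width, and CRF are illustrative assumptions.

```python
import numpy as np

f = lambda r: r ** (1 / 2.2)            # assumed CRF: a gamma curve

x = np.arange(-50, 51)
k = np.exp(-x**2 / (2 * 5.0**2))
k /= k.sum()                            # Gaussian blur PSF, sigma = 5 pixels

step = np.where(np.arange(201) < 100, 0.1, 0.9)    # irradiance step edge

authentic = f(np.convolve(step, k, mode='same'))   # blur first, then CRF
forged = np.convolve(f(step), k, mode='same')      # CRF first, then blur

def asymmetry(profile):
    """Total mismatch of the gradient about the edge centre (x = 99.5)."""
    g = np.gradient(profile)
    return np.abs(g[100:121] - g[99:78:-1]).sum()
```

The forged profile's gradient is symmetric to within floating-point error, while the authentic one is visibly skewed toward the dark side of the edge, exactly the shape difference the IGH encodes.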

Fig. 3.6
figure 6

Authentic blur (blue), authentic sharp (cyan), forgery blur (red) and forgery sharp (magenta) edge profiles (left), gradients (center), and IGHs (right)

Before starting the theoretical analysis, we clarify our notation. For simplicity, we study the role of operation ordering on a 1D edge. We denote the CRF, assumed to be a non-linear, monotonically increasing function, as f(r), normalized to satisfy \(f(0)=0\) and \(f(1)=1\). The inverse CRF is denoted \(g(R)=f^{-1}(R)\), satisfying \(g(0)=0\) and \(g(1)=1\). Here r represents irradiance and R intensity. We assume the blur PSF is a Gaussian function, having confirmed that the blur tools in popular image manipulation software packages use a Gaussian kernel,

$$\begin{aligned} K_{g}(x) = \frac{1}{\sigma \sqrt{2\pi }} e^{-\frac{x^2}{2\sigma ^2}} \end{aligned}$$

Suppose the edge in irradiance space r is a step edge, like the one shown in green in Fig. 3.6,

$$\begin{aligned} H_{step}(x) = {\left\{ \begin{array}{ll} a &{} \quad x < c\\ b &{} \quad x \geqq c \\ \end{array}\right. } \end{aligned}$$

where x is the pixel location. Using the unit step function

$$\begin{aligned} u(x) = {\left\{ \begin{array}{ll} 0 &{} \quad x < 0\\ 1 &{} \quad x \geqq 0 \\ \end{array}\right. } \end{aligned}$$

the step edge can be represented as

$$\begin{aligned} H_{step}(x) = (b-a)u(x-c)+a \end{aligned}$$

3.1 Authentically Blurred Edges

An authentically blurred (ab) edge is

$$\begin{aligned} I_{ab} = f(K_{g}(x) *H_{step}(x)) \end{aligned}$$

where \(*\) represents convolution. The gradient of this is

$$\begin{aligned} \nabla I_{ab}&= f'(K_{g}(x) *H_{step}(x)) \cdot \frac{d [K_{g}(x) *H_{step}(x)]}{d x} \\&= f'(K_{g}(x) *H_{step}(x)) \cdot K_{g}(x) *\frac{d H_{step}(x)}{d x} \end{aligned}$$

Because the differential of the step edge is a delta function,

$$\begin{aligned} \frac{d H_{step}(x)}{d x} = (b-a)\delta (x-c) \end{aligned}$$

We have

$$\begin{aligned} \nabla I_{ab}&= f'(K_{g}(x) *H_{step}(x)) \cdot K_{g}(x) *(b-a)\delta (x-c) \\&= (b-a) f'(K_{g}(x) *H_{step}(x)) \cdot K_{g}(x-c) \\ \end{aligned}$$

Substituting (3.9) into the above equation, we obtain the relationship between \(I_{ab}\) and its gradient \(\nabla I_{ab}\):

$$\begin{aligned} \nabla I_{ab} = (b-a) f'(f^{-1}(I_{ab})) \cdot K_{g}(x-c) \end{aligned}$$

Here \(K_{g}(x-c)\) is just the blur kernel shifted to the location of the step. Because f is non-linear, with a large gradient at low irradiance and a small gradient at high irradiance, \(f'\) varies across the edge. Therefore, the IGH of an authentically blurred edge is asymmetric, as shown in blue in Fig. 3.6.
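This relationship can be checked numerically: blur a step in irradiance, apply an assumed gamma CRF, and compare the numerical gradient against \((b-a)\,f'(f^{-1}(I_{ab}))\,K_g(x-c)\). The step heights, blur width, and CRF below are illustrative assumptions.

```python
import numpy as np

gamma = 1 / 2.2
f = lambda r: r ** gamma                 # assumed CRF
f_inv = lambda R: R ** (1 / gamma)       # inverse CRF g
f_prime = lambda r: gamma * r ** (gamma - 1)

a, b, c, sigma = 0.1, 0.9, 0.0, 3.0
x = np.linspace(-20, 20, 4001)
dx = x[1] - x[0]
k = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

step = np.where(x < c, a, b)                     # H_step
s = np.convolve(step, k, mode='same') * dx       # K_g * H_step (discretized)
I_ab = f(s)                                      # authentically blurred edge

lhs = np.gradient(I_ab, dx)                      # numerical gradient of I_ab
rhs = (b - a) * f_prime(f_inv(I_ab)) * \
      np.exp(-(x - c)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
```

Away from the array boundaries the two sides agree to numerical precision, confirming the closed form.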

3.2 Authentic Sharp Edge

In real images, the assumption that a sharp edge is a step function does not hold. Some works (Ng et al. 2007) assume a sigmoid function for simplicity; we instead approximate the authentic sharp (as) edge with a small Gaussian kernel:

$$\begin{aligned} I_{as} = f(K_{s}(x) *H_{step}(x)) \end{aligned}$$

The gradient has the same form as that of an authentically blurred edge (3.11), with a smaller kernel size and \(\sigma \). Because the blur extent is very small, the transition region is narrow and the effect of the CRF is not very pronounced; the IGH remains approximately symmetric, shown as cyan in Fig. 3.6.

3.3 Forged Blurred Edge

The model for a forged blurred (fb) edge is

$$\begin{aligned} I_{fb}&= K_{g}(x) *f(H_{step}(x)) \end{aligned}$$

The gradient of this is

$$\begin{aligned} \nabla I_{fb}&= \frac{d [K_{g}(x) *f(H_{step}(x))]}{d x} \\&= K_{g}(x) *[f'(H_{step}(x)) \cdot H_{step}'(x)] \end{aligned}$$

Since

$$\begin{aligned} f'(H_{step}(x)) = (f'(b)-f'(a))u(x-c)+f'(a) \end{aligned}$$

and, by the sifting property of the delta function,

$$\begin{aligned} f(x) \cdot \delta (x) = f(0) \cdot \delta (x), \end{aligned}$$

we have

$$\begin{aligned} \nabla I_{fb}&= K_{g}(x) *[(b-a) f'(b) \delta (x-c)] \\&= (b-a) f'(b) K_{g}(x) *\delta (x-c) \\&= (b-a) f'(b) K_{g}(x-c). \end{aligned}$$

Clearly, \(\nabla I_{fb}\) has the same shape as the PSF kernel, which is symmetric w.r.t. the location of the step c, shown as red in Fig. 3.6.

3.4 Forged Sharp Edge

A forged sharp (fs) edge appears as an abrupt jump in intensity as

$$\begin{aligned} I_{fs} = f(H_{step}(x)) \end{aligned}$$

The gradient of the spliced sharp boundary image is

$$\begin{aligned} \nabla I_{fs}&= \frac{d I_{fs}}{d x} = \frac{d [f(H_{step}(x))]}{d x} \\&= f'(H_{step}(x)) \cdot (b-a)\delta (x-c) \\&= (b-a) f'(b) \cdot \delta (x-c) \end{aligned}$$

There are only two intensities, f(a) and f(b), both having the same (large) gradient at the transition. The IGH of a forged sharp edge therefore has all pixels falling into only two bins, shown as magenta in Fig. 3.6.

Fig. 3.7
figure 7

The figure shows the four different edge classes and their IGH

3.5 Distinguishing IGHs of the Edge Types

To validate the theoretical analysis, we show the IGHs of the four types of edges in Fig. 3.7. An authentically blurred edge is blurred first, inducing values between the low and high irradiance levels. This symmetric blur profile is then mapped through the CRF, becoming asymmetric.

In the forged blur edge case, by contrast, the irradiance step edge is first mapped by the CRF to a step edge in intensity space. The artificial blur (via Photoshop, etc.) then induces values between the two intensity extrema, and the profile reflects the symmetric shape of the PSF.

The forged sharp edge (from, for instance, a splicing operation) is an ideal step edge, whose nominal IGH has only two bins with non-zero counts. However, due to noise, shading, and other effects in images captured by real cameras, the IGH of a forged sharp edge takes on a rectangular shape, as shown in Fig. 3.7. The horizontal line shows pixels falling into bins of different intensities with the same gradient, caused by the pixel intensity varying along the sharp edge. The two vertical lines show pixels falling into bins of different gradients with similar intensity values, caused by the pixel intensity varying within the constant-color regions.

As for the authentic sharp edge, its IGH is easily confused with that of a forged blurred edge, in that both are symmetric. If we only considered the shape of the IGH, this would lead to a large number of false alarms. To disambiguate the two, we add an additional feature: the absolute value of the center intensity and gradient of each bin. This value helps because, at the same intensity value, the gradient of a blurred edge is always smaller than that of a sharp edge. With our IGH feature, we are able to detect splicing boundaries that are hard for prior methods, such as spliced regions having constant color or those captured by the same camera.
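As an illustration, an IGH feature along these lines can be sketched in a few lines of NumPy. The bin counts and normalization below are hypothetical choices, not the chapter's exact settings:

```python
import numpy as np

def igh_feature(patch, n_int=16, n_grad=16):
    # 2D histogram over (intensity, gradient magnitude) pairs,
    # with the bin centres appended as the absolute intensity/gradient cue.
    gy, gx = np.gradient(patch.astype(float))
    grad = np.hypot(gx, gy)
    hist, i_edges, g_edges = np.histogram2d(
        patch.ravel(), grad.ravel(),
        bins=(n_int, n_grad),
        range=((0.0, 1.0), (0.0, grad.max() + 1e-8)))
    hist /= hist.sum()                                 # normalise counts
    i_centres = 0.5 * (i_edges[:-1] + i_edges[1:])
    g_centres = 0.5 * (g_edges[:-1] + g_edges[1:])
    return np.concatenate([hist.ravel(), i_centres, g_centres])
```

With 16 bins per axis, the feature is the 256 normalized counts plus the 32 bin centres.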

3.6 Classifying IGHs

Having described the IGH and how these features differ between the four categories of edges, we now consider mechanisms to solve the inverse problem of inferring the edge category from an image patch containing an edge. Since our IGH is similar to the bag-of-words-type features used extensively in computer vision, SVMs are a natural classification mechanism to consider. In light of the recent success of powerful Convolutional Neural Networks (CNNs) applied directly to pixel arrays, we consider this approach as well, but find limited performance, which may be due to relatively sparse training data over all combinations of edge step height, blur extent, etc. To address this, we map the classification problem from the data-starved patch domain to character shape recognition in the IGH domain, for which we can leverage existing CNN architectures and training data.

3.6.1 SVM Classification

As in other vision applications, SVMs with the histogram intersection kernel (Maji et al. 2008) perform best for IGH classification. We unwrap our 2D IGH into a vector and append the center intensity and gradient value of each bin. Following the example of SVM-based classification on bag-of-words features, we use a multi-class scheme by training four one-vs-all models: authentic blur versus others, authentic sharp versus others, forged sharp versus others, and forged blur versus others. We then combine the scores of all four models to classify the IGH as either authentic or a forgery.
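A minimal sketch of the kernel and the score combination follows; the class ordering is a hypothetical choice, and the precomputed Gram matrix can be handed to an off-the-shelf solver such as scikit-learn's `SVC(kernel="precomputed")` for the actual training:

```python
import numpy as np

def hist_intersection_kernel(A, B):
    # Gram matrix K[i, j] = sum_k min(A[i, k], B[j, k])  (Maji et al. 2008),
    # for rows of A and B holding (unwrapped) IGH feature vectors.
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def combine_one_vs_all(scores):
    # scores: (n_samples, 4) decision values from the four binary SVMs,
    # in a hypothetical order [authentic blur, authentic sharp,
    # forged sharp, forged blur]; predict the strongest responder.
    return np.argmax(scores, axis=1)
```

The final authentic-versus-forgery call then reduces to checking which side of the four-way split the winning class falls on.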

3.6.2 CNN on Edge Patches

We evaluate whether a CNN applied directly to edge patches can outperform methods involving our IGH. We use the very popular Caffe (Jia et al. 2014) implementation for the classification task.

We first train a CNN model on edge patches. To examine the difference between authentic and forged patches, we synthesized the authentic and forged blur processes on white, gray, and black edges, as shown in Fig. 3.8. Given the same irradiance map, we apply the same Gaussian blur and CRF mapping in both orderings. The white region in the authentically blurred image always appears larger than in the forged blur image, because the CRF has a higher slope in the low-intensity region. That means the intensities of an authentically blurred edge are brought to the white point faster than those of a forgery. This effect can also be observed in real images, such as the Skype logo in Fig. 3.8, where the white letters in the real image are bolder than in the forgery. Another contributor to this effect is the limited dynamic range of cameras. A forged sharp edge will appear like the irradiance map with step edges, while an authentic sharp edge is easily confused with a forged blur, since the CRF effect is only distinguishable with a relatively large amount of blur. Thus, the transition region around the edge potentially contains a cue for splicing detection in a CNN framework.
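The "wider white region" observation can be reproduced with a toy simulation; the gamma CRF and the 0.9 near-white threshold below are illustrative assumptions, not the chapter's calibrated values:

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def crf(irr, gamma=2.2):
    # Hypothetical concave (gamma) CRF standing in for a real camera curve.
    return np.clip(irr, 0.0, 1.0) ** (1.0 / gamma)

H = np.where(np.arange(201) < 100, 0.0, 1.0)     # black-to-white irradiance step
K = gaussian_kernel(sigma=4.0, radius=20)

authentic = crf(np.convolve(H, K, mode="same"))  # blur, then CRF
forged = np.convolve(crf(H), K, mode="same")     # CRF, then blur

# The concave CRF pulls mid-tones toward white, so the near-white region
# of the authentic edge is wider than that of the forged edge.
white_auth = int((authentic[40:161] > 0.9).sum())
white_forged = int((forged[40:161] > 0.9).sum())
```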

Fig. 3.8 Authentic versus forged images and edge profiles. The synthesized images use the same CRF and Gaussian kernel. The real images are captured by a Canon 70D. The forged image is generated by blurring a sharp image by the amount that matches the authentic image

3.6.3 CNN on IGHs

Our final approach marries our IGH feature with the power of CNNs. Usually, one would not pair a histogram with CNN classification, because spatial arrangements are important for other tasks, e.g., object recognition. But, as our analysis has shown, the cues relevant to our problem are found at the pixel level, and our IGH eliminates various nuisance factors: the orientation of the edge, the height of the step, etc. This reduces the dimensionality of the training data and, thus, the large number of training samples needed to produce accurate models. Given the lack of an ImageNet-scale training set for forgeries, this is a key advantage of our method.

In the sense that a quantized IGH looks like various U-shaped characters, our approach reduces the edge classification problem to a handwritten character recognition problem, which has been well studied (LeCun et al. 1998). Since we are only interested in the shape of the IGH, the LeNet model (LeCun et al. 1998) is well suited to the IGH recognition task.
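One way to feed an IGH to a LeNet-style network is to resample it onto the 28x28 grid that MNIST-trained architectures expect. The nearest-neighbour resize below is a hypothetical preprocessing choice, not necessarily the chapter's exact pipeline:

```python
import numpy as np

def igh_to_lenet_input(hist2d, size=28):
    # Nearest-neighbour resample of a 2D IGH onto a LeNet-style 28x28
    # grid, scaled to 8-bit, so a character-recognition CNN can read it.
    rows = np.arange(size) * hist2d.shape[0] // size
    cols = np.arange(size) * hist2d.shape[1] // size
    img = hist2d[np.ix_(rows, cols)].astype(float)
    return (255.0 * img / (img.max() + 1e-12)).astype(np.uint8)
```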

3.7 Splicing Logo Dataset

To validate our IGH-based classification approach, we built our own Splicing Logo Dataset (SpLogo) containing 1533 authentic images and 1277 forged blur images of logos with different colors and different amounts of blur. All the images are taken with a Canon 70D. Optical blur is controlled via two different settings: one is changing the focal plane (moving the lens) and the other is changing the aperture with the focal plane slightly off the logo plane. The logos are printed on a sheet of paper, and the focal plane is set parallel to the logo plane to eliminate the effect of depth-related defocus. The digitally blurred images are then generated to match the amount of optical blur through an optimization routine.
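The chapter does not specify the optimization routine used for matching; a simple stand-in is a grid search over the digital-blur sigma that minimizes the L2 difference to the optically blurred reference:

```python
import numpy as np

def gaussian_blur1d(signal, sigma):
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return np.convolve(signal, k / k.sum(), mode="same")

def match_blur_sigma(sharp, optical, sigmas):
    # Pick the digital-blur sigma whose result is closest (L2)
    # to the optically blurred reference profile.
    errors = [np.sum((gaussian_blur1d(sharp, s) - optical) ** 2)
              for s in sigmas]
    return sigmas[int(np.argmin(errors))]
```

In practice one would refine the grid (or use a 1D minimizer) around the best coarse sigma.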

3.8 Experiments Differentiating Optical and Digital Blur

We use a histogram intersection kernel (Maji et al. 2008) for our SVM and LeNet (LeCun et al. 1998) for our CNN, with a base learning rate of \(10^{-6}\) and a maximum of \(10^{6}\) iterations. The training set contains 1000 authentic and 1000 forged patches.

Table 3.3 compares the accuracy of the different approaches described in Sect. 3.6. Somewhat surprisingly, the CNN applied directly to patches does the worst job of classification. Among the classification methods employing our hand-crafted feature, CNN classifiers obtain better results than SVMs, and adding the absolute value of intensity and gradient to the IGH increases classification accuracy.

Table 3.3 Patch classification accuracy for SpLogo data and different classification approaches

3.9 Conclusions: Differentiating Optical and Digital Blur

These experiments show that, at a patch level, the IGH is an effective tool to detect digital blur around edge regions. How these patch-level detections relate to a higher-level forensics application will differ based on the objectives. For instance, to detect image splicing (with or without subsequent edge blurring), it would suffice to find strong evidence of one object with a digitally blurred edge. To detect globally blurred images, on the other hand, it would require evidence that all of the edges in the image are digitally blurred. In either case, the key to our detection method is that a non-linear CRF leads to differences in pixel-level image statistics depending on whether it is applied before or after blurring. Our IGH feature captures these statistics and provides a way to eliminate nuisance variables such as edge orientation and step height so that CNN methods can be applied despite a lack of a large training set.

4 Additional Forensic Challenges from Computational Cameras

Our experiments with blurred edges show that it’s possible to differentiate blur created optically from blur added post-capture via signal processing. The key to doing so is understanding that the non-linearities of the CRF are not commutative with blur, which is modeled as a linear convolution. But what about the CRF itself? Our experiments involved imagery from a small number of devices, and past work (Hsu and Chang 2010) has shown that different devices have different CRF shapes. Does that mean that we need a training set of imagery from all different cameras, in order to be invariant to different CRFs? In subsequent work (Chen et al. 2019), we showed that CRFs from nearly 200 modern digital cameras had very similar shapes. While the CRFs between camera models differed enough to recognize the source camera when a calibration target (a color checker chart) was present, we later found that the small differences did not support reliable source camera recognition on ‘in the wild’ imagery. For forensic practitioners, the convergence of CRFs represents a mixed bag: their similarity makes IGH-like approaches more robust to different camera types, but reduces the value of CRFs as a cue to source camera model identification.

Looking forward, computational cameras present additional challenges to forensics practitioners. As described elsewhere in this book, the development of deep neural networks that generate synthetic imagery is an important tool in the fight against misinformation. While Generative Adversarial Network (GAN)-based image detectors are quite effective at this point, they may become less so as computational cameras increasingly incorporate U-Net enhancements (Ronneberger et al. 2015) that are architecturally similar to GANs. It remains to be seen whether U-Net enhanced imagery will lead to increased false alarms from GAN-based image detectors, since the current generation of evaluation datasets don’t explicitly include such imagery.

To the extent that PRNU and CFA-based forensics are important tools, new computational imaging approaches challenge their effectiveness. Recent work from Google (Wronski et al. 2019), for instance, uses a tight coupling between multi-frame capture and image processing to remove the need for demosaicing and breaks the sensor-to-image pixel mapping that’s needed for PRNU to work. As these improvements make their way into mass market smartphone cameras, assumptions about the efficacy of classical forensic techniques will need to be re-assessed.