Robustness of digital camera identification with convolutional neural networks

This paper considers the area of digital forensics (DF). One of the problems in DF is the identification of digital cameras based on images. This problem has attracted attention in recent years due to the popularity of social media platforms such as Facebook and Twitter, where large numbers of photographs are shared. Although many algorithms and methods for digital camera identification have been proposed, there is a lack of research on their robustness. Therefore, this paper discusses the robustness of digital camera identification with the use of a convolutional neural network. It is assumed that images may be of poor quality, for example degraded by Poisson noise, Gaussian blur, random noise or removal of the pixels' least significant bit. Experimental evaluation conducted on two large image datasets (including the Dresden Image Database) confirms the usefulness of the proposed method: noised images are recognized with almost the same high accuracy as normal images.


Introduction
Digital forensics is a popular area that attracts considerable scientific attention. Problems such as the identification of imaging sensors are especially interesting. One of the most challenging issues in digital forensics (and also in image processing) is the identification of a camera based on its images, treating them as a "digital fingerprint" or a proof of presence. This domain is referred to as hardwaremetry [13]. Camera identification may be realized in two ways. The first is individual source camera identification (ISCI), which distinguishes a certain camera among cameras of both the same and different camera models. The second is source camera model identification (SCMI), which distinguishes a certain camera model among different models but does not distinguish a particular copy of a camera from other cameras of the same model. For example, for the following cameras: Sony A7 (0), Sony A7 (1), ..., Sony A7 (n), Nikon D750 (0), Nikon D750 (1), ..., Nikon D750 (n), ISCI distinguishes all cameras as different (Sony A7 (0), Sony A7 (1), ...), while SCMI distinguishes only the general model (Sony A7, Nikon D750). ISCI is therefore a much stronger form of identification than SCMI, and much research has been conducted in this domain [16-18, 22-24, 27]. One of the most popular algorithms for individual source camera identification was proposed by Lukás et al. [27]. This algorithm identifies cameras using the so-called Photo-Response Nonuniformity Noise (PRNU), which is also called sensor pattern noise or noise residual. The goal is to calculate N = I − F(I), where N is the noise residual, I is an input image and F is a denoising filter [21]. N is unique for every camera and may be used for identification. Experimental evaluation confirmed very high camera identification accuracy. In recent years, convolutional neural networks (CNN) have renewed interest in the field of camera identification [4,11,31,38,40,43].
Due to their nature, CNNs offer very high classification accuracy in various tasks, such as text or image classification and pattern recognition.
In this paper the robustness of digital camera identification with the use of a proposed convolutional neural network is discussed. Robustness is understood as the ability to recognize a camera based on visually degraded images. More precisely, the network is trained on "normal" images of a camera and tested on images of the same camera degraded by Poisson noise, Gaussian blur, random noise and removal of the least significant bit (LSB) of pixel intensities. Results indicate that the network successfully identifies even strongly degraded images as coming from a particular camera. The discussed CNN may also be used for digital camera identification on non-degraded images. For evaluation, two large image datasets are used. The first dataset includes modern cameras, among them recent digital single lens reflex and mirrorless cameras, compact cameras and smartphones; the second one is the Dresden Image Database [14], which is often used for benchmarking. This paper is a continuation of the research presented in [3] and [2]. In [3] two algorithms, called PSNR-CT and DEPECHE, were proposed. The PSNR-CT algorithm is used for ultra-fast camera identification (compared to Lukás et al. [27]). The DEPECHE algorithm is used for preventing camera identification and is based on the analysis of the image histogram. In [2] the robustness of the camera identification algorithm of Lukás et al. [27] was checked by analysing degraded images. The degradation techniques included noising, blurring and removing the least significant bit; an algorithm to bypass the identification performed by Lukás et al.'s algorithm was also proposed. However, [2] did not examine the impact of image degradation techniques on camera identification by algorithms other than that of Lukás et al., for example convolutional neural networks. This is the motivation for this paper.

Contribution
The primary contribution of this paper is a study of the robustness of digital camera identification with a convolutional neural network (CNN). The analysis covers degradation of image quality by Poisson noise, Gaussian blur, random noise and removal of the pixels' least significant bit (LSB). It is shown that even strongly degraded images are still recognized by the CNN, so identification of the original camera that produced the image remains possible. Experiments are performed with a proposed CNN that may also be used for digital camera identification on non-degraded images.

Organization of the paper
In the next section, previous and related work is recalled. Section 3 describes the proposed convolutional neural network architecture. In Section 4 the experimental evaluation of the proposed method is described. Finally, the last section concludes this work. The Appendix presents the code implementing the proposed convolutional neural network in the Python programming language. Throughout the paper, bold font denotes matrices or vectors.

Previous work
One of the most popular algorithms for digital camera fingerprinting was proposed by Lukás et al. [27]. This algorithm utilizes the Photo-Response Nonuniformity Noise (PRNU), which is unique for each camera and may serve as a fingerprint. The PRNU is also called sensor pattern noise or noise residual. It is calculated as N = I − F(I), where N is the noise residual, I is an input image and F is a denoising filter. This method was further researched in [16-18, 22, 29]. In Marra et al. [29] image residuals were utilized for camera identification. Co-occurrence matrices of selected neighbors were used for feature extraction. Residual images were calculated simply as the difference between the input image I and its denoised version F(I), where F was the denoising filter. The obtained features were used as input to an SVM classifier.
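The residual N = I − F(I) can be illustrated with a minimal NumPy sketch. The 3 × 3 mean filter used as F below is an assumption chosen only for brevity, and the function names are hypothetical; the denoising filters actually used in [27] and [29] are not reproduced here.

```python
import numpy as np

def mean_filter_3x3(image):
    """Stand-in denoising filter F: 3x3 mean over an edge-padded 2-D image."""
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Sum the nine shifted copies of the image, then average.
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def noise_residual(image, denoise=mean_filter_3x3):
    """Return N = I - F(I) for a 2-D image I and a denoising filter F."""
    image = np.asarray(image, dtype=np.float64)
    return image - denoise(image)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8)).astype(np.uint8)  # synthetic test image
residual = noise_residual(image)
```

A perfectly flat image yields a zero residual under this filter, which is why the residual is dominated by noise-like components rather than by scene content.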
Morshedi et al. [20] described an algorithm for recognizing a camera's sensor from High Dynamic Range (HDR) images. It was proposed to invert geometric transformations in order to enable proper PRNU detection. Reversal of upsampling and the patchwork were considered. Experiments held on the UNIFI dataset [35] of HDR images from a number of modern smartphones confirmed high accuracy of camera identification, which was at least 95%.
Agarwal et al. [1] described an algorithm for iris sensor identification. Image features were collected using the Block Image Statistical Measure (BISM), High Order Wavelet Entropy (HOWE), Texture Measure (TM), Single-level Multi-orientation Wavelet Texture (SlMoWT) and image quality measures.
Li et al. [26] discussed camera identification based on a compact representation of fingerprints. The proposed algorithm generates a compact representation of a camera's fingerprints with the use of a random projections strategy. Experiments showed that such an approach may be practical and efficient; however, the robustness of the proposed method was not checked.
Goljan et al. [15] discussed the effect of compression on camera identification using the sensor fingerprint. Results indicated that JPEG compression increased both the variance of the normalized correlation and the variance of the peak-to-correlation energy (PCE).
Taspinar et al. [36] considered seam-carved images. Seam-carving is understood as removing some parts of the image. Results indicated that sensor recognition could be realized even if the image block was smaller than 50 × 50 px. Another strategy includes the analysis of the generalized noise in natural images [37]. The proposed model utilizes parameters of the image that may be used as a camera fingerprint. Tuama et al. [39] proposed training a machine learning classifier on the concatenation of the co-occurrences of color band noise residuals with features computed with a Markovian model in the discrete cosine transform (DCT) domain. These features also include conditional probability statistics. Such a model gives high order statistics which supplement and enhance the identification rate.
The vulnerability of the deep learning approach to adversarial attacks was examined in Marra et al. [28]. It was discussed whether it is possible to deceive a CNN-based classifier in order to make the camera classification incorrect. The CNN-based classifier was attacked with the following methods: the Fast Gradient Sign Method (FGSM) [19], DeepFool [32] and the Jacobian-based Saliency Map Attack (JSMA) [33]. FGSM is very simple and relies on adding an additive noise to images. DeepFool relies on a local linearization of the classifier. JSMA is a greedy iterative procedure that detects and replaces the pixels that contribute most to the correct classification of the image.
Obregon et al. [12] presented a fully connected network. Feature maps were obtained as a convolution of image I and kernel K. The network utilized two convolutional layers, a max pooling layer and three fully connected layers with the ReLU function used for activation. Experiments held on the MICHE dataset [7], using an nVidia Tesla K80 (24GB), confirmed high accuracy of classification. Convolutional neural networks were also discussed in [5,8,25,30,34,41]. In Yang et al. [42] the concept of content-adaptive fusion residual networks was proposed. Images were divided into three categories: saturation, smoothness and others. For each image category a fusion residual network was trained by transform learning approach. Chen et al. [6] proposed a residual neural network (ResNet). The properties of the camera's lens system were used for training the network. The discussed method was evaluated for individual source camera identification on the Dresden Image Database [14]. However, none of the listed papers investigates the robustness of digital camera identification. Thus, if an image was degraded, for example by an adversarial attack intended to fool the classifier, the response of the classifier is not known.

Convolutional Neural Networks (CNN) - the background
Convolutional neural networks (CNN) have recently become very popular in many fields. They are used for natural language processing, object and pattern recognition, and various classification tasks, including text and image classification. The general structure of a convolutional neural network consists of layers of neurons. A neuron takes some value as input, performs a computation and passes the result to the next layer. Let us briefly recall the idea of CNNs. In contrast to the traditional multilayer perceptron architecture, a CNN uses two operations, convolution and pooling, to reduce an image to its essential features for further interpretation and classification. The general building blocks of CNNs are convolution, activation, pooling and fully connected layers. In a convolution layer, a filter is passed over the image, viewing a few pixels at a time (for instance, 3 × 3 or 5 × 5). The convolution operation is a dot product of the input pixel values with the weights defined in the filter: the products are summed into one number that represents all the pixels observed by the filter. The result of the convolution layer is passed to the activation layer, which introduces non-linearity into the network, which is trained using backpropagation. The most common activation function is the Rectified Linear Unit (ReLU), defined as f(x) = max(0, x). The activation function is applied to each value of its input. The pooling layer performs downsampling and reduces the size of the matrix. A filter is passed over the results of the previous layer and takes one number from each group, usually the maximum (a max-pooling layer), but in some cases the average. The goal of this operation is to focus on the most important information in each feature of the image, which allows the network to be trained much faster.
Finally, the fully connected layers form a traditional multilayer perceptron whose input is a one-dimensional vector representing the output of the previous layers. The output of the last fully connected layer is a list of probabilities for the different possible labels assigned to the image, usually calculated by the softmax function. The label with the highest probability is the classification decision. The idea of a CNN is presented in Fig. 1.
Fig. 1 The concept of Convolutional Neural Networks (CNN)
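The building blocks recalled above can be sketched in a few lines of NumPy. This is an illustration only: the input, the averaging filter and the untrained three-label output layer are all hypothetical, and the sliding dot product is written without flipping the filter, as is common in CNN libraries.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the filter over a 2-D image; each output is one dot product."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

def max_pool_2x2(x):
    """Keep the maximum of each non-overlapping 2x2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    blocks = x[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

def softmax(z):
    """Turn a score vector into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

image = np.arange(36, dtype=float).reshape(6, 6)   # synthetic 6x6 input
kernel = np.ones((3, 3)) / 9.0                     # hypothetical 3x3 filter
features = max_pool_2x2(relu(conv2d_valid(image, kernel)))

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, features.size))      # untrained layer, 3 labels
probs = softmax(weights @ features.ravel())
predicted_label = int(np.argmax(probs))
```

The 6 × 6 input shrinks to 4 × 4 after the 3 × 3 filter and to 2 × 2 after pooling, which shows how convolution and pooling together reduce the data passed to the fully connected part.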

Proposed CNN
Digital camera identification can be realized with the following convolutional neural network. In contrast to [38], where the network is trained on N = I − F(I) (N is the noise residual, I is an input image and F is a denoising filter), the proposed network may be trained directly on JPEG images without any additional procedures. The proposed network has three convolutional and two fully connected layers. Following Bondi et al. [4] and Yao et al. [43], we propose taking patches of size 64 × 64 × 3 as input. The network structure is listed below; the full Keras implementation (in the Python programming language) is presented in the Appendix.
1. A first convolutional layer of 32 filters with kernel 5 × 5 and stride 1, with ReLU as the activation function;
2. A max-pooling layer with pool size 2 × 2 and stride 2;
3. A second convolutional layer of 64 filters with kernel 5 × 5, with ReLU as the activation function;
4. A max-pooling layer with pool size 2 × 2;
5. A third convolutional layer of 128 filters with kernel 5 × 5, with ReLU as the activation function;
6. A max-pooling layer with pool size 2 × 2;
7. Two fully connected layers for classification: a first fully connected layer with 4096 neurons and ReLU as the activation function, and a second fully connected layer whose output is followed by the softmax function.
An input image I is passed to the first convolutional layer, consisting of 32 filters with kernel 5 × 5 and stride 1. Then, the ReLU function is used for activation and a max-pooling layer with a pool size of 2 × 2 and stride 2 is applied. The second convolutional layer consists of 64 filters with kernel 5 × 5; again ReLU is used for activation, followed by a max-pooling layer with a pool size of 2 × 2. The third convolutional layer consists of 128 filters with kernel 5 × 5, with ReLU as the activation function and max-pooling of size 2 × 2. The results are passed to the fully connected layers to obtain the final classification. The first fully connected layer consists of 4096 neurons, and ReLU is applied as the activation function to its output. The second fully connected layer, activated with the softmax function, provides the final classification.
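To make the data flow through these layers concrete, the sketch below traces the feature-map size for a 64 × 64 × 3 patch in plain Python. Two points are assumptions not stated in the text: convolutions use no padding, and every pooling layer uses stride 2 (the text gives the stride only for the first one). Both correspond to the Keras defaults, but the Appendix code remains the reference.

```python
def conv_out(size, kernel, stride=1):
    """Spatial output size of a convolution without padding (assumed)."""
    return (size - kernel) // stride + 1

def pool_out(size, pool=2, stride=2):
    """Spatial output size of a max-pooling layer (stride 2 assumed)."""
    return (size - pool) // stride + 1

size, channels = 64, 3          # input patch: 64 x 64 x 3
conv_params = 0
for filters in (32, 64, 128):   # the three convolutional layers
    conv_params += (5 * 5 * channels + 1) * filters  # weights plus biases
    size = pool_out(conv_out(size, 5))
    channels = filters
    print(f"after conv({filters}) + pool: {size} x {size} x {channels}")

flat = size * size * channels            # input length of the first dense layer
dense_params = flat * 4096 + 4096        # first fully connected layer only
print("flattened length:", flat)
```

Under these assumptions the spatial size goes 64, 30, 13, 4, so the first fully connected layer receives a vector of 4 · 4 · 128 = 2048 values. The size of the output layer depends on the number of cameras and is therefore left out.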

Experimental evaluation
The proposed CNN has been evaluated in two experiments. Firstly (Experiment I), the robustness of the proposed CNN to image degradation was examined. The network was trained on normal images. For classification, Poisson noised, Gaussian blurred, random noised and least significant bit-removed images were used to check whether the network correctly identifies the devices. Secondly (Experiment II), an experiment for typical individual source camera identification (ISCI) was conducted. Both experiments were performed on two image datasets, which are described in the next subsection. Images are represented in the RGB model, where pixel values for each color channel R (red), G (green) and B (blue) take values from [0, 255]. Scripts for image noising were implemented in Matlab, using the imnoise function.
For evaluation, the standard accuracy (ACC) and true positive rate (TPR) measures are used, defined as:
ACC = (TP + TN) / (TP + TN + FP + FN), TPR = TP / (TP + FN)
where TP/TN denotes "true positive/true negative" and FP/FN stands for "false positive/false negative". TP denotes the number of cases correctly classified to a specific class; TN are instances that are correctly rejected. FP denotes cases incorrectly classified to the specific class; FN are cases incorrectly rejected. As hardware, a notebook with an Intel Core i5-7300HQ@2.5-3.1GHz CPU, 24 gigabytes of RAM (DDR4-2400) and an nVidia GeForce GTX1050 GPU with 4 gigabytes of video memory has been used.
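As an illustration of these measures, the following sketch derives per-class TP, FN, FP and TN from a confusion matrix and computes ACC and TPR from them. The matrix values are invented for the example and are not results from this paper; rows are taken as actual classes and columns as predictions.

```python
import numpy as np

def per_class_metrics(confusion):
    """Return per-class (ACC, TPR); rows = actual class, columns = prediction."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    tp = np.diag(confusion)                 # correctly classified to the class
    fn = confusion.sum(axis=1) - tp         # incorrectly rejected
    fp = confusion.sum(axis=0) - tp         # incorrectly assigned to the class
    tn = total - tp - fn - fp               # correctly rejected
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)
    return acc, tpr

# Hypothetical 3-camera confusion matrix, 50 test images per camera.
confusion = [[48, 2, 0],
             [1, 47, 2],
             [0, 3, 47]]
acc, tpr = per_class_metrics(confusion)
```

For the first class of this made-up matrix, TP = 48, FN = 2, FP = 1 and TN = 99, giving TPR = 48/50 and ACC = 147/150.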
Experiments were held with 100 training epochs and a batch size of 32. The number of training epochs and the batch size were determined experimentally. Experiments showed that 100 epochs are sufficient to successfully train the CNN and obtain satisfactory classification accuracy. Due to the large number of tested devices, the full results of camera model identification are not presented, for clarity; instead, confusion matrices are presented only for brand recognition. In all tables serving as confusion matrices, rows denote the actual classes and columns denote the prediction results.
As mentioned in previous sections, the proposed CNN is tested on individual source camera identification (ISCI); therefore different copies of the same camera model, for example Nikon D200 (Ni10) and Nikon D200 (Ni11), are distinguished. The CNN was trained twice, independently for each dataset. In each case, 80% of the images were used for training and 20% for testing.
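A minimal sketch of such an 80/20 partition is given below. How the images were actually shuffled or grouped is not stated in the text, so the random shuffle, the fixed seed and the file names are assumptions made only for illustration.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the items and cut them into a training and a testing part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names for one device, e.g. Nikon D200 (Ni10).
images = [f"Ni10_{i:03d}.jpg" for i in range(50)]
train, test = split_train_test(images)
```

Applying the split separately to each device would keep every camera represented in both parts, which matters for ISCI since each physical device is its own class.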

Experiment I - Robustness of proposed CNN to image degradation operations
The robustness of the proposed CNN to image degradation operations such as Poisson noising, Gaussian blurring, adding random noise and removing the pixels' least significant bit (LSB) was analysed. In this experiment, the network was trained on normal images and evaluated on images degraded with the aforementioned methods. Some examples of image degradation can be seen in Figs. 2, 3, 4 and 5.
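Of the four operations, LSB removal is the simplest to state precisely. The sketch below assumes that "removing" the least significant bit means setting it to zero in every 8-bit channel value; the function name is hypothetical.

```python
import numpy as np

def remove_lsb(image):
    """Clear the least significant bit of every 8-bit channel value."""
    image = np.asarray(image, dtype=np.uint8)
    return image & np.uint8(0xFE)   # 0xFE = 11111110 in binary

pixels = np.array([[[255, 128, 7]]], dtype=np.uint8)  # one RGB pixel
cleared = remove_lsb(pixels)
```

Each channel value changes by at most 1 (here 255 becomes 254 and 7 becomes 6, while 128 is unchanged), so the degradation is practically invisible even though it alters the lowest bit plane of the image.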
Poisson noise
Poisson noise (also called quantum noise) is a signal-dependent noise that can be seen on images. Pixels x are generated discretely according to the Poisson distribution P:
P(x) = k^x e^(−k) / x!   (1)
where k is the mean parameter, which in the case of RGB images takes the same value as the processed pixel [9]. A sample Poisson-noised image can be seen in Fig. 2. Results of identification for Poisson noised images are presented in Tables 1 and 2.
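The noising step can be approximated in NumPy as follows. The experiments used Matlab's imnoise function, so this is only a sketch of the same idea (each output value is drawn from a Poisson distribution whose mean is the input value), not the script used in the paper; clipping to [0, 255] is an assumption needed to stay within 8 bits.

```python
import numpy as np

def poisson_noise(image, rng):
    """Draw every channel value from a Poisson distribution with that mean."""
    image = np.asarray(image, dtype=np.uint8)
    noisy = rng.poisson(lam=image.astype(np.float64))
    return np.clip(noisy, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = np.full((32, 32, 3), 120, dtype=np.uint8)   # synthetic flat patch
noisy = poisson_noise(image, rng)
```

Because the variance of a Poisson variable equals its mean, brighter regions receive stronger noise than dark ones, which is what makes this noise signal-dependent.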
Analysis of the classification confirms that Poisson noising cannot be considered a strategy for ensuring unlinkability between the camera and the image. The accuracy of 99% for both datasets is the same as for the identification of normal (non-degraded) images.

Gaussian blur
Results of identification for Gaussian blurred images are presented in Tables 3 and 4.
Results clearly indicate that the network recognizes particular models with 99% accuracy for both tested image datasets, which can be considered almost perfect. This means that even the strong image degradation caused by Gaussian blurring does not prevent linking the image with the camera. Moreover, the quality of Gaussian blurred images is not satisfactory, because the images (especially for high σ values) are strongly blurred.
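For reference, Gaussian blurring with standard deviation σ can be sketched in NumPy as a separable convolution. The kernel radius of 3σ, the edge padding and the function names are assumptions of this sketch and may differ from the Matlab routine used in the experiments.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D normalized Gaussian kernel truncated at a radius of 3*sigma."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Blur an H x W x C uint8 image with a separable Gaussian kernel."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    data = np.asarray(image, dtype=np.float64)
    padded = np.pad(data, ((radius, radius), (radius, radius), (0, 0)),
                    mode="edge")
    h, w = data.shape[:2]
    # Vertical pass over rows, then horizontal pass over columns.
    rows = sum(k * padded[i:i + h, :, :] for i, k in enumerate(kernel))
    out = sum(k * rows[:, j:j + w, :] for j, k in enumerate(kernel))
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

impulse = np.zeros((9, 9, 3), dtype=np.uint8)
impulse[4, 4] = 255                      # single white pixel
blurred = gaussian_blur(impulse, sigma=1.0)
```

The single white pixel is spread over its neighbourhood, and larger σ values spread it further, which corresponds to the stronger visual degradation mentioned above.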

Random noise
Random noise is an image noising technique in which some pixels are set to extreme values (usually 0 or 255, which in the RGB model stand for black or white). We propose to replace k pixels in the image in such a manner that k/2 randomly chosen pixels are set to 0 and k/2 pixels are set to 255, where k amounts to 50% of the image pixels. An example of such an operation is shown in Fig. 4. One may assume that replacing such a number of pixels is enough to claim that the image is visually degraded. Results of ISCI identification based on random noised images are presented in Tables 5 and 6.
The symbol * denotes values smaller than 1%
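The random noise procedure described above can be sketched as follows. It is assumed here that a "pixel" means all three color channels at one position, so that replaced pixels become pure black or pure white; the function name and the synthetic test patch are hypothetical.

```python
import numpy as np

def random_noise(image, fraction=0.5, rng=None):
    """Set k/2 random pixels to black and k/2 to white, k = fraction * pixels."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.array(image, dtype=np.uint8, copy=True)
    h, w = noisy.shape[:2]
    k = int(fraction * h * w)
    # Pick k distinct pixel positions in random order.
    chosen = rng.choice(h * w, size=k, replace=False)
    rows, cols = np.unravel_index(chosen, (h, w))
    half = k // 2
    noisy[rows[:half], cols[:half]] = 0      # black pixels
    noisy[rows[half:], cols[half:]] = 255    # white pixels
    return noisy

rng = np.random.default_rng(0)
image = np.full((16, 16, 3), 120, dtype=np.uint8)   # synthetic flat patch
noisy = random_noise(image, fraction=0.5, rng=rng)
```

With 256 pixels in this patch, 64 become black, 64 become white and the remaining 128 keep their original values.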