Exposing Digital Forgeries by Detecting a Contextual Violation Using Deep Neural Networks

  • Jong-Uk Hou
  • Han-Ul Jang
  • Jin-Seok Park
  • Heung-Kyu Lee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10763)


Previous digital-image forensics has focused on low-level features that contain traces of an image's modification history. In this paper, we present a framework that detects image manipulation through contextual violations. First, we propose a context-learning convolutional neural network (CL-CNN) that detects contextual violations in an image. Combined with a well-known object detector such as R-CNN, the proposed method evaluates contextual scores according to the combination of objects in the image. Experiments show that our method effectively detects contextual violations in the target image.

1 Introduction

In the age of high-performance digital cameras and the Internet, digital images have become one of the most popular information sources. Unlike text, images remain an effective and natural communication medium for humans, since their visual nature makes the content easy to understand. Traditionally, there has been confidence in the integrity of visual data, such that a picture in a newspaper is commonly accepted as certification of the news. Unfortunately, digital images are easily manipulated, especially since the advent of high-quality image-editing tools such as Adobe Photoshop and PaintShop Pro. Digital-image forensics, the practice of identifying forgeries in digital images, has therefore become an important field of research.

To cope with image forgeries, a number of forensic schemes have been proposed. Most previous digital-image forensics focused on low-level features that contain traces of the image's modification history. These methods detect local inconsistencies such as resampling artifacts [14], color filter array interpolation artifacts [2], JPEG compression artifacts [3], and so on. Photo-response non-uniformity (PRNU) is also widely used for detecting digital-image forgeries [1, 6, 7], and other methods detect the identical regions produced by copy-move forgery [8, 16].

Most local-inconsistency-based methods are vulnerable to common image processing such as JPEG and GIF compression, white balancing, and noise addition. High-level features such as lighting conditions [10] and shading and shadows [11], on the other hand, provide clues that remain fairly robust against such processing. In this paper, we present a forensic scheme that detects image manipulation through contextual violations among high-level objects in the target image (see Fig. 1). The relationships between objects in an image serve as a robust feature that is unaffected by common image processing.
Fig. 1.

Examples of image forgery detection based on contextual violation. (left) A boat was spliced onto a road; (right) a lamb appears in an office. The manipulated parts (areas in the red boxes) can be detected because these objects cause contextual violations. (Color figure online)

We propose a model that learns context from the combinations of image labels and their spatial coordinates. First, we propose a context-learning convolutional neural network (CL-CNN) that detects contextual violations in an image. CL-CNN is trained on a large annotated image database (COCO 2014) and learns the spatial context that existing graph-based context models do not provide. Combined with a well-known object detector such as R-CNN, the proposed method evaluates contextual scores according to the combination of objects in the image. As a result, our method effectively detects contextual violations in the target image.

2 Context-Learning CNN

In this research, we propose a context-learning convolutional neural network (CL-CNN) that learns the co-occurrence and spatial relationships between object categories. CL-CNN provides an explicit contextual model through deep learning on a large image database. It is trained to return a high value for natural (learned) combinations of object categories and a low value for unlikely combinations or spatial arrangements of image labels.
Fig. 2.

Input data structure and encoding process

2.1 Input Data Structure

The process of generating input data is as follows. We use an annotated image database for object locations and category information. Because the full image resolution is too large to use directly, object positions are recorded on an \( N \times N \) grid, where N is smaller than the height and width of the input image. The channel dimension of the input equals the total number of categories, so each category occupies its own channel. Grid cells covered by a labeled object are set to 1 in that object's channel, and the remaining cells are filled with 0.

In this research, we reduced the grid size to \( 8 \times 8 \) and used 80 categories, so the final input size is \( 8 \times 8 \times 80 \). The generation process is shown in Fig. 2.
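As an illustration, the encoding described above can be sketched as follows. The tuple-based annotation format, pixel-coordinate bounding boxes, and the function name are assumptions made for this sketch, not part of the paper's implementation.

```python
import numpy as np

N, C = 8, 80  # grid size and number of COCO categories

def encode_annotations(objects, img_w, img_h, n=N, c=C):
    """Encode object annotations into an n x n x c binary tensor.

    `objects` is a list of (category_index, x, y, w, h) tuples with the
    bounding box given in pixel coordinates (an assumed annotation
    format). Grid cells covered by an object's box are set to 1 in
    that object's category channel; all other cells stay 0.
    """
    grid = np.zeros((n, n, c), dtype=np.float32)
    for cat, x, y, w, h in objects:
        # map the box corners from pixel space to grid-cell indices
        x0 = int(np.floor(x / img_w * n))
        y0 = int(np.floor(y / img_h * n))
        x1 = int(np.ceil((x + w) / img_w * n))
        y1 = int(np.ceil((y + h) / img_h * n))
        grid[y0:y1, x0:x1, cat] = 1.0
    return grid
```

For example, a single object covering the top-left quarter of a 640x480 image fills a 4x4 block of cells in its category channel.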

2.2 CL-CNN Structure

The structure of CL-CNN is as follows (see Fig. 3). The network receives input data of size \( 8 \times 8 \times 80\), passes it through two convolutional layers and then three fully connected layers, and finally outputs a \( 2 \times 1\) vector. The first output value scores how natural the combination of categories and their spatial context is; the second scores how unnatural it is. The loss function is the Euclidean loss L, defined by
$$\begin{aligned} L = \frac{1}{2}\sum _i(y_i-a_i)^2, \end{aligned}$$
where y is the output of the CL-CNN, and a is the label of the data sample.
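The loss can be written out directly; below is a minimal NumPy sketch. The label convention (a = [1, 0] for natural samples, [0, 1] for shuffled ones) is our assumption based on the two-value output described above.

```python
import numpy as np

def euclidean_loss(y, a):
    """Euclidean loss L = 1/2 * sum_i (y_i - a_i)^2.

    `y` is the 2-vector CL-CNN output and `a` the target label,
    e.g. [1, 0] for a natural sample and [0, 1] for a shuffled one
    (an assumed label convention).
    """
    y, a = np.asarray(y, dtype=float), np.asarray(a, dtype=float)
    return 0.5 * np.sum((y - a) ** 2)
```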
Fig. 3.

Overall structure of the context-learning convolutional neural networks

2.3 Dataset Generation

Training the proposed network requires a large dataset: we need both a collection of natural label combinations and a collection of unnatural ones, together with the location and type of every object. A dataset that meets these criteria is Microsoft COCO: Common Objects in Context [13]. Microsoft COCO 2014 provides 82,783 training images and 40,504 validation images, each with label information. This aids the learning of detailed object models capable of precise 2D localization along with contextual information.

Before using the dataset, we excluded single-category images, since they are useless for learning contextual information; this left 65,268 multi-category images for training CL-CNN. The objects are divided into 80 categories. A positive set was constructed from the label information of the multi-category images, in the way described in Sect. 2.1.

We cannot build negative sets from existing databases, because we need combinations of unnatural labels that do not actually occur. We therefore generated negative sets in two ways. Negative set 1 was created by changing the size and position of the objects while maintaining the category combination. Negative set 2 was created by selecting combinations of weakly correlated categories. Figure 4 shows the histogram of co-occurrences between object categories. Using the probability \(P(c_1, c_2)\) from this histogram, class pairs \(c_1, c_2\) with a low co-occurrence probability \(P(c_1, c_2)\) were selected to generate the negative dataset. The objects in negative set 2 were then also shuffled in size and position while keeping the category combination.
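The selection of weakly correlated pairs from the co-occurrence histogram can be sketched as follows. The list-of-label-sets input format, the raw-count cutoff `max_count`, and the function names are illustrative assumptions.

```python
import numpy as np

def cooccurrence(label_sets, n_cat=80):
    """Count how often each pair of categories appears in the same image.

    `label_sets` is a list of per-image category-index lists
    (an assumed input format). Returns a symmetric n_cat x n_cat
    count matrix.
    """
    H = np.zeros((n_cat, n_cat))
    for labels in label_sets:
        labels = sorted(set(labels))
        for i, c1 in enumerate(labels):
            for c2 in labels[i + 1:]:
                H[c1, c2] += 1
                H[c2, c1] += 1
    return H

def rare_pairs(H, max_count=0):
    """Category pairs (c1 < c2) whose co-occurrence count is at most
    `max_count`; such unlikely combinations seed negative samples."""
    n = H.shape[0]
    return [(c1, c2) for c1 in range(n) for c2 in range(c1 + 1, n)
            if H[c1, c2] <= max_count]
```

In practice one would normalize the counts into \(P(c_1, c_2)\) and threshold the probability rather than the raw count; the cutoff here stands in for that choice.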

2.4 Network Training

In a naive approach, the combination-shuffled and location-shuffled datasets are trained on simultaneously. Testing this model on 'combination and location shuffling' and on 'location shuffling' yielded accuracies of 0.97 and 0.53, respectively: when both were learned at once, the 'combination change' signal dominated, and the over-fitted CL-CNN ignored the spatial context. The training procedure therefore needs to ensure that 'location shuffling' is learned sufficiently.

We therefore trained CL-CNN by first learning 'location shuffling' of objects and then sequentially fine-tuning on 'combination and location shuffling'. We set the learning rate to 0.001 for 'location shuffling' and to 0.00001 for 'combination and location shuffling'. Testing again on 'combination and location shuffling' and 'location shuffling' gave accuracies of 0.93 and 0.81, respectively. The test accuracy for 'location shuffling' improved greatly, from 0.53 to 0.81, compared with simultaneous training.
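The two-stage schedule above can be expressed as a small configuration. The `train_step` callback is a hypothetical stand-in for the actual Caffe training run; the stage names are illustrative.

```python
# Two-stage training schedule (sketch): stage 1 learns spatial context
# from location-shuffled negatives; stage 2 fine-tunes on combination-
# and-location-shuffled negatives at a much smaller learning rate so
# the spatial-context knowledge is not overwritten.
SCHEDULE = [
    {"negatives": "location_shuffled", "lr": 1e-3},
    {"negatives": "combination_and_location_shuffled", "lr": 1e-5},
]

def run_schedule(train_step, schedule=SCHEDULE):
    """Run each stage in order; `train_step(negatives, lr)` is an
    assumed trainer callback (e.g. wrapping a Caffe solver)."""
    for stage in schedule:
        train_step(stage["negatives"], stage["lr"])
```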
Fig. 4.

Histogram of the co-occurrences between object categories. Negative set was generated by selecting the combinations of the less correlated categories (e.g. combinations shown in circled regions)

3 Detection of Contextual Violation of Target Image

We propose a method to detect contextual violations in a target image using CL-CNN. The proposed method combines CL-CNN with the output of an existing object detector such as [4, 5, 15]; among these, we used Faster R-CNN [15] for the object-detection task. Using the detection results and their probability values, the system identifies the object that is most inappropriate for the image context. The proposed method proceeds as follows.

Step 1. Extract objects from the suspicious image: Let I be a suspicious image. Using the image object detector, we extract the area of the object in the image and calculate the category score in each area as follows:
$$\begin{aligned} P_{r_i}(c) = F(I), \end{aligned}$$
where the function \(F(\cdot )\) is a region-based object detector such as Faster R-CNN [15] applied to a single input image I, and \(P \in [0,1]^\mathbf{R \times \mathbf C }\) is the probability of each object class c for the detected region \(r_i\). Figure 5 shows a sample object-detection result and its details (Fig. 6).
Fig. 5.

Step 1. Extract object region and calculate category score using the object detector such as Faster R-CNN [15].

Fig. 6.

Generate input sets for CL-CNN

Step 2. Generate input sets for CL-CNN: After extracting objects from the image, candidates for the contextual violation check were selected by:
$$\begin{aligned} \mathbb {P} = \{(r_i, c): P_{r_i}(c) > \tau _i \} \end{aligned}$$
where \(\tau _i\) is the selection threshold for the raw output. If \( P_{r_i}(c) <\tau _i \), the corresponding object region \(r_i\) is not used. For example, when \( \tau _i = 0.7 \), three candidates (lamb, keyboard, and mouse) are selected in the sample image in Fig. 5. Then, the input sets \(\mathbb {S}_i\) for CL-CNN are generated by:
$$\begin{aligned} \mathbb {S}_i = \mathbb {P} \backslash \{(r_i, c)\} \end{aligned}$$
where \(\mathbb {P}\backslash \{x\}\) denotes the set \(\mathbb {P}\) excluding the element x.
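Steps 1 and 2 can be sketched as follows, assuming the detector output is given as a mapping from region index to a (category, score) pair; the function and variable names are illustrative.

```python
def candidate_sets(detections, tau=0.7):
    """Build the leave-one-out input sets S_i for CL-CNN.

    `detections` maps region index r_i to a (category, score) pair
    (an assumed format for the Faster R-CNN output). Regions scoring
    at or below `tau` are dropped; each S_i is the surviving candidate
    set P with the element (r_i, c) removed.
    """
    P = {(r, c) for r, (c, s) in detections.items() if s > tau}
    return {r: P - {(r, c)} for (r, c) in P}
```

With the thresholds from the example above, a low-scoring detection is discarded and each surviving region yields one input set that omits it.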
Step 3. Evaluate context score of the inputs: Each input set \(\mathbb {S}_i\) is passed to CL-CNN to produce a result vector:
$$\begin{aligned} \hat{i} = \mathop {\text{ argmax }}\limits _{i}[ C( \mathbb {S}_i ) ], \end{aligned}$$
where the function \(C(\cdot )\) returns the positive output value of the CL-CNN. Before calculating \(C(\cdot )\), the input \(\mathbb {S}_i\) is converted into the input data structure described in Sect. 2.1. Since \(\mathbb {S}_i\) is the set \(\mathbb {P}\) excluding the element \((r_i, c)\), the object class c from the region \(r_{\hat{i}}\) is the most unlikely object in the context of the target image I. Therefore, \(\hat{i}\) is the index of the region that may cause a contextual violation.
Fig. 7.

CL-CNN results with natural image input. The average output value was 0.98 or higher.

In addition, we must consider the case where there is no contextual violation in the suspicious image. To reduce false-positive errors, we check whether \(C (\mathbb {S}_{\hat{i}})\) is larger than a user-defined threshold \(\tau _o\):
$$\begin{aligned} \left\{ \begin{array}{ll} \text {Forgery detected:} &{} \text {if } C (\mathbb {S}_{\hat{i}}) > \tau _o, \\ \text {No detection:} &{} \text {otherwise.} \end{array} \right. \end{aligned}$$
We use 0.8 for \(\tau _o\). There may also be cases where multiple objects are forged; these can be handled by checking the top n results of Eq. (5).
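The argmax and thresholding of Step 3 can be sketched as follows; `context_score` is a stand-in for the CL-CNN positive output \(C(\cdot)\), and the names are illustrative.

```python
def detect_contextual_violation(context_score, input_sets, tau_o=0.8):
    """Flag the region whose removal makes the rest most 'natural'.

    `input_sets` maps region index i to its leave-one-out set S_i;
    `context_score(S_i)` stands in for the CL-CNN positive output
    C(S_i). If the best score exceeds tau_o, the removed region is
    reported as the likely forgery; otherwise nothing is flagged.
    """
    i_hat = max(input_sets, key=lambda i: context_score(input_sets[i]))
    if context_score(input_sets[i_hat]) > tau_o:
        return i_hat
    return None
```

Extending this to multiple forged objects would mean returning the top-n regions by score instead of only the argmax.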

4 Experimental Results

For the experiment, we used sample images collected from the Microsoft COCO: Common Objects in Context [13] as described in Sect. 2.3. The implementation of CL-CNN is based on Caffe library [9].
Fig. 8.

CL-CNN and forgery detection results with manipulated images. The object in the red box was detected as a manipulated object. (Color figure online)

Experimental results for natural images (positive set) and forged images (negative set) are shown in Figs. 7 and 8. The natural images are from the COCO 2014 test database. Since no manipulation was detected in them, we report the output value when the CL-CNN input contains all the object sets \(\mathbb {P}\) extracted by the detector. For natural images, the average output value was 0.98 or higher. For instance, the combination of vases, indoor tables, and chairs is frequently observed in the COCO dataset, so all of these objects are judged natural, as shown in Fig. 7(a). Note that people appear frequently in the training dataset, so images containing people tend to be evaluated positively by CL-CNN; hence the somewhat large output values in Fig. 7(b) and (c).

On the other hand, we made forged images with arbitrary combinations of object classes (see Fig. 8). In Fig. 8(a), a manipulated boat among cars and trees causes a contextual violation; we confirm that the 'naturalness' of the image improves when the boat is removed.

However, our framework has some limitations. For example, a cow is manipulated beside a kite in Fig. 8(c), but the information that the cow is in the sky is lost during the object-detection step; both the cow and the kite are therefore judged inappropriate for the image. In Fig. 8(d), a horse and a baseball bat were manipulated, but only the baseball bat was selected as the awkward object. In such cases, other forgery detectors such as [2, 3, 14] can be combined with our method to improve detection accuracy.

5 Conclusion

In this study, we proposed a model (CL-CNN) that provides a contextual prior by directly learning combinations of image labels with convolutional neural networks. Combined with a well-known object detector such as R-CNN [5], the proposed method can evaluate contextual scores according to the combination of objects in the image.

However, the region-based object detector used in this study ignores the background, so the context between an object and the background cannot be directly evaluated. We therefore plan to enhance accuracy by incorporating deep-learning-based scene classification such as [12, 17]. We will also improve the model to give more robust and accurate results, paying closer attention to the generation of negative sets.



Acknowledgments. This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korean government (MSIP) (2017-0-01671, Development of high reliability elementary image authentication technology for smart media environment).


References

  1. Chen, M., Fridrich, J., Goljan, M., Lukas, J.: Determining image origin and integrity using sensor noise. IEEE Trans. Inf. Forensics Secur. 3(1), 74–90 (2008)
  2. Choi, C.H., Lee, H.Y., Lee, H.K.: Estimation of color modification in digital images by CFA pattern change. Forensic Sci. Int. 226, 94–105 (2013)
  3. Farid, H.: Exposing digital forgeries from JPEG ghosts. IEEE Trans. Inf. Forensics Secur. 4(1), 154–160 (2009)
  4. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
  5. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
  6. Hou, J.U., Jang, H.U., Lee, H.K.: Hue modification estimation using sensor pattern noise. In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 5287–5291 (2014)
  7. Hou, J.U., Lee, H.K.: Detection of hue modification using photo response non-uniformity. IEEE Trans. Circuits Syst. Video Technol. (2016)
  8. Huang, H., Guo, W., Zhang, Y.: Detection of copy-move forgery in digital images using SIFT algorithm. In: Pacific-Asia Workshop on Computational Intelligence and Industrial Application (PACIIA 2008), vol. 2, pp. 272–276 (2008)
  9. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
  10. Johnson, M.K., Farid, H.: Exposing digital forgeries in complex lighting environments. IEEE Trans. Inf. Forensics Secur. 2(3), 450–461 (2007)
  11. Kee, E., O'Brien, J.F., Farid, H.: Exposing photo manipulation from shading and shadows. ACM Trans. Graph. 33(5), Article No. 165 (2014)
  12. Lin, D., Lu, C., Liao, R., Jia, J.: Learning important spatial pooling regions for scene classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3726–3733 (2014)
  13. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014)
  14. Popescu, A., Farid, H.: Exposing digital forgeries by detecting traces of resampling. IEEE Trans. Sig. Process. 53(2), 758–767 (2005)
  15. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
  16. Ryu, S.J., Kirchner, M., Lee, M.J., Lee, H.K.: Rotation invariant localization of duplicated image regions based on Zernike moments. IEEE Trans. Inf. Forensics Secur. 8(8), 1355–1370 (2013)
  17. Zhang, F., Du, B., Zhang, L.: Saliency-guided unsupervised feature learning for scene classification. IEEE Trans. Geosci. Remote Sens. 53(4), 2175–2184 (2015)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Jong-Uk Hou
  • Han-Ul Jang
  • Jin-Seok Park
  • Heung-Kyu Lee

  1. School of Computing, KAIST, Daejeon, Republic of Korea
