1 Introduction

As an effective carrier of information, the digital image is widely used in many fields, including scientific research, media, and judicial expertise. The popularity of image editing software has reduced the cost of modifying image content but has also led to the dissemination on the Internet of a large number of tampered images containing false information. Copy-paste (copy-move) forgery is a typical tampering method: a region of the original image is copied and pasted onto another local region of the same image, covering the local information there. Because the tampered region is homologous to the original region in the same image, it is difficult to detect [1].

Existing forgery detection methods [2] mainly use the similarity between tampered regions as the detection criterion. The detection process comprises four steps, namely preprocessing, feature extraction, feature matching, and post-processing, among which feature extraction and feature matching play the key role. Since existing methods use a fixed threshold in the detection process, their generalization ability is limited. In addition, most of them rely on a single feature that is easily disturbed by naturally similar regions, which leads to a high false alarm rate. To address these problems, it is important to improve forgery detection by strengthening feature expression through feature fusion and by generating thresholds adaptively.

In this paper, an improved two-stage [3] copy-move forgery detection method based on parallel feature fusion is proposed. First, the SLIC super-pixel segmentation algorithm is applied to preprocess an image, and a threshold-free local region extraction algorithm is then used to obtain suspected tampered local regions with high similarity. To improve the feature expression ability, a new parallel feature combining the SIFT feature and the Hu moment feature is used to describe the extracted local regions. Finally, thresholds are generated adaptively from the histogram of oriented gradients (HOG) features of the suspected tampered regions and are used to determine the attributes of the local regions, which improves the generalization of the proposed method. The experimental results show that the proposed method achieves high accuracy of 99.01% and 98.5% on the MICC-F2000 and MICC-F220 datasets, respectively, and also shows strong robustness on the COMOFOD dataset.

The rest of this paper is organized as follows. Section 2 describes the current related works of copy-paste forgery detection technology. Section 3 introduces an improved two-stage forgery detection method based on parallel feature fusion. Section 4 presents the verification and analysis results of the proposed scheme from various aspects. Finally, Sect. 5 concludes the paper and gives future work directions.

2 Related works

Existing forgery detection methods rely on the similarity between the copied region and the pasted region, and can be roughly divided into three categories: image partition-based, feature point-based, and deep learning-based methods [4].

The image partition-based methods first segment and sample an image to obtain local regions, then extract features of the local regions for matching, and finally identify the similar local regions. The commonly used features include the Discrete Cosine Transform (DCT) coefficients [5,6,7], RGB fusion information [8], Discrete Wavelet Transform (DWT) [9], Zernike moments [10, 11], Analytical Fourier-Mellin Transform (AFMT) [12], Hu invariant moments, Polar Cosine Transform (PCT) [13], PCET-SVD [14], CMF-iteMS [15], and Stationary Wavelet Transform (SWT) [16]. These features describe an image’s local region in the color domain or a transform domain, and each has its own trade-offs. For instance, the DCT and RGB features have low computational complexity, but their robustness is weak. The Hu feature is composed of seven invariant moments, which describe object shape well and offer a degree of robustness at low computational complexity. Zernike moments, PCT and AFMT need to map an image to a higher order, which has higher computational complexity but achieves slightly stronger robustness than the RGB and other features.

The feature point-based methods extract key points in the high-entropy regions of an image and construct descriptors to complete region matching. Commonly used features include SIFT [17,18,19,20,21], SURF [22, 23], ASIFT [24], ORB [25] and LBP [26]. The feature points are distributed where the gray level of an image changes dramatically, and their robustness mainly depends on how the feature descriptor is formed. Among these features, SIFT detects key points in scale space and forms feature descriptors using the main direction mechanism, which provides strong robustness against attacks such as illumination, rotation and scale changes. For example, Tahaoglu et al. [27] proposed a forgery detection and localization method that does not rely on forgery region features but obtains SIFT key points in the RGB and LAB color spaces. However, the SIFT feature vector has 128 dimensions, so its computational complexity is high.

The deep-learning-based methods do not require manually extracted features but learn the internal characteristics of a forged image through training to complete the forgery detection. Wu et al. [28] designed a deep matching and validation model based on a simple convolutional neural network (CNN) for recognition. They further presented an end-to-end detection model with a two-branch structure, namely BusterNet, which includes a manipulation detection branch (Mani-Det Branch) and a similarity detection branch (Simi-Det Branch) [29]. Jaiswal and Srivastava [30] proposed an encoder-decoder CNN model based on multi-scale input and multi-level convolution layers, which classifies pixels as forged or unforged through a final sigmoid activation function. Liu et al. [31] proposed a novel convolutional kernel network (CKN) based on an improved CNN structure that can greatly reduce the training time cost.

Recently, a number of multi-feature-based methods have been proposed to improve detection performance. Sunitha et al. [32] presented a keypoint-based method that detects copy-move forgery efficiently with a hybrid feature. Peng et al. [33] proposed a progressive hybrid feature-based method, which uses no threshold when obtaining similar local regions. Khan et al. [34] developed a detection method that combines block features and feature points, but this method has a relatively high false alarm rate. Pun et al. [35, 36] combined the SURF and DAMFT features, but their method has high time complexity and still uses a fixed threshold when judging a region’s attributes.

In summary, the image partition-based methods generally adopt a global search strategy, which can achieve high accuracy but with a high false alarm rate, because the features used, such as SIFT, are easily disturbed by natural similarity. The deep-learning-based methods can automatically extract features, but their robustness needs to be improved because of the uncertainty of deep feature generation and the need for large amounts of training data. To achieve a low false alarm rate and high robustness, we explore how to combine different features and attempt to propose an improved method with adaptive thresholds in this paper.

3 Proposed method

In [33], the authors used uniform non-overlapping segmentation in the image pre-processing stage, but this approach did not consider the correlation between pixels, easily resulting in losing the correlation between local regions in the image. The number of regions acquired by the Hu and SIFT features is inconsistent in terms of progressive hybrid features. In addition, a fixed threshold value is still used in the final determination of regional attributes. To solve these problems, this paper proposes an adaptive two-stage forgery detection method based on super-pixel segmentation and parallel feature fusion. The proposed method is shown in Fig. 1, where it can be seen that it includes two stages, coarse-grained detection and fine-grained detection.

Fig. 1 The proposed method flowchart

In the first stage, the simple linear iterative clustering (SLIC) algorithm is used to preprocess an image to obtain a set of irregular local regions with semantic information, and the SIFT feature is used to characterize these regions and establish the correlation distribution map. Then, the candidate tampering regions are obtained according to the correlation distribution map.

In the second stage, the candidate tampering regions are combined with some similar local regions first, and then, a parallel fusion feature is extracted to express the characteristics of local regions. Next, the thresholds are adaptively generated according to the HOG feature of matched local regions, which is used to decide whether a local region has been tampered.

3.1 Preprocessing step

In the preprocessing step, a color image is first mapped to the gray space and then segmented by the SLIC algorithm to obtain the local regions. The SLIC algorithm clusters pixels iteratively to generate irregularly shaped super-pixels and performs well in terms of running speed, compactness and contour preservation. Its main process is to transform an RGB color image into 5-dimensional feature vectors consisting of the CIELAB color components and the XY coordinates, construct a distance metric on these vectors, and perform local clustering of the image pixels. The distance metric of the super-pixels \({D}^{{\prime}}\) is calculated by Eq. (1).

$$\begin{array}{c}{D}^{{\prime}}=\sqrt{{\left({d}_{\mathrm{c}}/{N}_{\mathrm{c}}\right)}^{2}+{\left({d}_{\mathrm{s}}/{N}_{\mathrm{s}}\right)}^{2}}\end{array}$$
(1)

where \({d}_{\mathrm{c}}\) and \({d}_{\mathrm{s}}\) represent the color distance and the spatial distance respectively; \({N}_{\mathrm{s}}\) is the maximum spatial distance within the class, which is applicable to each cluster; \({N}_{\mathrm{c}}\) is the maximum color distance. The pseudocode of the SLIC is given in Algorithm 1.
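As a minimal sketch of Eq. (1), the function below evaluates \({D}^{{\prime}}\) for one pixel and one cluster center. It assumes that both are given as 5-tuples `(l, a, b, x, y)` and that \({d}_{\mathrm{c}}\) and \({d}_{\mathrm{s}}\) are Euclidean distances; the function and argument names are illustrative and not taken from the paper.

```python
import math


def slic_distance(pixel, center, n_c, n_s):
    """Combined SLIC distance D' between a pixel and a cluster center (Eq. 1).

    pixel, center: 5-tuples (l, a, b, x, y), i.e. CIELAB color plus XY position.
    n_c, n_s: maximum color distance and maximum spatial distance (normalizers).
    """
    d_c = math.dist(pixel[:3], center[:3])  # color distance in CIELAB
    d_s = math.dist(pixel[3:], center[3:])  # spatial distance in the image plane
    return math.sqrt((d_c / n_c) ** 2 + (d_s / n_s) ** 2)
```

A larger \({N}_{\mathrm{c}}\) relative to \({N}_{\mathrm{s}}\) reduces the weight of color differences, which favors more compact super-pixels.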

Algorithm 1 Pseudocode of the SLIC algorithm

The segmentation effects of non-overlapping segmentation and SLIC segmentation on a gray-scale image are compared in Fig. 2. Traditional segmentation simply divides regions without considering the correlation between pixels, whereas the threshold-free local region extraction algorithm in the next step operates on the correlation between local regions. Therefore, when a tampering region is divided into several different local regions, the correlation between them is weakened, and the local region extraction algorithm fails to accurately mine the local information of an image containing the tampering region. Conversely, the SLIC segmentation clusters correlated pixels into a super-pixel block of irregular shape, which has the visual integrity of an object and retains as much tampering region information as possible.

Fig. 2 Comparison of the segmentation effects of two different segmentation schemes. a Uniform non-overlapping segmentation, b SLIC segmentation

3.2 Similar local region extraction without threshold

The process of similar local region extraction is to find the pairs of matching regions with a high similarity according to the degree of feature point matching between regions, and denote them as the candidate tampering regions. The specific process is shown in Fig. 3.

Fig. 3 The block diagram of the similar local region extraction process

First, each local region is described by SIFT features, and the two-nearest-neighbor algorithm is used to match the SIFT feature points of the local regions. For each feature point, the smallest and second-smallest Euclidean distances to the candidate points are compared, and the feature matching is considered successful when the smallest distance passes this comparison (the default value is 10).
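The two-nearest-neighbor idea can be sketched as follows. This is an illustration only: it uses the widely used ratio form of the test (accept a match when the smallest distance is clearly below the second smallest), and the `ratio` default of 0.75 is an assumption that does not correspond to the default value quoted above. Descriptors are plain sequences of numbers here rather than real 128-dimensional SIFT vectors.

```python
import math


def two_nn_matches(desc_a, desc_b, ratio=0.75):
    """Return (i, j) pairs where desc_a[i] is matched to desc_b[j].

    A match is kept only if the nearest neighbour is clearly closer than the
    second nearest one. `ratio` is a hypothetical default for this sketch.
    """
    matches = []
    if len(desc_b) < 2:
        return matches  # the test needs two neighbours to compare
    for i, da in enumerate(desc_a):
        # Distances from descriptor i to every descriptor of the other region.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches
```

Ambiguous points, whose two nearest neighbours are almost equally distant, are dropped instead of being matched arbitrarily.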

Second, the RANSAC algorithm is used to remove false matches from the matching points obtained above. Specifically, a homography matrix containing geometric information such as the rotation and scaling of the tampered region is estimated for the matching points using an iterative mechanism. Then, the correlation confidence of two matching points is calculated based on the estimated homography matrices. If the correlation confidence is lower than a fixed threshold (the default value is set as 0.995 in this paper), the two matching points are regarded as mismatched.

Next, according to the matching results of feature points, the correlation distribution map between local regions is established, as shown in Fig. 4. This map represents the correlation between two local regions, and the larger its value is, the stronger the correlation between two regions will be. Thus, a pair of two matching regions with the greatest correlation can be a tampering region. Therefore, the local regions that correspond to the peak and sub-peak of correlation are selected as the candidate tampering regions.
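A small sketch of how the peak and sub-peak could be selected is given below. It assumes that the correlation between two regions is measured by the number of matched key points they share and that matches falling inside a single super-pixel are ignored; both choices, as well as all names, are assumptions made for illustration.

```python
from collections import Counter


def candidate_region_pairs(point_matches, region_of, top_k=2):
    """Count matched key points per region pair and return the strongest pairs.

    point_matches: iterable of (p, q) matched key-point ids.
    region_of: dict mapping a key-point id to its super-pixel label.
    top_k=2 keeps the peak and the sub-peak of the correlation map.
    """
    correlation = Counter()
    for p, q in point_matches:
        a, b = region_of[p], region_of[q]
        if a != b:  # assumption: matches inside one super-pixel are skipped
            correlation[tuple(sorted((a, b)))] += 1
    return correlation.most_common(top_k)
```

The returned list holds `((region_a, region_b), count)` entries ordered from the strongest correlation downwards.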

Fig. 4 Correlation distribution map

The candidate tampering regions obtained by coarse-grained detection are a set of local regions with high correlation, which includes both the real tampering regions and the natural original regions with high similarity. Therefore, it is necessary to accurately determine the candidate tampering super-pixels in the next stage of fine-grained detection.

3.3 Local region combination

To obtain larger receptive fields and to generate more complete candidate tampering regions, the acquired target local regions need to be combined. The specific process is shown in Fig. 5: the target super-pixel is set as the center and combined with its neighborhood super-pixels; namely, the pixel labels of the neighborhood super-pixels are changed to that of the target super-pixel.
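The relabeling step can be illustrated with a small sketch. It assumes a per-pixel label map stored as a 2-D list and treats every super-pixel that touches the target through 4-connected pixels as a neighbor; this is a simplification that merges all adjacent super-pixels and does not model the choice between four-, six- and eight-neighborhoods.

```python
def merge_neighbours(labels, target):
    """Relabel every super-pixel that touches `target` with the target's label.

    labels: 2-D list of integer super-pixel labels (one entry per pixel).
    Returns a new label map; the input is left unchanged.
    """
    h, w = len(labels), len(labels[0])
    neighbours = set()
    for y in range(h):
        for x in range(w):
            if labels[y][x] != target:
                continue
            # 4-connected pixel neighbours decide which super-pixels are adjacent.
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != target:
                    neighbours.add(labels[ny][nx])
    return [[target if v in neighbours else v for v in row] for row in labels]
```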

Fig. 5 Illustration of the region combination process

In the process of region combination, if two super-pixels are too close to each other, local aliasing can easily occur. Thus, it is necessary to consider the relative position of the two suspected tampering regions. First, each super-pixel is labeled and the distance between the super-pixels is calculated by Eq. (2), where d represents the relative position of two local regions, abs() represents the absolute value function, and \({c}_{i}\) represents the location coding of local region i.

$$d=\mathrm{abs}\left({c}_{1}-{c}_{2}\right)$$
(2)

According to the relative position between two local regions, there are two possible cases.

(1) A region in the matched pair is at the image edge.

(2) No region in the matched pair is at the image edge.

In case (1), due to the edge effect, the eight-neighborhood cannot be used for a local region on the image edge, so the four-neighborhood combination is adopted. In case (2), the n-neighborhood is adopted, and the value of n depends on the relative position d, as given by Eq. (3), where L represents the sampling size. When two local regions are distant, the eight-neighborhood combination mode is chosen; otherwise, the six-neighborhood combination mode is chosen.

$$n=\left\{\begin{array}{ll}6, &\quad d=1,L, L+1,L-1,L+2, L-2,2L+1,2 L-1,2 L+2,2 L-2\\ 8,& \quad {\text{else}}\end{array}\right.$$
(3)
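Equations (2) and (3), together with the edge case, can be condensed into one small function. The sketch assumes integer location codes; the function name and the `on_edge` flag are illustrative.

```python
def neighbourhood_size(c1, c2, L, on_edge=False):
    """Number of neighbours to combine around a matched pair of super-pixels.

    c1, c2: location codes of the two regions; L: sampling size.
    on_edge: True if either region lies on the image edge (case 1).
    """
    if on_edge:
        return 4  # case (1): four-neighbourhood on the image edge
    d = abs(c1 - c2)  # Eq. (2)
    close = {1, L, L + 1, L - 1, L + 2, L - 2,
             2 * L + 1, 2 * L - 1, 2 * L + 2, 2 * L - 2}
    return 6 if d in close else 8  # Eq. (3)
```

For example, with L = 10 the codes 12 and 13 give d = 1 and therefore n = 6, whereas the codes 5 and 40 give d = 35 and n = 8.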

3.4 Parallel feature fusion

To eliminate the interference introduced by naturally highly correlated regions, Peng et al. [33] proposed a forgery detection method based on progressive fusion features. In this method, the SIFT feature of each local region is first extracted and matched, and then the Hu moment feature is extracted from the neighborhood of each pair of matched feature points and matched a second time. Finally, the attribute of the local region is judged according to the rules. The fusion feature with a progressive structure can effectively combine the SIFT feature with the Hu feature to enhance the detection robustness and to avoid the interference of similar natural regions caused by illumination invariance. However, this approach has certain problems. First, the scheme matches the feature points twice, which is highly time-consuming. Second, the expression ability of the progressive fusion features is not strong enough to make full use of the SIFT or Hu features, leading to a relatively high false alarm rate.

Moreover, the Hu moment feature is calculated for the neighborhood pixels of the SIFT-based matched key points, so its scope is limited to the matched feature points, which makes it difficult to describe the local region accurately, leading to the phenomenon of "missing matching" in the feature point matching algorithm. As shown in Fig. 6, some of the discrete feature points are not judged as matching points.

Fig. 6 Scope of the Hu moment

To solve the above problems, a new parallel fusion feature is proposed to describe a local region with suspected tampering. The block diagram of the fusion process is shown in Fig. 7. First, a set of SIFT feature points is obtained from the candidate local regions. Then, the SIFT and Hu features are calculated in the neighborhood of the feature points and combined simultaneously, and the final descriptor corresponding to the pair of matched regions is constructed after normalization.

Fig. 7 The block diagram of the parallel feature fusion process

For each feature point, a seven-dimension vector of the Hu moment feature (\({\mathrm{Hu}}_{7}\)) and a 128-dimension vector of the SIFT feature of the neighborhood pixels (\({\mathrm{SIFT}}_{128}\)) are generated. Since the first to fourth components in the vector of the Hu moment feature have strong invariance, only the first four-dimension vector of the Hu feature (\({\mathrm{Hu}}_{4}\)) and \({\mathrm{SIFT}}_{128}\) are combined to generate the final 134-dimension feature vector. Then, it is normalized to eliminate the dimensionality effect and used as a feature descriptor in region matching, namely the parallel fusion feature (\({\mathrm{ParallelF}}_{134}\)) expressed by Eq. (4), where concat(\(\cdot\)) denotes a concatenation function and Normalize(\(\cdot\)) represents a normalization function provided by OpenCV.

$${\mathrm{ParallelF}}_{134}=\mathrm{Normalize}\left(\mathrm{concat}\left({\mathrm{SIFT}}_{128}, {\mathrm{Hu}}_{4}\right)\right)$$
(4)

Progressive feature fusion extracts the Hu moment feature from the matched feature points and performs a secondary matching on this feature. In contrast, parallel feature fusion directly extracts the SIFT and Hu moment features from the extracted feature point set and combines them, which expands the extraction range of the Hu moment feature and enhances the expression ability of the fused feature.
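The concatenate-then-normalize step of Eq. (4) can be sketched in a few lines. Two assumptions are made: the normalization is taken to be an L2 normalization, since the exact OpenCV normalization settings are not stated, and the descriptors are plain Python sequences. Concatenating a 128-component SIFT vector with the first four Hu components gives 132 entries in this sketch, so it does not attempt to reproduce the length suggested by the subscript of \({\mathrm{ParallelF}}_{134}\).

```python
import math


def parallel_fusion(sift, hu, n_hu=4):
    """Concatenate a SIFT descriptor with the first `n_hu` Hu moments, then L2-normalize.

    sift: sequence of SIFT descriptor components.
    hu: sequence of Hu moments (only the first `n_hu` are used).
    """
    fused = list(sift) + list(hu[:n_hu])
    norm = math.sqrt(sum(v * v for v in fused))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [v / norm for v in fused] if norm > 0 else fused
```

Because both parts are computed for the same feature point and matched as a single vector, only one round of matching is needed.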

3.5 Adaptive threshold generation based on HOG level

Generally, traditional methods use thresholds in two situations: (1) to measure whether there is a similarity between local regions or features, and (2) to measure whether the similarity of regions meets the standard of a copy-move forgery. At present, there have been no uniform standards for selecting a fixed threshold. In addition, different images have different characteristics such as color, illumination, or texture, so it is challenging to choose a threshold that will be suitable for most images. In this paper, an adaptive threshold generation algorithm based on the HOG level is adopted. After the description and matching of super-pixels by the parallel fusion feature, a threshold is automatically generated to determine whether two matched super-pixels denote tampering regions. The schematic of this process is shown in Fig. 8.

Fig. 8 The block diagram of the adaptive threshold generation process

The HOG is a feature descriptor that represents the texture information of an image’s local regions through gradient information. If the super-pixels A and B contain a tampering region, their texture information should be the same.

For super-pixels A and B containing the candidate tampering regions, their HOG features are extracted, and the corresponding HOG levels representing their texture richness are calculated by Eqs. (5) and (6), respectively, where \(i=1,2,3,\dots ,n\), and \(c=1,2\); n represents the total dimension of the HOG feature, \({x}_{i}\) represents the ith component of the HOG feature, and \({E}_{c}\) represents the average gradient intensity of a local region.

$${E}_{c}=\frac{1}{n}\sum_{i=1}^{n}{x}_{i}$$
(5)
$$\mathrm{HOG}\_\mathrm{Level}=\left({E}_{1}+{E}_{2}\right)/2$$
(6)

Most of the SIFT feature points exist in the high entropy region of the image, which is the region with rich texture, and the texture richness of a region will directly affect the number of SIFT feature points. When the tampering region is relatively flat, the feature points will attenuate to a certain extent. But when the texture of the tampering region is rich, it will have more feature points. That is, the number of feature points directly affects the amount of information available for similarity calculation. Thus, the HOG level of the pair of matched local regions can be used for dynamically adjusting the threshold T, which can be calculated by Eq. (7), where m is the proportionality factor, with the default value of one. The pseudocode of the adaptive threshold generation algorithm is given in Algorithm 2.

$$T=\mathrm{HOG}\_\mathrm{Level}\times m$$
(7)
Algorithm 2 Pseudocode of the adaptive threshold generation algorithm

The adaptive threshold generation algorithm sets the thresholds according to the characteristics of the local regions, which increases the generalization and robustness of the proposed detection method. The generated threshold value is used as a criterion to determine whether the local region is a tampering region. Namely, the correlation between super-pixels A and B is compared with the threshold value T, and if the similarity is greater than T, the region is considered as a tampering region.
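The threshold rule of Eqs. (5) to (7) can be sketched as follows. The sketch reads the HOG level as the mean of \({E}_{1}\) and \({E}_{2}\), which is an interpretation of Eq. (6), and it takes the HOG vectors and the similarity score as already computed inputs; the function names are illustrative.

```python
def hog_level(hog_a, hog_b):
    """HOG level of a matched pair: mean gradient intensity averaged over both regions."""
    e1 = sum(hog_a) / len(hog_a)  # Eq. (5) for super-pixel A
    e2 = sum(hog_b) / len(hog_b)  # Eq. (5) for super-pixel B
    return (e1 + e2) / 2          # Eq. (6), read as the mean of E1 and E2


def is_tampered(similarity, hog_a, hog_b, m=1.0):
    """Compare the similarity of a matched pair with the adaptive threshold T."""
    threshold = hog_level(hog_a, hog_b) * m  # Eq. (7), m is the proportionality factor
    return similarity > threshold
```

Under this rule, a richly textured pair (high HOG level) must reach a higher similarity to be flagged than a flat pair.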

4 Results and discussion

4.1 Datasets and evaluation metrics

The experiments are run on a PC with an Intel Core i7-9700K CPU and an Nvidia Tesla P40 GPU under the Windows 10 operating system, and the software environment is Microsoft Visual Studio 2019. The true positive rate (TPR), false positive rate (FPR) and F-measure (F1) are used as the evaluation metrics of the proposed algorithm's detection performance.
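For reference, the three metrics can be computed from image-level confusion counts as sketched below. The standard definitions are assumed here, since the formulas are not spelled out in the text: TPR = TP/(TP+FN), FPR = FP/(FP+TN), and F1 as the harmonic mean of precision and TPR. The sketch also assumes that none of the denominators is zero.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (TPR, FPR, F1) from image-level confusion counts."""
    tpr = tp / (tp + fn)        # share of tampered images that are flagged
    fpr = fp / (fp + tn)        # share of original images that are wrongly flagged
    precision = tp / (tp + fp)  # share of flagged images that are really tampered
    f1 = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, f1
```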

The performance of the proposed method is verified by experiments on three public datasets, namely the MICC-F220, MICC-F2000 and COMOFOD datasets [17, 37]. The basic information on the datasets is given in Table 1, and some examples of tampered images in the datasets are shown in Fig. 9, where the green boxes are the source regions and the red boxes are the tampered regions.

Table 1 Basic information on datasets used in the experiments
Fig. 9 Some examples of test images from the three different datasets

The MICC-F220 and MICC-F2000 datasets are used to verify the robustness of the proposed method against geometric attacks, including translation, rotation, and stretch and the different combinations of the above three operations. According to the degrees of rotation, stretch and translation, there are different requirements for algorithm robustness. The scaling scales in the x-axis direction and y-axis direction are denoted as \({S}_{x}\) and \({S}_{y}\) respectively; the rotation angle of a local region is denoted as \(\theta\); the attack degrees of the MICC-F220 and MICC-F2000 datasets are H and J, respectively. The attack degrees of these two datasets are given in Tables 2 and 3, respectively.

Table 2 Type of geometric attack in MICC-F220 dataset
Table 3 Type of geometric attack in MICC-F2000 dataset

The COMOFOD dataset is used to test the robustness of the algorithm from two aspects: attack type and attack degree. Among them, tampered images are accompanied by various post-processing attacks of different degrees, including JPEG compression, blur, contrast change, color adjustment, brightness attack and Gaussian noise. The information on the attack types in this dataset is given in Table 4.

Table 4 Attack types in COMOFOD dataset

4.2 Module validity testing

To verify the validity of the SLIC, parallel feature fusion and adaptive threshold generation modules, four different schemes are evaluated on the MICC-F2000 dataset. The method proposed in [33] (Scheme 1) is used as the baseline, and Scheme 4 is the proposed method. The experimental results are shown in Table 5.

Table 5 Detection results of different schemes with different module combination

Compared with Scheme 1, the TPR of Scheme 2 is increased by 0.3% and the FPR is decreased by 0.2%, indicating that SLIC improves the detection performance to a certain extent. Compared with Scheme 2, the TPR of Scheme 3 is increased by 0.4% and the FPR is decreased by 0.7%, indicating that the parallel fusion feature is stronger than the progressive fusion feature in characterizing the local region and can accurately identify the tampering region without being disturbed by similar natural regions. In Scheme 4, the adaptive threshold generation algorithm is added to further improve the detection of the tampering region, and the TPR of this scheme is 0.6% higher than that of Scheme 3. Although the FPR of Scheme 4 is increased by 0.2%, its F1 is still 0.2% higher than that of Scheme 3, indicating that the overall performance of the proposed Scheme 4 is better than that of the other schemes.

4.3 Comparison with other methods

The proposed method is compared with other methods, and the comparison results are given in Tables 6 and 7. Table 6 shows the results of different methods on the MICC-F220 dataset. As given in Table 6, the TPR of the proposed method is 99.1%, which is consistent with that of Alberry’s method [34]. However, for the proposed method, the FPR is 2% lower and F1 is 8% higher than those of Alberry’s method. Although the FPR of the proposed method is 1.2% and 3.2% higher than those of Resmi [21] and Das [16], its TPR is 8.2% higher, and its F1 is 4.6% and 2.9% higher than those of these two methods, respectively. Thus, the proposed method achieves higher accuracy, a lower false alarm rate, and better overall detection performance. As shown in Table 7, the TPR, FPR and F1 of the proposed method are the best on the MICC-F2000 dataset among all the methods.

Table 6 Comparison of different methods on the MICC-F220 dataset
Table 7 Comparison of different methods on the MICC-F2000 dataset

As shown in Tables 6 and 7, the detection effect of the proposed algorithm on the MICC-F220 dataset is not significantly improved compared with the results on the MICC-F2000 dataset. The main reason is that the post-processing attacks in the MICC-F220 dataset are not as varied as those in the MICC-F2000 dataset. Namely, there is only a small amount of equal stretching in the MICC-F220 dataset, while unequal stretching and combined attacks are included in MICC-F2000, which requires higher robustness of the detection methods. The existing methods have certain robustness against small-scale equal stretching, but their resistance against unequal stretching and combined attacks still needs further improvement. Therefore, the performance of the existing methods on the MICC-F2000 dataset is slightly lower than that on the MICC-F220 dataset.

4.4 Robustness analysis of different methods

The detection accuracy of the proposed method and the method presented in [33] under different degrees of geometric attacks is compared in Fig. 10, where H1–H9 contain rotation, equal-proportion scaling and their combinations, and the scale factor ranges from 1 to 1.5. As shown in Fig. 10, both methods perform well against the H1–H9 attack types on the MICC-F220 dataset. The results show that, as the attack degree deepens, the proposed method still maintains an accuracy higher than 92% under the H10 attack.

Fig. 10 Robustness comparison on the MICC-F220 dataset

The MICC-F2000 dataset contains tampered images with both equal and unequal stretches, and its scale factor range is wider than that of the MICC-F220 dataset. In addition, unequal stretches have different scaling factors in different directions, resulting in a relatively large distortion of the target region, which affects the similarity between local regions and thus the detection effect, so a detection method with high robustness is required to recognize the geometric attack. As shown in Fig. 11, the TPR of the proposed method is clearly improved at the three levels J11, J12 and J13; compared with Peng’s method [33], the increases are 2%, 2% and 4%, respectively. Therefore, the proposed method shows stronger robustness against the combined geometric attacks of unequal scale transformation and rotation than Peng’s method.

Fig. 11 Robustness comparison on the MICC-F2000 dataset

The resistance of different methods to post-processing attacks is presented in Fig. 12. For the convenience of observation, the compared methods are marked in Fig. 12 using the labels shown in Table 8. Based on the results, the detection ability of the proposed method is better than that of the other methods under most attack types.

Table 8 Labels of different existing methods

As shown in Fig. 12a, the recognition ability of the proposed method is excellent for the JPEG attack, and its accuracy is about 40% higher than that of the BusterNet in the case of the JC4 attack. In addition, the increase in the compression factor does not significantly affect the TPR of the proposed method, indicating that the proposed method also has a strong resistance to JPEG compression. As shown in Fig. 12b, our method also resists brightness changes well; namely, its TPR does not decrease as the brightness adjustment range increases, which is due to the illumination invariance of the SIFT features. For the contrast attack, according to Fig. 12c, the TPR of the proposed method decreases with the expansion of the contrast range, and its accuracy is low in the case of the CA3 attack. The contrast transformation is mainly implemented by applying gray histogram equalization to local regions and thus leads to a difference in features between the copied region and the pasted region. Therefore, the proposed method is more sensitive to the contrast attack than the other methods, but it can still achieve a good accuracy. In addition, Fig. 12d shows that the proposed method can effectively resist the color attack.

Fig. 12 Robustness comparison of different methods for different attack types. a JPEG attack, b brightness change, c contrast change, d color change, e blur attack, f noise addition

However, Fig. 12e shows that the accuracy of the proposed method is low under the blur attack: it is second best at the IB1 level, with a TPR 4% lower than that of the BusterNet, and its TPR decreases as the blur intensity increases, reaching only 35% accuracy under the IB3 attack. The essence of the blur attack is a weighted average of the pixels in a local region, which greatly weakens the gradient information of local regions, especially those with a rich texture. The SIFT key points are extreme values in local regions, which mostly have a rich texture. Therefore, the reduction in regional gradient information decreases the number of SIFT key points; that is, the amount of information used to describe the similarity of local regions decreases sharply, leading to a significant decrease in the detection performance of the proposed method. As shown in Fig. 12f, the proposed scheme also has a strong resistance to noise.

As analyzed above, the proposed method achieves good robustness to JPEG compression, color change, brightness change, and noise addition, and it can withstand a certain degree of contrast change and fuzzy attack; among these, its resistance to the fuzzy attack can still be further improved.

4.5 Comparison with deep-learning-based methods

The recent success of deep learning in the field of pattern recognition has motivated many scholars to apply deep learning technology to image forensics. In copy-move forgery detection, both the BusterNet and the convolutional kernel network (CKN) proposed by Liu et al. [29] have been used in recent years. The BusterNet uses a two-stream network structure to identify the tampered regions of an image. As shown in Fig. 12, the BusterNet is more robust than the proposed method only under the fuzzy attack and is slightly less robust than the proposed scheme under the other attacks.

The CKN is a variant of the CNN that accelerates the training speed of the CNN. The comparison results of the proposed method and the CKN are given in Table 9, which shows that the proposed scheme is superior to the CKN in terms of the TPR, FPR, and F1 metrics. Compared with the CKN, the TPR of the proposed method is 5.5% higher and its FPR is 5.3% lower, indicating that the overall performance of the proposed method is better than that of the CKN.
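For reference, the three metrics can be computed from image-level confusion counts as in the following minimal sketch. It assumes the standard definitions of TPR, FPR, and F1; the function name and the counts in the usage example are hypothetical and are not taken from Table 9.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (TPR, FPR, F1) from confusion counts, using the standard definitions."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall on tampered images
    fpr = fp / (fp + tn) if (fp + tn) else 0.0        # false alarms on original images
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + tpr
    f1 = 2 * precision * tpr / denom if denom else 0.0
    return tpr, fpr, f1

# Hypothetical counts for illustration only: 100 tampered and 100 original images.
tpr, fpr, f1 = detection_metrics(tp=90, fp=5, tn=95, fn=10)
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}  F1={f1:.3f}")
```

A higher TPR together with a lower FPR, as reported for the proposed method relative to the CKN, raises precision and recall at the same time and therefore also raises F1.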

Table 9 Comparison of different methods

Although deep learning technology has the advantages of automatic feature extraction and strong generalization, it still has certain technical difficulties in the field of copy-move forgery detection, which can be summarized as follows.

(1) Fewer features to be learned. The traditional recognition task is mainly to detect various objects in an image, and the set of object features that can be learned in the training process is relatively rich; for example, it includes eyes, hair, and contours in the task of cat and dog recognition. However, in the task of copy-move forgery detection, the tampered regions can be randomly selected, and the training dataset cannot provide significant training features to the network.

(2) Fewer public datasets. Network model training requires a labeled dataset of a certain size. At present, the application of deep learning technology in this field is still at a preliminary stage. In addition, the typically used databases, such as GRIP, CASIA, MICC-F220, and MICC-F2000, include a small number of images and contain a variety of post-processing methods, which makes it difficult to provide enough valuable learning information to the model. Self-made datasets are a possible alternative, but they are difficult to label effectively.

(3) Forensics technology based on deep learning has a strong dependence on the training dataset and requires that the training and testing sets have the same data distribution. However, in practical applications, the consistency between a random image to be detected and the data used to train the network cannot be guaranteed.

(4) In addition, the training process of a deep-learning-based model has a high time cost. In contrast, the proposed scheme neither depends on a training set nor needs to train a detection model, yet it has strong robustness, which gives it a certain practical significance.

4.6 Comparison of running times

This section compares the running times of different methods. In the test, 25 original images and 25 tampered images are randomly selected from the MICC-F2000 dataset. The proposed method is applied to these 50 images, and the average running time for processing one image is calculated. The running time comparison of the proposed method and the existing methods is shown in Fig. 13.
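The measurement protocol described above can be sketched as follows. This is only an illustration of per-image averaging: the function name, the dummy detector, and the placeholder image list are hypothetical, and the actual timing code used in the experiments is not given in this paper.

```python
import time

def average_running_time(detector, images):
    """Return the mean wall-clock time in seconds that detector spends per image."""
    if not images:
        raise ValueError("images must not be empty")
    start = time.perf_counter()
    for image in images:
        detector(image)
    return (time.perf_counter() - start) / len(images)

# Hypothetical stand-ins: placeholders for 25 original and 25 tampered images,
# and a dummy detector that only records that it was called.
calls = []

def dummy_detector(image):
    calls.append(image)
    return False

sample = [f"original_{i}" for i in range(25)] + [f"tampered_{i}" for i in range(25)]
avg_seconds = average_running_time(dummy_detector, sample)
print(f"average time per image: {avg_seconds:.6f} s")
```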

Fig. 13 Running time comparison of different methods

The results show that the running time of the proposed method ranks in the middle of the compared methods. Compared with Peng's method [33], the additional time is incurred mainly in two places. In the preprocessing step of the first stage, instead of using uniform segmentation, the proposed method adopts the SLIC algorithm, which uses an iterative mechanism to cluster pixel values and to obtain irregular and meaningful super-pixels. The clustering process involves complex steps such as feature construction, distance measurement, and seed point updating, which are time-consuming. In the second stage, parallel feature fusion and adaptive threshold generation require three types of features to be extracted, which also takes considerable time. In other words, the SLIC segmentation and the adaptive threshold generation algorithm, which are introduced to improve the detection accuracy and robustness, both have a high complexity. However, as observed in Fig. 13, the running time of the proposed method is within a tolerable range, so the method remains practical.
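To make the cost of the clustering step concrete, the three steps named above (feature construction, distance measurement, and seed point updating) can be sketched in a simplified form. This is an illustrative sketch and not the implementation used in the experiments: it assumes a grayscale image instead of a CIELAB color image, compares every pixel with every seed rather than searching a local window around each seed, and uses hypothetical function and parameter names.

```python
import numpy as np

def slic_gray(img, n_segments=16, compactness=10.0, n_iters=5):
    """Simplified SLIC-style clustering of a grayscale image; returns a label map."""
    h, w = img.shape
    step = max(1, int(np.sqrt(h * w / n_segments)))  # grid interval S
    # Feature construction: every pixel becomes (intensity, y, x).
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([img, ys, xs], axis=-1).astype(float).reshape(-1, 3)
    # Seed initialization on a regular grid with interval S.
    seeds = np.array(
        [[img[y, x], y, x]
         for y in range(step // 2, h, step)
         for x in range(step // 2, w, step)],
        dtype=float,
    )
    for _ in range(n_iters):
        # Distance measurement: D^2 = d_c^2 + (m / S)^2 * d_s^2.
        d_c2 = (feats[:, None, 0] - seeds[None, :, 0]) ** 2
        d_s2 = ((feats[:, None, 1:] - seeds[None, :, 1:]) ** 2).sum(axis=2)
        labels = (d_c2 + (compactness / step) ** 2 * d_s2).argmin(axis=1)
        # Seed point updating: move each seed to the mean of its members.
        for k in range(len(seeds)):
            members = feats[labels == k]
            if len(members):
                seeds[k] = members.mean(axis=0)
    return labels.reshape(h, w)

# Synthetic image (assumption): dark left half, bright right half.
demo_img = np.zeros((32, 32))
demo_img[:, 16:] = 255.0
demo_labels = slic_gray(demo_img, n_segments=16)
print(len(np.unique(demo_labels)), "super-pixels")
```

Each iteration recomputes the distances between pixels and seeds and then updates the seeds, so the cost grows with the number of pixels and the number of iterations, which is why this step contributes noticeably to the running time.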

4.7 Localization performance analysis of tampering regions

This subsection evaluates the tampered region localization performance of the proposed method. In Fig. 14, the first and second rows show the results of locating tampered regions with the proposed method on the MICC-F220 and MICC-F2000 datasets, respectively. The results show that the proposed method can detect copy-move forgery and mark the forgery feature points accurately. Figure 15 gives the results of tampered region localization by different methods, where a is the forged image, b is the ground truth, c is the result of the CKN [29], d is the result of the BusterNet [28], and e is the result of the proposed method.

Fig. 14 Results of the proposed method on the MICC-F220 and MICC-F2000 datasets

Fig. 15 Results of copy-move forgery region localization by different methods. a Forged image, b ground truth, c results of the CKN [29], d results of the BusterNet [28], e results of the proposed method

As illustrated in Fig. 15, the BusterNet and CKN models, which are based on deep learning, can localize tampered regions because they are trained on real labelled data. Although their outputs contain some noise and misidentified regions, the source/target locations are roughly accurate. The proposed method can also accurately obtain the content information of the tampered region through the tampered feature points, but it does not accurately give the location of the tampered source/target. Therefore, in the future, we will explore combining the technology in this paper with semantic segmentation technology based on deep learning, and study how to achieve a more accurate localization of the source/target.

5 Conclusion

In this paper, an improved two-stage forgery detection method based on a parallel fusion feature and an adaptive threshold generation algorithm is proposed, which includes coarse-grained detection and fine-grained detection. In the coarse-grained detection stage, the SLIC algorithm is used to preprocess an image and to divide the image into irregular super-pixels, solving the problem of local regional correlation attenuation caused by uniform segmentation. In the fine-grained detection stage, a parallel fusion feature is used to enhance the feature expression ability of a local region. To improve the robustness of the proposed method, an adaptive threshold generation algorithm based on the HOG level is designed to generate a suitable threshold conforming to the characteristics of different local regions for the final detection of suspected tampering regions. The proposed method is verified by experiments and compared with the other methods. The experimental results show that the proposed method achieves the highest accuracy among all compared methods and is robust to several common attacks, such as noise addition and brightness change.

However, the resistance of this method to the fuzzy attack can still be improved, which needs further study. Compared with the deep learning methods, the proposed method is still weak in locating the tampered region, and it has difficulty giving the specific coordinates and contours of the tampered region accurately. Thus, it is necessary to combine the proposed method with deep learning methods to achieve accurate detection and localization. In the future, different types of feature fusion and new feature mining should also be explored to further improve the detection capability.