Two-Stream Xception Structure Based on Feature Fusion for DeepFake Detection

DeepFake may have a crucial impact on people’s lives and reduce trust in digital media, so DeepFake detection methods have developed rapidly. Most existing detection methods rely on single-space features (mostly RGB features), and there is still relatively little research on multi-space feature fusion. At the same time, many existing methods use a single receptive field, so the resulting models cannot extract information at different scales. To solve these problems, we propose a two-stream Xception network structure (Tception) that fuses RGB-space features and noise-space features. The network consists of two main parts. The first part is a feature fusion module, which adaptively fuses RGB features with noise-space features generated from the RGB images through SRM filters. The second part is the two-stream network structure, which uses a parallel structure of convolutional kernels of different sizes so that the network can learn features at different scales. Experiments show that the proposed method improves performance compared to the Xception network. Compared to SSTNet, the detection accuracy on Neural Textures is improved by nearly 8%.


Introduction
The rapid development of DeepFake techniques has fueled a sharp increase in forged face images and videos, and the fake images and videos created by these techniques are becoming increasingly realistic. Falsified video and image content raises various disconcerting problems on widespread social media, such as fake news dissemination and fraud. Therefore, there has been an explosive increase in the demand for DeepFake detection methods to counteract its impact [1][2][3].
In fact, DeepFake detection is a challenging classification problem. The most important aspect of DeepFake detection is to find the differences between real and fake images. For this problem, artificial neural networks, especially convolutional neural networks (CNNs), have made outstanding achievements [4][5][6][7]. However, most existing models (see Sect. 2.2) use only RGB images for detection, which limits the amount of information available to the final detector and makes it difficult to exploit information from other domains. At the same time, many existing models use a neural network with a single receptive field to classify images, which makes it difficult to extract information at different scales. Using multiple receptive fields to extract information at different scales has become an important way to improve the ability of a model.
To address the above shortcomings, we propose a two-stream Xception framework (namely Tception). The Tception structure obtains a wider range of receptive fields than the original Xception structure, which can improve network performance. At the same time, to address the problem that existing networks mostly process images in RGB space only, we considered fusing features from the original RGB spatial image with the Fourier-transformed image, taking into account that forgery traces are mostly at the edges of the image. However, it is difficult to use a unified feature fusion method for this, because images in the frequency domain do not have a one-to-one positional correspondence with the spatial image. Therefore, the Tception structure uses SRM (Steganalysis Rich Model) spatial filters to process images, and then uses the feature fusion module to adaptively fuse the RGB-space feature maps with the SRM-space feature maps, so that the network obtains richer features and its performance improves.
The main contributions of this paper can be summarized as follows: (1) We propose a new two-stream Xception (Tception) structure. The Tception structure expands the receptive field of the network to better perceive the nuances between real and fake images, resulting in better results. (2) We design the feature fusion module to fuse the features in RGB space and SRM space. In this way, the network can obtain richer features for discrimination. (3) We combine the feature fusion module and the Tception structure to give the network access to more information. Experiments show that the proposed method has better performance.

DeepFake Datasets
There are many datasets in the field of DeepFake tampering forensics, e.g., UADFV [8], Celeb-DF [9], DFDC [10], and FaceForensics++ [11]. Among them, FaceForensics++ is a popular dataset in the field of DeepFake detection due to its comprehensive video content and its classification according to video quality. Therefore, we conduct our experiments mainly on the FaceForensics++ dataset.
The Celeb-DF dataset contains both real and DeepFake-synthesized videos with video quality similar to that of videos disseminated online. We also completed partial comparison experiments on the Celeb-DF dataset to demonstrate the general applicability of our method.

DeepFake Detection
DeepFake detection methods generally either extract features manually or extract them automatically using deep networks. Manually extracted features are interpretable but often less accurate, while methods using deep networks have improved detection accuracy at the expense of some interpretability. In general, the automatic feature extraction methods based on neural networks perform better than the manual feature extraction methods and are gradually becoming the mainstream methods of DeepFake detection.
Cozzolino et al. [16] proposed a residual-based local descriptor approach that allowed for better performance with fine-tuned networks on small datasets. Bayar and Stamm [17] proposed a method based on deep Siamese CNNs to detect not only the tampering traces but also the kind of tampering taking place. Rahmouni et al. [18] proposed a novel method for classifying computer graphics and real photographic images that integrates statistical feature extraction into a CNN framework and can find the best features for an efficient boundary. Darius Afchar et al. [19] proposed MesoNet, a light-weight network designed specifically for face tampering detection that is able to train better models with a relatively small number of network layers. Chai et al. [20] assembled the Xception network, which has become one of the baselines in the field because of its good performance. These methods use neural networks and achieve better detection results on the DeepFake detection task. However, they only use RGB spatial images for feature extraction, so they can only extract information within a single space, and the discriminator has a relatively limited basis for discrimination.
Fig. 1 Selected video cutouts from the FaceForensics++ dataset. FaceForensics++ contains videos that have been tampered with using different human face tampering methods such as DeepFake (DF), Face2Face (F2F), FaceSwap (FS) and Neural Textures (NT). Each video in turn is available in correspondingly different qualities
There are also many methods based on feature fusion for DeepFake detection. Zhao et al. [21] proposed frequency-aware discriminative feature learning for face forgery detection. Zekun Sun et al. [22] proposed an efficient and robust framework (LRNet) to detect DeepFake videos through temporal modeling of precise geometric features. Yuval Nirkin et al. [23] modified two-stream residual structures, as a new idea for improving networks. Li et al. [24] proposed a fusion of the spatial and frequency domains to perform forgery detection. Although these methods fuse information from different feature spaces and feed the fused information to a discriminator, they use a single convolutional kernel size in the network structure, making it difficult to extract information at different scales.
One of the most inspiring works for ours is the Xception network, a convolutional neural network architecture based entirely on depthwise separable convolutional layers. It is based on the assumption that the mapping of cross-channel correlations and spatial correlations in the feature maps of a convolutional neural network can be completely decoupled. Overall, the Xception architecture is a linear stack of depthwise separable convolutional layers with residual connections. This makes the architecture very easy to define and modify.
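To make the decoupling of spatial and cross-channel correlations concrete, the following minimal sketch compares the number of weights in a standard convolution with the number in a depthwise separable one. The helper names and the choice of 728 channels with a 3 × 3 kernel are illustrative assumptions (728 is the feature dimension quoted later for the middle flow); biases are omitted.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def separable_conv_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) + 1 x 1 pointwise conv."""
    depthwise = c_in * k * k      # spatial correlations, channel by channel
    pointwise = c_in * c_out      # cross-channel correlations, pixel by pixel
    return depthwise + pointwise


standard = conv_params(728, 728, 3)
separable = separable_conv_params(728, 728, 3)
print(standard, separable, round(standard / separable, 1))
```

Under these assumptions the separable layer needs roughly one ninth of the weights, which is part of why adding a second separable stream remains affordable.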

Overall Framework
The proposed overall framework is shown in Fig. 2. First, we process the RGB-space image with the SRM filters to produce a noise-space image (Sect. 3.2). Second, we use point convolution to fuse the RGB-space information with the SRM noise-space information (Sect. 3.3). Finally, we feed the fused information to the proposed Tception structure (Sect. 3.4) to infer whether the input image is real or fake.

SRM Noise Space
Sometimes, RGB channels are not sufficient to handle all the different tampering situations. Since forged faces often produce differences at the edges of the face, detecting images that have been carefully modified after tampering is a challenge for the RGB stream.
Previous research found that most of the artifacts produced by forged faces are high-frequency noise at the pixel level. RGB spatial features are highly correlated with image content, and we can effectively compensate for this disadvantage by extracting the high-frequency noise components of the image (noise residuals) rather than its content.
SRM (steganalysis rich model) [25] filters were proposed to collect low-level noise features: the method quantizes and truncates the output of these filters and extracts nearby co-occurrence information as the final features. SRM has become a common method for extracting noise features.
Inspired by [24], we exploit the local noise distribution of the image to provide additional features. Compared to RGB streams, noise streams are designed to focus more on noise rather than semantic image content, which gives the possibility to construct more sophisticated forgery detectors. We use SRM filters to extract local noise features from RGB images as the input for noise streams.
Fig. 2 Overview of our method. Our method is divided into two parts: first, the information in the corresponding noise space is obtained through the SRM filter, and the information in RGB space is fused with the information in SRM noise space by point convolution; subsequently, the fused information is fed into our proposed Tception structure, a network of Xception-like structures containing multiple receptive fields, for the final inference
Our aim is to extract high-frequency noise from the image. Because we are unsure of the exact noise pattern left by the different tampering methods, we choose a set of filters that extract only the high-frequency noise at the edges of the image (the convolution kernel weights sum to 0) and are relatively symmetrical. The weights of our filters are shown in Eq. (1). Each of the 3 channels of the RGB image is passed through the 3 filters. Each channel thus produces a corresponding 3-channel feature map, so the SRM transformation yields a 9-channel feature map in total. These feature maps emphasize local noise rather than image content and clearly reveal traces of tampering that may not be visible in the RGB channels.
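The channel bookkeeping described above (3 RGB channels, 3 filters each, 9 noise maps) can be sketched as follows. The three zero-sum kernels are illustrative stand-ins chosen for this example; they are not the weights of Eq. (1), and the zero padding and the helper names `filter2d` and `srm_noise_maps` are likewise assumptions.

```python
# Illustrative zero-sum high-pass kernels (assumed; NOT the weights of Eq. (1)).
KERNELS = [
    [[0, 0, 0], [1, -2, 1], [0, 0, 0]],        # horizontal second difference
    [[0, 1, 0], [0, -2, 0], [0, 1, 0]],        # vertical second difference
    [[-1, 2, -1], [2, -4, 2], [-1, 2, -1]],    # 3 x 3 square residual
]


def filter2d(channel, kernel):
    """Correlate one 2-D channel with a k x k kernel using zero padding."""
    h, w = len(channel), len(channel[0])
    r = len(kernel) // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + r][dx + r] * channel[yy][xx]
            out[y][x] = acc
    return out


def srm_noise_maps(rgb):
    """rgb: list of 3 channels -> list of 9 noise maps (3 kernels per channel)."""
    return [filter2d(ch, k) for ch in rgb for k in KERNELS]


flat = [[7] * 5 for _ in range(5)]             # constant image: pure "content"
maps = srm_noise_maps([flat, flat, flat])
print(len(maps), maps[0][2][2])                # interior response of a flat patch
```

Because every kernel sums to 0, a region of constant intensity produces no interior response, which is the sense in which such filters suppress image content and keep only local residuals.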

Fusion of RGB Space and SRM Noise Space
The difference between a real face and a fake face lies mainly in the edge region of the face. As mentioned in the previous section, the RGB space alone carries limited feature information, while the noise space can better highlight the edge information of the image. Therefore, we propose a feature fusion module that adaptively fuses RGB features and SRM-space features.
We first apply the SRM filters to each channel of the RGB image to produce a 9-channel feature map. If the feature map in RGB space and the feature map in noise space are directly concatenated, the network may pay more attention to the feature map in RGB space because the RGB space carries more information. Therefore, we transform the 9-channel feature map in SRM space into a 3-channel feature map using point convolution. We then perform a point convolution operation on each channel of the 3-channel feature map in SRM space and the corresponding channel of the 3-channel feature map in RGB space. The output is still a 3-channel feature map, as shown in Fig. 3.
The 3-channel feature map contains the information from the RGB space and the noise space, and we use this image as the input to the neural network.
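The two point-convolution steps of the fusion module can be sketched as below. This is one possible reading of the description, assuming that channel c of the RGB image is mixed only with channel c of the reduced SRM map; the weights would be learned in practice and are fixed here purely for illustration, and `pointwise_conv` and `fuse` are hypothetical helper names.

```python
def pointwise_conv(maps, weights):
    """1 x 1 convolution: each output map is a weighted sum of the input maps.

    maps:    list of C_in 2-D maps of equal size
    weights: C_out rows of C_in scalars (learned in practice, fixed here)
    """
    h, w = len(maps[0]), len(maps[0][0])
    return [
        [[sum(wt * m[y][x] for wt, m in zip(row, maps)) for x in range(w)]
         for y in range(h)]
        for row in weights
    ]


def fuse(rgb, srm9, w_reduce, w_fuse):
    """Reduce 9 SRM maps to 3, then mix each with its matching RGB channel."""
    srm3 = pointwise_conv(srm9, w_reduce)                   # 9 -> 3 channels
    fused = []
    for c in range(3):
        pair = [rgb[c], srm3[c]]                            # same-index channels
        fused.append(pointwise_conv(pair, [w_fuse[c]])[0])  # 2 -> 1 channel
    return fused


# Toy 1 x 2 "images": three RGB channels and nine SRM maps.
rgb = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
srm9 = [[[float(i), float(i)]] for i in range(9)]
# Assumed weights: output channel c sums SRM maps 3c..3c+2; fusion averages.
w_reduce = [[1.0 if i // 3 == c else 0.0 for i in range(9)] for c in range(3)]
w_fuse = [[0.5, 0.5]] * 3
fused = fuse(rgb, srm9, w_reduce, w_fuse)
print(fused)
```

Fusing through learned 1 × 1 weights rather than plain concatenation lets the network set the relative contribution of the two spaces itself, which is the motivation given in the text for not concatenating directly.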

Tception Structure
In the Tception structure, we add a 5 × 5 separable convolutional stream to the original Xception structure module, as shown in Fig. 4. In this way, the problem of the insufficient receptive field of the original Xception can be effectively alleviated. At the same time, we retain the residual structure of the original Xception, thus avoiding the vanishing-gradient problem caused by an overly large and deep network. The residual structure also helps retain the integrity of the information. Specifically, the fused feature map is transformed by two convolutional layers to produce a 64-dimensional feature map $X$. This feature map $X$ is fed into a block of the entry flow, which uses two separable convolutional flows of 3 × 3 and 5 × 5 and a 1 × 1 convolutional flow, and the three resulting feature maps are summed, as shown in Eq. (2):
$$ Z_1 = F_1(X) + F_3(X) + F_5(X) \tag{2} $$
where $F_1(\cdot)$ is the result of a 1 × 1 convolution, $F_3(\cdot)$ is the result of a 3 × 3 convolution flow and $F_5(\cdot)$ is the result of a 5 × 5 convolution flow. In addition, $Z_1$ is the output of the entry flow, which is a 728-dimensional feature map.
Fig. 3 Feature fusion module. The RGB space feature map is first filtered with three SRM filters to transform it into a 9-channel feature map in SRM space, then it is transformed into a 3-channel feature map using point convolution. Then, a point convolution operation is performed on the 3-channel feature map in SRM space and the corresponding channel of the 3-channel feature map in RGB space. The output is a 3-channel feature map with fused features
Next, the output $Z_1$ enters the middle flow residual block. Unlike the block in the entry flow, the residual […] where $Z_2$ is the output of the middle flow, which is a 728-dimensional feature map. Finally, in the exit flow, after a block similar to the block in the entry flow, two separable convolution layers are applied to obtain 2048-dimensional features. After average pooling, these features are used for classification.
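The effect of summing a 1 × 1 branch, a 3 × 3 branch and a 5 × 5 branch, as described for the entry-flow block, can be illustrated with a toy one-dimensional, single-channel analogue. Everything here is a simplification for illustration: real Tception blocks use learned separable 2-D convolutions, and the all-ones kernels and helper names are assumptions.

```python
def conv1d_same(signal, kernel):
    """1-D correlation with zero padding so the output keeps the input length."""
    r = len(kernel) // 2
    n = len(signal)
    return [
        sum(kernel[j + r] * signal[i + j]
            for j in range(-r, r + 1) if 0 <= i + j < n)
        for i in range(n)
    ]


def tception_like_block(x, k1, k3, k5):
    """Sum of a 1-tap shortcut, a 3-tap branch and a 5-tap branch."""
    f1 = conv1d_same(x, k1)
    f3 = conv1d_same(x, k3)
    f5 = conv1d_same(x, k5)
    return [a + b + c for a, b, c in zip(f1, f3, f5)]


x = [0, 0, 0, 1, 0, 0, 0]                       # unit impulse
z = tception_like_block(x, [1], [1, 1, 1], [1, 1, 1, 1, 1])
print(z)                                        # footprint of the summed branches
```

The impulse response spreads over five positions only because of the 5-tap branch, while the 1-tap and 3-tap branches keep more weight near the centre; this is the one-dimensional picture of mixing receptive fields of different sizes in a single block.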
It is worth mentioning that our proposed Tception structure does not use the ReLU activation function of the original structure. Instead, we use the GELU activation function, which adopts the idea of stochastic regularization. Compared to the ReLU function, the GELU function has a non-zero gradient in the negative region, thus avoiding the problem of dead neurons. In addition, GELU is smoother around 0 than ReLU, so training converges more easily.
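The claim about the negative region can be checked numerically. The sketch below uses the exact form GELU(x) = x·Φ(x), where Φ is the standard normal CDF; whether the implementation used this exact form or the common tanh approximation is not stated in the text, so the choice here is an assumption.

```python
import math


def relu(x):
    """ReLU: zero output (and zero gradient) for every negative input."""
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x):
    """d/dx [x * Phi(x)] = Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


for x in (-2.0, -1.0, 0.0, 1.0):
    print(x, relu(x), round(gelu(x), 4), round(gelu_grad(x), 4))
```

For negative inputs ReLU returns 0 with zero slope, whereas the GELU gradient is small but non-zero, so a unit that currently receives negative pre-activations can still be updated.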

Datasets
We use the HQ (c23) and LQ (c40) videos of the four tampering methods DeepFake (DF), Face2Face (F2F), FaceSwap (FS) and Neural Textures (NT) in the FaceForensics++ dataset, after pre-processing, to produce the dataset. We also complete partial comparison experiments on the Celeb-DF dataset to demonstrate the general applicability of our method.

Data Pre-processing
First, the videos in the dataset are sampled every 16 frames to convert the video information into image information. Next, the 64 feature points of the face in the image are identified using the Dlib library, and the face region is cropped using these 64 feature points. The specific processing method is shown in Fig. 5.
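The two pre-processing steps can be sketched without any video or landmark library. The functions below only compute which frames are kept and which crop box a set of landmark coordinates implies; starting the sampling at frame 0 and the optional `margin` parameter are assumptions not specified in the text, and the landmark detector itself is left out.

```python
def sampled_frame_indices(num_frames, step=16):
    """Indices of the frames kept when sampling one frame every `step` frames."""
    return list(range(0, num_frames, step))


def face_box(landmarks, width, height, margin=0):
    """Axis-aligned box around (x, y) landmarks, clipped to the image bounds.

    `margin` (pixels) is a hypothetical extra border, not specified in the text.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    left = max(0, min(xs) - margin)
    top = max(0, min(ys) - margin)
    right = min(width, max(xs) + margin)
    bottom = min(height, max(ys) + margin)
    return left, top, right, bottom


print(sampled_frame_indices(50))
print(face_box([(10, 20), (30, 5), (25, 40)], width=100, height=100))
```

In a full pipeline the landmark list would come from the face landmark detector, and the returned box would be used to slice the frame before it is passed to the network.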

Implementation Detail
The proposed model is implemented using the PyTorch framework and trained using the Adam optimizer (with default parameters). The learning rate is set to 0.001. An NVIDIA Tesla V100 GPU is used for the experiments. In our experiments, we use the cross-entropy loss.
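For readers who want to see what these training settings compute, the following dependency-free sketch spells out the softmax cross-entropy for one sample and a single Adam update of one scalar parameter. The hyperparameters β1 = 0.9, β2 = 0.999 and ε = 1e-8 are PyTorch's documented Adam defaults; the function names and the scalar setting are illustrative assumptions, not the training code of the paper.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample; `target` is the true class index."""
    m = max(logits)                                   # subtract max for stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad                # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad         # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                      # bias correction
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


loss = cross_entropy([0.0, 0.0], target=0)            # undecided real/fake logits
p, m1, v1 = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(round(loss, 6), round(p, 6))
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by almost exactly the learning rate regardless of the gradient's magnitude.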

Comparative Experiments
We conduct experiments on the FaceForensics++ dataset and use accuracy (ACC) as the evaluation metric. We conduct experiments at the two compression rates c23 and c40. The final experimental results are shown in Tables 1 and 2.
Our method improves detection of the F2F, FS and NT methods on HQ quality. Our method offers a nearly 8% improvement in NT forgery detection compared to SSTNet, with similar detection accuracy for the other forgery methods on LQ quality. We also use AUC as the evaluation metric. The results of the experiment are shown in Tables 3 and 4.
To further validate the robustness of the proposed model, we test our model on a dataset with a mixture of the four tampering methods. We still use the FaceForensics++ dataset, where the real faces are kept constant and each of the four forgery methods accounts for about 1/4 of the forged faces. The results of the experiment are shown in Table 5.
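The two metrics can be stated precisely in a few lines. The sketch below computes ACC with a fixed decision threshold and AUC as the probability that a randomly chosen positive (fake) sample scores higher than a randomly chosen negative (real) one; the 0.5 threshold, the label convention and the toy scores are assumptions for illustration.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]                 # 1 = fake, 0 = real (assumed convention)
scores = [0.9, 0.4, 0.6, 0.1]         # predicted probability of "fake"
print(accuracy(labels, scores), auc(labels, scores))
```

The example shows why both metrics are reported: ACC depends on the chosen threshold, while AUC only depends on how well the scores rank fake samples above real ones.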
Our method offers a nearly 3% improvement compared to Xception on the mixed dataset.
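The construction of the forged half of the mixed dataset, with roughly a quarter of the samples from each forgery method, can be sketched as follows. The function name, the fixed seed and the use of integer placeholders for images are assumptions; the text does not describe how the samples were drawn.

```python
import random


def build_mixed_fakes(fakes_by_method, total, seed=0):
    """Draw about total/len(methods) forged samples from each forgery method."""
    rng = random.Random(seed)
    per_method = total // len(fakes_by_method)
    mixed = []
    for method, samples in fakes_by_method.items():
        mixed.extend((method, s) for s in rng.sample(samples, per_method))
    rng.shuffle(mixed)
    return mixed


# Integer ids stand in for face crops of each forgery method.
fakes = {name: list(range(100)) for name in ["DF", "F2F", "FS", "NT"]}
mixed = build_mixed_fakes(fakes, total=40)
print(len(mixed), sorted({m for m, _ in mixed}))
```

Keeping the real faces fixed and balancing the four forgery types in this way prevents any single method from dominating the reported accuracy on the mixed set.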
We also conduct experiments on the Celeb-DF dataset. The results are shown in Table 6. Our method has similar results compared to Xception.
Through comparative experiments, we find that our proposed method achieves a degree of improvement in both ACC and AUC compared to other methods when detecting images generated by different tampering methods, and that our method performs better on the mixed dataset, reflecting the better robustness of our proposed method.
Cozzolino et al. [16]: 78.45
Bayar and Stamm [17]: 82.97
Rahmouni et al. [18]: 79.08
MesoNet [19]: 83.10
Xception [22]: 84.11
Tception (ours): 87.61

Ablation Study
In this section, we perform a number of ablation studies to better understand the contribution of each component in our Tception structure. We set up the following experimental groups. X denotes Xception without the feature fusion module. XF denotes Xception with the feature fusion module. T denotes Tception without the feature fusion module. TF denotes Tception with the feature fusion module. The specific experimental setup is shown in Table 7. The experimental results are shown in Tables 8 and 9.
We find that the structure containing both the feature fusion module and the two-stream structure achieves the highest accuracy in most cases. Whichever part is removed, performance decreases to varying degrees, which supports the rationality of our method.
We also find that the two-stream network module is more effective than the feature fusion module in improving the original model. A probable reason is that the SRM-space information fused by the feature fusion module is obtained by transforming the RGB-space information. It is therefore a kind of enhancement of the RGB-space information that comes from the same source, so the additional information provided to the network may still be limited. In contrast, the parallel structure of convolutional kernels with multiple receptive fields can extract information at different scales, which is relatively more beneficial to the algorithm.
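The four ablation groups form a 2 × 2 grid over two switches, which can be written down explicitly. The flag names `two_stream` and `feature_fusion` are hypothetical labels for this sketch, not identifiers from the original implementation.

```python
from itertools import product


def ablation_variants():
    """Map each variant name to its (two_stream, feature_fusion) switches."""
    names = {(False, False): "X", (False, True): "XF",
             (True, False): "T", (True, True): "TF"}
    return {names[flags]: {"two_stream": flags[0], "feature_fusion": flags[1]}
            for flags in product([False, True], repeat=2)}


for name, cfg in ablation_variants().items():
    print(name, cfg)
```

Reading the results along one switch at a time (X versus T, X versus XF) isolates the contribution of the two-stream structure and of the fusion module respectively.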

Comparison Experiments with Different Receptive Fields
The results of our experiments using different combinations of receptive fields (convolutional kernel sizes) on the FF++ mixed dataset are shown in Table 10.
Table 10 shows that the Tception network consisting of two branches with 3 × 3 and 5 × 5 convolutional kernels is the best overall for face forgery detection. At the same time, we find that the overall network performance may not be satisfactory when the sizes of the convolutional kernels differ significantly. One possible reason is that padding is often required after convolution in order to unify the size of the feature maps. For networks whose two branches use kernels of very different sizes, the areas filled by the padding operation also differ considerably, so the same feature may end up at different positions in the two feature maps. A feature shift may then occur when the Add operation is performed, reducing the detection performance of the network.
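The amount of padding each branch needs before its output can be added to the other branch follows from the standard convolution output-size formula. The sketch below assumes stride 1, odd kernel sizes and symmetric zero padding; the 64-pixel input size is an arbitrary example value.

```python
def same_padding(kernel_size):
    """Zero padding per side that keeps the spatial size for stride 1, odd kernels."""
    return (kernel_size - 1) // 2


def conv_output_size(size, kernel_size, padding, stride=1):
    """Standard convolution output-size formula."""
    return (size + 2 * padding - kernel_size) // stride + 1


for k in (3, 5, 7, 9):
    p = same_padding(k)
    print(k, p, conv_output_size(64, k, p))
```

A 3 × 3 branch pads one pixel per side while a 9 × 9 branch pads four, so the larger the gap between kernel sizes, the larger the difference in artificially filled border area between the branches being summed.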

Comparison Experiments with Different High-Pass Filters
We conduct experiments on the FF++ mixed dataset using different high-pass filters and the results are shown in Table 11.
We find that each high-pass filter contributes to the performance improvement. The SRM filter works relatively well.

Comparison Experiments with Different SRM Filters
We conduct experiments on the FF++ mixed dataset using different SRM filters and the results are shown in Table 12.
In Table 12, SRM_1 and SRM_2 use SRM filters with a single convolutional kernel size, and SRM_3 and SRM_4 use SRM filters with different convolutional kernel sizes. The specific filter weights are as follows.
The weights of SRM_1, SRM_2, SRM_3 and SRM_4 are shown in Eqs. (4), (5), (6) and (7), respectively. We find that the filter combination we used is slightly better than the other filter combinations. We also find that using filters with different convolutional kernel sizes is generally better than using filters with a single convolutional kernel size. The reason for this may be that filters with different convolutional kernel sizes can extract features of different fineness.

Comparison Experiments with Different Activation Functions
We conduct experiments on the FF++ mixed dataset using different activation functions and the results are shown in Table 13.
In Table 13, we find an improvement of about 2.5% on the FF++ mixed dataset when using the GELU activation function compared to the ReLU activation function.

Visualization of Result
To further demonstrate the validity of our model, we show CAM heat maps for a subset of the test samples to investigate what the neural network bases its decisions on, and the results are shown in Fig. 6.
Through visual analysis, we find that our model activates a wider range of features than the original Xception, which gives its decisions a better-founded and more effective basis.
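As a reminder of what such a heat map computes, the sketch below follows the original class activation mapping formulation, in which the map for a class is the channel-wise weighted sum of the last convolutional feature maps, using the classifier weights of that class. The text does not say which CAM variant was used, so this choice, the min-max normalisation and the toy 2 × 2 feature maps are all assumptions.

```python
def class_activation_map(feature_maps, class_weights):
    """CAM: weighted sum over channels of the last convolutional feature maps."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[sum(wk * fm[y][x] for wk, fm in zip(class_weights, feature_maps))
            for x in range(w)] for y in range(h)]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    scale = (hi - lo) or 1.0                      # avoid division by zero
    return [[(v - lo) / scale for v in row] for row in cam]


# Two toy 2 x 2 feature maps and the classifier weights of the "fake" class.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 2.0]]]
cam = class_activation_map(feature_maps, class_weights=[1.0, 0.5])
print(cam)
```

In practice the normalised map is upsampled to the input resolution and overlaid on the face crop, so that regions with values near 1 mark where the evidence for the predicted class is concentrated.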

Conclusions
We propose a Tception structure that builds on Xception and expands the receptive field of the network by adding convolutional kernels of different sizes. At the same time, RGB streams and noise streams are used to learn rich features for image tampering detection. We extract noise features through an SRM filter layer and fuse them adaptively with features in RGB space, retaining the RGB-space features and introducing noise-space features to achieve better results. The experiments show that our proposed method has improved performance compared to the Xception network. Compared to SSTNet, the detection accuracy on Neural Textures is improved by nearly 8%. In the future, we will continue to investigate other feature fusion methods and carry out related work in other more complex cases (e.g., higher compression rates).
It is worth noting that our proposed method does not perform best on the data generated by every forgery method, and the generality of the model still needs to be improved.

Fig. 4 Tception structure we proposed. Similar to the Xception structure, our proposed Tception structure is still divided into Entry Flow, Middle Flow and Exit Flow. We keep the separable convolution and residual structure of the Xception structure. Different from

Fig. 5 Data pre-processing. First, we sample the frames of the video to obtain images from the video. Then, we crop the image with the aid of Dlib face detection to obtain an image containing only the face

Table 1
Comparative experiments. Comparison of our model on HQ videos. The evaluation metric is ACC. Bold represents the best result

Table 2
Comparative experiments. Comparison of our model on LQ videos. The evaluation metric is ACC. Bold represents the best result

Table 3
Comparative experiments. Comparison of our model on HQ videos. The evaluation metric is AUC. Bold represents the best result

Table 5
Experiments on the mixed dataset

Table 8
Results (HQ) of the ablation study. The evaluation metric is ACC. Bold represents the best result

Table 9
Results (LQ) of the ablation study

Table 13
Comparison experiments with different activation functions. The evaluation metric is ACC. Bold represents the best result