1 Introduction

Deepfake is a deep learning-based approach that superimposes the face of a target person onto that of a person in a video, creating footage of the target apparently doing or saying things the original subject did or said. Deepfake methods cause harm because they can be used to defame celebrities, sow confusion and chaos in financial markets through fake news, and mislead individuals. The first deepfake video appeared in 2017, when a Reddit user swapped celebrities' faces into pornographic videos. Since then, several methods have been introduced to detect deepfake videos.

The technologies used to edit pictures, videos, and voice-overs are developing rapidly, and the techniques and technical know-how for creating and manipulating digital content are easily accessible. It is now possible to create hyper-realistic digital pictures using a few tools and straightforward how-to guides that are freely available online [9]. Deepfake technology aims to create convincing fake videos that are difficult to distinguish from real ones. While this technology has potential legitimate uses, it also poses significant challenges, enabling the spread of misinformation and other forms of malicious content. As the use of deepfakes continues to increase, effective detection methods are needed to protect society as a whole [27].

Since deepfakes began to spread, many researchers have studied deepfake algorithms to search for weaknesses in them. However, existing solutions tend to fail as current algorithms improve, as new algorithms emerge, when large amounts of data are unavailable, or depending on the data processing stage applied [34].

The application of machine learning (ML) methods across various disciplines has been escalating, underscoring their adaptability and efficiency. Notably, these methods have been instrumental in medical diagnostics, particularly in the early detection of chronic kidney disease [3] and the enhancement of heart disease prediction algorithms [4]. Beyond the medical field, numerical and computational methodologies have seen extensive utilization in addressing intricate mathematical challenges. This is exemplified in the resolution of convection-diffusion equations [20], the investigation of fractional Stokes problems [22], and the approach to solving time-dependent partial differential equations [21]. These applications highlight the broad scope and transformative impact of ML and computational techniques in both scientific research and practical problem-solving.

Deep learning (DL) methods have been used to detect fake videos effectively and efficiently [24]; the large scale and high dimensionality of deepfake video data are the main reasons DL methods achieve strong results. Social media has developed rapidly, and users rely on platforms such as WhatsApp, Twitter, Facebook, and YouTube for the latest updates, so these platforms must isolate fake videos and misleading information from massive volumes of user content [18]. There is a real risk that such manufactured videos will be shared and disseminated across social media platforms [11]. Work in this area faces multiple challenges, including (i) selecting the most important features, (ii) handling videos with high heterogeneity and dimensionality, and (iii) choosing the proper DL model [17].

One of the best-known DL methods is the convolutional neural network (CNN), which is widely used because it automatically extracts low- and high-level features from datasets; these methods have therefore attracted researchers' interest worldwide [24]. Li et al. [17] utilized different CNN structures: MesoInception4, InceptionV3, ResNet50, GoogLeNet, XceptionNet, Meso4, FWA-based Dual Spatial Pyramid, and VGG19-based CapsuleNet. These structures were trained on various deepfake datasets and tested on the Celeb-DF dataset. Kumar et al. [15] extracted facial regions from the video frames of the Celeb-DF dataset using a multitask CNN and then applied the XceptionNet architecture. Wodajo et al. [25] extracted facial regions using three DL face detection techniques: face recognition, multitask CNN, and BlazeFace; a set of convolution blocks was then used as a feature extractor, followed by a vision transformer relying on an attention mechanism to detect deepfake videos.

This research introduces a DL-based method for detecting deepfakes. The proposed system comprises three components: preprocessing, detection, and prediction. The preprocessing step includes frame extraction, face detection, face alignment, face cropping, eye cropping, and nose cropping. In the detection step, we use CNN-based architectures for eye and nose feature detection, and a CNN combined with a vision transformer for whole-face detection. In the prediction component, we apply a majority voting approach, merging the outputs of the three models applied to the three different features, which yields three individual predictions that are combined into one.

1.1 Motivations

The rapid evolution of deepfake technology presents a critical challenge in distinguishing authentic visual content from sophisticated fakes. As these manipulated videos infiltrate online platforms, the risk of misinformation dissemination and societal discord escalates. The pervasive nature of deepfakes, especially on social media, undermines trust and integrity in digital media, necessitating urgent intervention. The escalating threat of deepfake proliferation demands innovative solutions capable of combating misinformation. Leveraging advanced DL techniques like convolutional neural networks (CNNs) and convolutional vision transformers (CVTs), this research endeavors to devise an effective methodology to detect deepfakes. By integrating these cutting-edge technologies, organizations can enhance their capabilities in identifying and mitigating the risks posed by deepfake content. Investing in the development and implementation of deepfake detection systems utilizing CNNs and CVTs is crucial for staying ahead in the rapidly evolving digital landscape, particularly for sectors vulnerable to deepfake-related risks such as media, politics, and finance. The quest for reliable techniques is pivotal to safeguarding information integrity, bolstering trust in visual media, and strategically managing digital risks.

1.2 Contributions

The main contributions of our work are:

  • A combined framework of three models, each of which detects deepfakes from one of the following regions: the entire face, the eyes, and the nose.

  • Developing a customized data preprocessing stage for each model to detect deepfakes reliably, avoid the limitations of any single detection algorithm, and identify deepfakes produced in various settings, environments, and orientations.

  • We train our models on a wide variety of face images using the FaceForensics++ and DFDC datasets.

  • A comparison with different DL methods used in detecting deepfakes is presented in terms of accuracy, precision, recall, and F-measure.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the main concepts of the methods used, the CNN and the vision transformer. Section 4 describes the methodology, including data preprocessing and model development. Section 5 presents the experiments conducted to evaluate the proposed approach, and Sect. 6 discusses the experimental results. Section 7 outlines managerial implications. Finally, Sect. 8 concludes the paper and discusses future work.

2 Related work

The emergence of deepfake videos as a significant threat to online security and privacy has spurred considerable research into developing methods for detecting them. This section reviews some of the most relevant work in this field [35].

One of the notable studies in deepfake technology was conducted by Nguyen et al. [23], which provides a comprehensive overview of deepfake technology. They explored the risks posed by deepfakes and focused on surveying the algorithms used in their creation and the methods employed for detecting them. The paper examined the challenges, research trends, and future directions in deepfakes. By analyzing state-of-the-art deepfake detection methods and reviewing the background of deepfakes, the study offered valuable insights into the current landscape of deepfake technology. The paper’s primary objective was to facilitate the development of more robust methods for addressing the increasing sophistication and prevalence of deepfakes. By understanding the advancements in deepfake detection, researchers and practitioners can work toward effectively countering the threats associated with this technology.

Li et al. [16] discussed the challenges posed by AI-generated fake face videos and the need for effective detection methods. They highlighted the ease of creating and spreading manipulated videos due to advancements in camera technology and the popularity of social networks. They specifically focused on the emergence of deepfake, a technique that uses generative adversarial networks (GANs) to create realistic fake videos by replacing human faces. Traditional forensic methods face difficulties detecting AI-generated fake face videos, prompting them to propose a novel forensic method based on detecting the absence of physiological signals, such as eye blinking. They introduced a DL model combining a CNN with a recurrent neural network (RNN) to capture the temporal regularities of eye blinking. The long-term recurrent convolutional neural network (LRCN) method leverages previous temporal knowledge to predict eye states accurately. The evaluation of the method on benchmark eye-blinking detection datasets shows promising results: LRCN outperforms CNN and EAR (eye aspect ratio) methods, achieving a higher accuracy of 0.99 compared with CNN's 0.98 and EAR's 0.79. While CNN performs well within individual frames, it lacks temporal knowledge, making it sometimes less reliable. LRCN, with its consideration of long-term dynamics, provides smoother and more accurate predictions, even in challenging scenarios.

Karandikar et al. [12] addressed the significant issue of deepfakes, realistic but deceptive images and videos created using artificial intelligence. These deepfakes pose risks such as spreading false information, political bias, defamation, and piracy. The paper focused on detecting face manipulations, specifically the expression and identity swaps commonly used in deepfakes. The proposed method trains a classifier on video frames, which undergo face extraction and alignment to address faults introduced during deepfake creation: face extraction captures the relevant area, while face alignment adjusts for different head positions. The classifier uses a fine-tuned convolutional model based on the VGG-16 architecture with additional layers. Dataset preprocessing involved face alignment and extraction, enhancing the data for training, and transfer learning was employed to leverage learned features for temporal analysis and improve deepfake detection. The model achieves an accuracy of approximately 70% based on image analysis features. The paper also discussed training challenges, such as low-resolution images and compression artifacts: the model performs well with low-resolution images, though enhancing the dataset's resolution remains an area for improvement, and compression artifacts are addressed through temporal analysis techniques to mitigate errors during learning.

Karandikar et al. [13] discussed the prevalence of deepfake videos and the potential harm they can cause, including the spread of fake news and misinformation. They focused on detecting deepfake videos using residual neural network (ResNet50) and long short-term memory (LSTM) models. Deepfakes were created using GANs, where a generator network produces fake data, and a discriminator network distinguishes between real and fake data. Various techniques can detect flaws in deepfake videos, such as phoneme-viseme mismatches, appearance analysis, eye-blinking patterns, and facial artifact analysis. The proposed approach utilized a learning-based method, where a model was trained to learn features from natural and fake videos. The dataset was preprocessed to extract faces at the frame level, and ResNet50 was used for feature extraction. LSTM was then employed to handle the sequential nature of video frames. The Softmax function was used to classify videos as genuine or fake. The model architecture consists of ResNet50 for feature extraction, LSTM for sequence processing, and Softmax for video classification. The trained model achieves high accuracy on both the training and validation sets.

Wodajo and Atnafu [30] developed a deepfake detection framework employing a convolutional vision transformer (CViT) architecture, demonstrating significant efficacy in their approach. Their model, trained on a comprehensive dataset comprising both manipulated and authentic videos, achieved a noteworthy accuracy of 98.5%. While their results are impressive, it is important to note that their methodology primarily focuses on analyzing entire facial regions. This approach differs from ours, which extends beyond the whole face to include detailed examinations of the eye and nose regions. Additionally, the CViT architecture, while effective, necessitates substantial computational resources, including a high-performance GPU, for efficient model training. This requirement could potentially limit the applicability of their framework in resource-constrained environments.

Yang et al. [33] presented a novel view by formulating deepfake identification as a graph classification problem in which each facial region corresponds to a vertex. However, highly redundant relational information hinders the expressiveness of such graphs. Motivated by the success of masked modeling, they proposed masked relation learning, which reduces redundancy to learn informative relational features: a relation learning module masks partial correlations between regions and then propagates relational information across regions to detect abnormalities from a global graph view.

Unlike the studies above, this work seeks an optimal deepfake detection model that outperforms previous works in the literature. To that end, we deployed and compared several DL models.

3 Background

This section introduces the main concepts of the methods used, CNN and vision transformer.

3.1 Convolutional neural network

CNN is a DL algorithm commonly used in computer vision tasks, such as image recognition and object detection [10].

CNNs are designed to automatically learn and extract relevant features from input data, particularly images [29]; their design is inspired by the organization and functioning of the visual cortex in animals. The critical component of a CNN is the convolutional layer, which performs convolution operations on the input data using a set of learnable filters, or kernels [2].

The convolution layer applies these filters across the input data to detect patterns and features at different spatial locations. It captures local dependencies and spatial hierarchies, allowing the network to learn complex representations of the input images. Pooling layers are often used in CNNs to reduce the spatial dimensions and extract the most relevant information [5, 32]. Figure 1 shows the CNN architecture.

CNNs also consist of fully connected layers responsible for making predictions based on the learned features. These layers take the output of the convolutional layers, flatten it, and pass it through one or more fully connected layers, ultimately producing the final classification or regression output [7].
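For concreteness, the following minimal NumPy sketch (ours, for illustration only) shows the two core operations described above: a valid 2-D convolution with a hand-crafted edge filter, followed by a ReLU nonlinearity and non-overlapping max pooling. The input, filter, and sizes are arbitrary toy choices.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as used in CNNs)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling that halves each spatial dimension."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(8, 8)                              # toy grayscale input
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])                   # vertical-edge filter
feature_map = np.maximum(conv2d(image, edge_kernel), 0)   # convolution + ReLU
pooled = max_pool2d(feature_map)                          # (6, 6) -> (3, 3)
print(feature_map.shape, pooled.shape)
```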

Fig. 1 An overview of CNN architecture

3.2 Vision transformers

The vision transformer (ViT) model architecture, introduced in [6], extends the Transformer architecture of [28] to the image domain. Developed by the Google Research Brain Team, the ViT model adapts the Transformer implementation to handle image data.

In the ViT model, an input image is divided into fixed-size patches that serve as visual tokens. These visual tokens are embedded into fixed-dimensional encoded vectors, and the position information of each patch is likewise embedded and combined with the encoded vectors. The transformer encoder network then processes this combined representation much as it would a sequence of text tokens [31].

The ViT encoder consists of multiple blocks, each composed of layer normalization, multi-head self-attention, and multilayer perceptron (MLP) components. Layer normalization stabilizes the training process and allows the model to adapt to variations among training images. The multi-head self-attention network generates attention maps from the embedded visual tokens, helping the network focus on the most important regions in the image. The MLPs serve as a two-layer classification network, and the final MLP block, known as the MLP head, serves as the transformer's output. Applying softmax to this output provides classification labels, such as in image classification [19].

The architecture of a ViT involves the sequential processing of visual tokens through the ViT encoder blocks, ultimately leading to the final MLP head for classification. Figure 2 shows the ViT architecture.
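As an illustration of the encoder block just described, the following PyTorch sketch (our own, not the paper's implementation) wires layer normalization, multi-head self-attention, and an MLP into a pre-norm residual block. The dimensions match the Model C hyper-parameters given later (512-dimensional tokens, 8 heads, 2048-dimensional MLP), but any consistent values would do.

```python
import torch
import torch.nn as nn

class ViTEncoderBlock(nn.Module):
    """One ViT encoder block: LayerNorm -> multi-head self-attention ->
    LayerNorm -> MLP, each wrapped in a residual connection (pre-norm form)."""
    def __init__(self, dim=512, heads=8, mlp_dim=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, x):                                   # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        x = x + self.mlp(self.norm2(x))                     # MLP + residual
        return x

tokens = torch.randn(1, 65, 512)          # e.g. 64 patch tokens + 1 class token
print(ViTEncoderBlock()(tokens).shape)    # torch.Size([1, 65, 512])
```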

Fig. 2 ViT architecture [19]

4 Methodology

Our approach consists of three primary phases: preprocessing, detection, and prediction. These phases are illustrated in Fig. 3. Within the preprocessing phase, we extract frames from the video, improve each frame’s quality, distinguish the background from the foreground, and then align them accordingly. The subsequent stage is detection, during which the regions encompassing the face, nose, and eyes are identified and cropped from the frame. The cropped face then undergoes detection through three distinct pathways: the first focuses on eye detection, the second on nose detection, and the third on face detection.

Within both the eye and nose pathways, the eyes and nose are extracted from the face and, after cropping, passed to two models, A and B. Each model utilizes a different architecture and a unique layer configuration, which will be detailed in the subsequent sections. The outcomes of these models are integrated into the final prediction. In the face pathway, the face is directed to Model C, which employs yet another architecture and number of layers; the results of this model also contribute to the overall prediction.

To ensure reliability, despite the capability of the eye, nose, and face pathways to generate predictions individually, we implement a majority voting approach to consolidate all results into a single outcome. Consequently, predictions can be made independently for each pathway or using the majority voting approach.

Fig. 3 System architecture for preprocessing, detection, and prediction

4.1 The preprocessing component

The initial data preparation phase involves converting the raw dataset into suitable formats for training, validation, and testing purposes. Our model training and evaluation were conducted using the FaceForensics++ dataset, which comprises authentic and manipulated facial videos. The preprocessing stage employs four distinct subcomponents: extracting frames, improving each frame's quality, distinguishing the background from the foreground, and then aligning the frames accordingly.

Figure 4 shows an example of how the preprocessing steps work.

Fig. 4 Example of how our preprocessing steps work

4.2 The detection component

Frame extraction entails isolating individual frames from video files, while face detection leverages multitask cascaded convolutional networks (MTCNNs) to pinpoint faces within each frame. Face alignment corrects variations in head pose and facial expression by standardizing the alignment of each face.

Subsequently, face cropping trims the aligned face images to a consistent size. Meanwhile, the extraction and cropping of eyes and nose involve identifying and isolating corresponding regions from the aligned face images.
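As a rough sketch of these preprocessing steps, the code below extracts every n-th frame with OpenCV and crops aligned faces with an MTCNN. We use the facenet-pytorch package purely for illustration (the paper does not name a specific MTCNN implementation), and the video path, sampling stride, crop size, and margin are our assumptions.

```python
import cv2
from PIL import Image
from facenet_pytorch import MTCNN  # one common MTCNN implementation; others work too

mtcnn = MTCNN(image_size=224, margin=20, post_process=False)

def extract_faces(video_path, every_n=10):
    """Read every n-th frame from a video and return MTCNN-cropped face tensors."""
    faces, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            face = mtcnn(rgb)          # aligned, cropped face tensor, or None
            if face is not None:
                faces.append(face)
        idx += 1
    cap.release()
    return faces

faces = extract_faces("sample_video.mp4")  # hypothetical input file
print(len(faces), "face crops extracted")
```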

The proposed model for identifying deepfakes is composed of three primary models. These models encompass a CNN-based design tailored for extracting features related to the eyes and nose, an additional CNN-based structure serving the same purpose, and a fusion of a CNN module with a ViT module to analyze the entire face comprehensively. The assessment of machine learning model performance is carried out using K-fold cross-validation.

4.2.1 CNN-based architecture for eye and nose regions (Model A)

Model A is a DL architecture with 12 layers that follows a CNN-based approach. It consists of three blocks, each containing a Conv2D layer with ReLU activation to introduce nonlinearity, together with batch normalization, max pooling, and dropout layers to enhance performance and prevent overfitting. The model takes 50 × 50 input images and is trained on eye and nose features. The dataset is split into 80% for training and 20% for testing. The kernel size is (3, 3), the pool size is (2, 2), and the dropout rate is 0.3. The architecture ends with a fully connected dense layer of 512 units, a dropout layer, and an output layer of two dense units with the softmax activation function. The model uses the Adam optimizer with a learning rate of 0.0001 and is trained for 100 epochs with the sparse categorical cross-entropy loss. The model is shown in Fig. 5 and in the following pseudocode.

Algorithm 1 Model A (convolutional neural network with batch normalization and dropout)
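Since the pseudocode is reproduced only as a figure, the following Keras sketch reconstructs Model A from the hyper-parameters stated above (50 × 50 inputs, (3, 3) kernels, (2, 2) pooling, dropout 0.3, a 512-unit dense layer, softmax output, Adam at 0.0001, sparse categorical cross-entropy). The per-block filter counts (32, 64, 128) are not stated in the paper and are our assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model_a(input_shape=(50, 50, 3)):
    """Model A sketch: three conv blocks (Conv2D + batch norm + max pooling
    + dropout = 12 layers), then a 512-unit dense head with softmax output."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (32, 64, 128):  # per-block filter counts are assumed
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
        model.add(layers.Dropout(0.3))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(2, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

model_a = build_model_a()
model_a.summary()
# model_a.fit(x_train, y_train, epochs=100)  # with an 80/20 train/test split
```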

Fig. 5 CNN-based architecture (Model A)

4.2.2 CNN-based architecture for eye and nose regions (Model B)

Model B has a more streamlined architecture than Model A, comprising six layers in three blocks of Conv2D layers. These layers use the ReLU activation function and incorporate max pooling and dropout layers. Model B is trained on eye and nose features using the same 50 × 50 input size and the same dataset partition as Model A; the kernel size, pool size, activation function, dropout rate, and optimizer are likewise shared between the two models. Specifically, Model B is trained for 150 epochs on the eye region and 200 epochs on the nose region.

While both models adhere to a similar framework, Model A has additional layers and integrates batch normalization. The training process is otherwise consistent across both models, with minor epoch adjustments for the specific regions of interest (eye and nose). The architecture is visualized in Fig. 6 and the following pseudocode.

Algorithm 2 Model B (simple convolutional neural network)
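Analogously, a hedged Keras sketch of Model B under the same caveats: the six layers are read as three Conv2D + max-pooling pairs, and the filter counts and dropout placement are assumptions on our part.

```python
from tensorflow.keras import layers, models, optimizers

def build_model_b(input_shape=(50, 50, 3)):
    """Model B sketch: six layers read as three Conv2D + max-pooling pairs,
    with dropout in the head; no batch normalization."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (32, 64, 128):  # assumed filter counts
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(2, activation="softmax"))
    model.compile(optimizer=optimizers.Adam(learning_rate=0.0001),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# build_model_b().fit(..., epochs=150)  # eye region; 200 epochs for the nose region
```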

Fig. 6 CNN-based architecture (Model B)

4.2.3 CNN-based architecture combined with vision transformer (Model C)

Model C is a distinctive architecture for image classification, characterized by a larger input size of 224 × 224 and no data augmentation. It incorporates convolutional layers with a kernel size of (3, 3) and uses a patch size of 7. Designed for binary classification, the model uses 512 channels in its intermediate feature maps. Eight heads in the multi-head self-attention mechanism capture long-range dependencies, while a multilayer perceptron (MLP) component with a 2048-dimensional hidden space performs nonlinear transformations.

The network has a depth of 6 repeated encoder layers and applies a weight decay of 0.0000001 to prevent overfitting. Using CrossEntropyLoss as its loss function, the model is trained with a batch size of 32. Its architecture unfolds as follows: a feature learning (FL) component of 17 convolutional layers is followed by a ViT module, which splits the input feature map into patches and applies a transformer-based encoder.

In the final step, the outputs of the three models are combined through a majority voting mechanism, yielding a final prediction that distinguishes genuine from manipulated videos and thereby enhancing the precision of deepfake detection. The model's architecture is shown in Fig. 7 and the following pseudocode; the hyper-parameters are listed in Tables 1 and 2.

Algorithm 3 Model C (vision transformer)
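The PyTorch sketch below illustrates the CNN + ViT combination using the stated hyper-parameters (224 × 224 inputs, patch size 7, 512 channels, 8 heads, 2048-dimensional MLP, depth 6, weight decay 0.0000001, CrossEntropyLoss). For brevity, the 17-layer feature-learning component is abbreviated to a short convolutional stack, so this is an illustrative approximation rather than the authors' exact Model C.

```python
import torch
import torch.nn as nn

class CViTSketch(nn.Module):
    """Abbreviated CNN + ViT sketch for Model C: a conv feature extractor,
    patch embedding, class token, and a transformer encoder."""
    def __init__(self, num_classes=2, dim=512, depth=6, heads=8,
                 mlp_dim=2048, patch=7):
        super().__init__()
        self.features = nn.Sequential(                 # 224 -> 28 spatially
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.to_patches = nn.Conv2d(256, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 17, dim))  # 16 patches + cls
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               dim_feedforward=mlp_dim,
                                               batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)       # MLP head on the class token

    def forward(self, x):                              # x: (batch, 3, 224, 224)
        p = self.to_patches(self.features(x))          # (batch, dim, 4, 4)
        p = p.flatten(2).transpose(1, 2)               # (batch, 16, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, p], dim=1) + self.pos_embed
        return self.head(self.encoder(tokens)[:, 0])

model = CViTSketch()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-7)
print(model(torch.randn(2, 3, 224, 224)).shape)        # torch.Size([2, 2])
```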

Fig. 7 Convolutional vision transformer (Model C) [30]

Table 1 Hyper-parameters for both Model A & B (Eye & Nose)
Table 2 Hyper-parameters for Model C (Face)

4.3 The predicting component

To determine the authenticity of a video, we employed a majority voting approach by merging the results obtained from three models applied to three different features, which resulted in a total of three individual predictions. By considering the collective opinion of multiple models, our approach aims to enhance the accuracy and robustness of deepfake detection. This comprehensive method considers various aspects and characteristics of the video, increasing the likelihood of accurate classification. Using majority voting can improve the effectiveness of deepfake detection and contribute to more reliable identification of fake content.
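A minimal sketch of this voting rule, assuming each pathway emits a binary label (1 = fake, 0 = real); with three voters and two classes, a strict majority always exists:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-pathway labels into a final verdict by majority."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-video outputs of the eye, nose, and face pathways:
eye_pred, nose_pred, face_pred = 1, 0, 1
print("fake" if majority_vote([eye_pred, nose_pred, face_pred]) else "real")  # fake
```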

5 Experiment

These experiments were conducted to identify the key facial regions that effectively determine whether a video is fake. Three regions were targeted: the eyes, the nose, and the entire face. Features of these three regions were extracted and evaluated using three different models, A, B, and C, each in a separate experiment. The final experiment combined the three and leveraged the features from all three regions through an ensemble (majority voting) technique.

5.1 Dataset description

The proposed model is evaluated using two datasets: FaceForensics++ and the Deepfake Detection Challenge (DFDC).

FaceForensics++ is a comprehensive and diverse dataset curated specifically for deepfake detection research. It is an extended variant of the original FaceForensics dataset, developed to address the challenge of increasingly sophisticated deepfake creation methods. The dataset contains a comprehensive collection of manipulated face images and corresponding authentic face photos, generated using various advanced deepfake methods, including but not limited to GANs and deep neural networks.

The FaceForensics++ dataset is meticulously labeled, supplying ground-truth information for each image to enable supervised learning. It covers a variety of poses, facial expressions, backgrounds, and lighting conditions to guarantee the generalizability and diversity of deepfake detection methods.

The DFDC dataset is a widely recognized benchmark in deepfake detection. It comprises a large-scale collection of deepfake and real videos spanning different identities, scenarios, and individuals.

The dataset is meticulously labeled, indicating the authenticity of each video and enabling the assessment of deepfake detection methods through supervised learning.

5.2 Performance measurement

We employed several metrics and techniques to evaluate our models' performance. Firstly, we utilized the ROC curve to assess the models in terms of true-positive and false-positive rates; the ROC curve provided valuable insights into the models' ability to discriminate between classes. Additionally, we examined the confusion matrix, which provided a detailed breakdown of the models' classifications, including true positives, true negatives, false positives, and false negatives. We also monitored the accuracy and loss curves during training; these curves allowed us to track the models' learning progress and identify signs of overfitting or underfitting, and we aimed for high accuracy and low loss to ensure the models' effectiveness in detecting deepfakes. Furthermore, we utilized the classification report, which provided a comprehensive overview of performance across classes, including precision, recall, F1 score, and support for each class, enabling a more detailed assessment of the models' effectiveness. Together, these evaluation techniques gave us valuable insight into our models' performance and informed decisions about their deployment in detecting deepfakes.
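These metrics can be computed with scikit-learn, as in the following sketch on hypothetical labels and scores (the data shown are illustrative, not our experimental outputs):

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_curve, auc)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])           # illustrative labels (1 = fake)
y_score = np.array([.1, .4, .8, .7, .9, .3, .6, .2])  # illustrative model scores
y_pred = (y_score >= 0.5).astype(int)                 # threshold the scores

print(confusion_matrix(y_true, y_pred))               # TN/FP/FN/TP breakdown
print(classification_report(y_true, y_pred, target_names=["real", "fake"]))
fpr, tpr, _ = roc_curve(y_true, y_score)              # points on the ROC curve
print("AUC:", auc(fpr, tpr))
```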

5.3 Eye region experiment

This experiment aimed to build deepfake detection models targeting the eye region. We assessed the effectiveness of features extracted from the eye region using two distinct CNN-based models, Models A and B. Table 3 shows the results of the eye region experiments for both models.

Table 3 Result of eye region experiment

Also, the accuracy and loss curves, confusion matrices, and ROC curves for Models A and B are shown in Figs. 8 and 9, respectively.

Fig. 8 The accuracy and loss curves, confusion matrix, and ROC curve of Model A in (a–c), respectively, on eye region features

Fig. 9 The accuracy and loss curves, confusion matrix, and ROC curve of Model B in (a–c), respectively, on eye region features

5.4 Nose region experiment

These experiments evaluated and compared the performance of Model A and Model B in detecting deepfakes in the nose region, assessing their effectiveness in detecting manipulated facial features in this area. Model A was trained for 100 epochs, while Model B was trained for 200 epochs. The results are shown in Table 4.

Table 4 Result of nose region experiments

The experiments provide insights into the effectiveness and capabilities of the models, helping us determine their suitability for deepfake detection tasks in this specific region.

Also, the accuracy and loss curves, confusion matrices, and ROC curves for Models A and B are shown in Figs. 10 and 11, respectively.

Fig. 10 The accuracy and loss curves, confusion matrix, and ROC curve of Model A in (a–c), respectively, on nose region features

Fig. 11 The accuracy and loss curves, confusion matrix, and ROC curve of Model B in (a–c), respectively, on nose region features

5.5 Face region experiments

This experiment aimed to evaluate and compare the performance of deepfake detection models in detecting facial manipulations in the face region, and to assess the effectiveness of various preprocessing techniques, datasets, and training configurations in detecting and classifying deepfakes based on facial features. We conducted six experiments:

  • Experiment 1: trained on the DFDC dataset using the BlazeFace model for video preprocessing, with a learning rate of 0.0001 for 100 epochs.

  • Experiment 2: DFDC dataset with both the BlazeFace and Face Recognition models for video preprocessing, with a learning rate of 0.001 for 50 epochs.

  • Experiment 3: DFDC dataset with the MTCNN model for video preprocessing, with a learning rate of 0.001 for 100 epochs using the Adam optimizer.

  • Experiment 4: FaceForensics++ dataset with the MTCNN and Face Recognition models for video preprocessing, with a learning rate of 0.001 for 50 epochs using the SGD optimizer.

  • Experiment 5: FaceForensics++ dataset with the MTCNN model for video preprocessing and cross-validation, with a learning rate of 0.0001 for 25 epochs.

  • Experiment 6: FaceForensics++ dataset with the MTCNN model for video preprocessing, with a learning rate of 1e-7 for 100 epochs and a batch size of 32, using early stopping to prevent overfitting.

By comparing the results of these experiments, shown in Table 5, we can gain insight into the performance of our deepfake detection models in the face region and the effectiveness of different preprocessing techniques, datasets, and training configurations in detecting and classifying deepfakes based on facial features.

Table 5 Result of face region experiments

Also, the accuracy curve, loss curve, confusion matrix, and classification results of Model C on the face region are shown in Fig. 12.

Fig. 12 The accuracy curve, loss curve, confusion matrix, and classification results of Model C on the face region

5.6 Ensemble model

To enhance the performance of our models, we utilize an ensemble technique, majority voting. To attain the final results, we applied this technique to 100 videos. The confusion matrix and ROC curve results are shown in Fig. 13.

Fig. 13 Majority voting results

6 Discussion

6.1 Overview of the existing deepfake detection techniques

Deepfake detection has been a hot topic in the research community, and several techniques have been proposed. Some existing deepfake detection techniques are based on ML algorithms, such as CNNs, RNNs, and autoencoders. These techniques work by extracting features from the deepfake images or videos and comparing them with the features of the original images or videos. However, these techniques have several limitations, such as the need for large amounts of training data [26, 37], susceptibility to adversarial attacks [14], and the inability to detect unseen deepfakes [1, 36].

6.2 Experiments analysis

In the eye region experiments, Model A achieved a commendable accuracy of 96% after 100 epochs, while Model B reached approximately 97% after 150 epochs [8]. Comparatively, the approach presented in [32] yielded 90% accuracy after 50 epochs, rising to an impressive 98.3% after 200 epochs on the same dataset, FaceForensics++. This suggests that the proposed technique could surpass the 98.3% threshold with a longer training duration, given adequate computational resources.

In the domain of deepfake detection, our investigation revealed no previously described or implemented methodologies centered on the nose region; the proposed technique therefore offers a novel and distinctive approach.

In the broader scope of the face region, running six distinct experiments holds inherent value and should not be regarded as redundant: each experiment varies the datasets, preprocessing models, learning rates, epochs, optimizers, and other relevant parameters. Notably, the final experiment achieved a test accuracy of 85%, effectively curbing overfitting and underfitting while minimizing training loss.

Incorporating multiple subsets of the dataset during the training phase further contributed to the robustness of the outcome. It is essential to acknowledge that an increased allocation of computational resources could allow training on the entire dataset and an even greater variety of subsets, consequently augmenting accuracy. Worth noting is that the methodology employed in [30] attained accuracies of 69% on the FaceForensics++ FaceSwap and 93% on the FaceForensics++ DeepFake collections, underscoring its reliance on individual training for each collection, which hinders accurate detection across collections.

6.3 Advantages of the proposed methodology

Our paper proposes a novel technique for deepfake detection that combines three models based on different features: the entire face, the eyes, and the nose. While combining multiple models only slightly affected overall accuracy, it improves the robustness of deepfake detection, reducing the impact of weaknesses in any single algorithm. Additionally, we develop a customized data processing stage for each model to detect deepfakes with high reliability. Our proposed technique also benefits from the large amount of data used for training, including datasets such as FaceForensics++.

6.4 Comparison with the state-of-the-art methodologies

Our proposed technique has several advantages over existing deepfake detection techniques. For example, the technique proposed in [17] extracts features from the image and then applies a transformer-based model to classify it. Although this technique has shown promising results, it requires a large amount of training data and substantial computational resources. In contrast, our proposed technique combines multiple models that operate on different features, increasing detection accuracy to 85% while reducing dependence on any single algorithm.

Similarly, the technique proposed in [11] is based on a CNN trained specifically on the eye region of the face. Although this technique is effective in detecting deepfakes that involve changes in the eye region, it may be less effective in detecting deepfakes that involve changes in other parts of the face.

A comparison of recent deepfake detection techniques is shown in Table 6. Model C utilizes the CViT algorithm as implemented in [30]. In [30], the CViT algorithm was applied to the entire face region, achieving 69% accuracy on the FaceForensics++ database; when we applied the same algorithm to the entire face, eye, and nose regions instead of the face alone, we achieved 97% accuracy. Similarly, in [32], a CNN using eye-region features achieved 90% accuracy, while applying the same algorithm to the different regions (entire face, eyes, nose) yielded an acceptable level of accuracy. As shown in Table 6, the baseline for comparing our system with other references lies in using the same databases and algorithms; our approach differs by including multiple facial regions (face, eyes, nose) instead of the face alone, leading to higher accuracy.

6.5 Limitations and future work

Our proposed technique has certain limitations, such as the need for high computational resources for training and inference. Additionally, the technique may not be effective in detecting deepfakes that involve changes in parts of the face other than the eyes, nose, and entire face. Future research could focus on developing methods that require less data while maintaining high accuracy rates. We also plan to investigate the use of other features for deepfake detection.

Table 6 Comparison between recent deepfake detection techniques

7 Managerial implications

There are some key managerial implications for this study:

  1. Investment in Advanced AI Technologies: Organizations should consider investing in advanced AI technologies like convolutional vision transformers and convolutional neural networks. This investment is crucial for staying ahead in the rapidly evolving digital landscape, especially for sectors vulnerable to deepfake-related risks like media, politics, and finance.

  2. Training and Skill Development: The adoption of these technologies necessitates specialized skills. Therefore, managers should focus on training programs for their technical staff to handle and utilize these advanced systems effectively.

  3. Enhancing Digital Security Protocols: Integrating these deepfake detection methods into existing digital security protocols can significantly boost an organization's ability to combat misinformation and protect its digital assets.

  4. Ethical and Legal Compliance: Managers must ensure the ethical use of AI for deepfake detection, complying with privacy laws and avoiding biases in AI algorithms.

  5. Strategic Decision-Making: Leveraging insights from these technologies can aid in strategic decision-making, especially in content verification and public relations.

  6. Collaborative Efforts: Collaborating with technology providers, academic researchers, and industry peers can enhance the understanding and efficacy of these deepfake detection methods.

  7. Adaptation to Technological Advancements: Given the rapid advancement in deepfake technologies, organizations must stay updated with the latest developments to ensure their defensive measures remain effective.

These implications underscore the importance of a proactive approach in adopting and integrating advanced AI technologies for digital security and integrity.

8 Conclusion

In this study, we introduced a groundbreaking method for deepfake detection, leveraging a fusion of distinct facial features and a comprehensive dataset enhanced by meticulous preprocessing. Our strategy entailed the development of a composite model integrating three sub-models, each specializing in recognizing deepfakes by analyzing a specific facial element: the entire face, the eyes, or the nose. This multifaceted approach is further strengthened by tailored data processing for each sub-model, circumventing the constraints typically encountered in single-algorithm detection methods. Our training regimen utilized an expansive array of facial images from extensive datasets such as FaceForensics++, which was pivotal in refining our model's ability to discern the visual anomalies indicative of deepfakes.

The empirical evidence from our tests revealed a significant enhancement in accuracy and efficiency over existing deepfake detection methods, thereby establishing the strength of our approach. A standout feature of our method is its robust performance across diverse scenarios, encompassing various environmental conditions and facial orientations, illustrating its practical applicability in real-world settings. This adaptability underscores our model's ability to identify deepfakes of high visual fidelity, an essential attribute in the current digital era.

The implications of our work are far-reaching, addressing the pressing demand for reliable deepfake detection to thwart the proliferation of misinformation and other harmful digital content. Applying our approach has the potential to safeguard individuals, organizations, and society at large from the adverse impacts of deepfakes, thereby contributing significantly to digital security and integrity. Although our results are promising, we recognize the scope for further enhancement: future research could integrate additional facial features or employ alternative datasets to augment the accuracy and operational efficiency of deepfake detection. Such advancements will fortify our method's effectiveness and contribute to the broader field of digital media authenticity.