1 Introduction

Deepfake is a deep learning-based approach that superimposes the face of a target person onto that of a person in a video, creating footage of the target apparently doing or saying things the original subject did or said. Deepfake methods cause harm because they can be used to defame celebrities, sow confusion and chaos in financial markets through fake news, and mislead individuals. The first deepfake video appeared in 2017, when a Reddit user swapped celebrities' faces into pornographic videos. Since then, several methods have been introduced to detect deepfake videos.

The technologies used to edit pictures, videos, and voice-overs are developing rapidly, and the techniques and technical know-how for creating and manipulating digital content are easily accessible. It is now possible to create hyper-realistic digital pictures using a few tools and straightforward how-to guides that are freely available online [9]. Deepfake technology aims to create convincing fake videos that are difficult to distinguish from real ones. While this technology has potential legitimate uses, it also poses significant challenges, enabling the spread of misinformation and other forms of malicious content. As the use of deepfakes continues to increase, effective detection methods are needed to protect society as a whole [27].

Since deepfakes began to spread, many researchers have studied deepfake algorithms to search for weaknesses in them. However, existing solutions tend to fail as current algorithms improve, as new algorithms emerge, when large amounts of data are unavailable, or depending on the data processing stage applied [34].

The application of machine learning (ML) methods across various disciplines has been escalating, underscoring their adaptability and efficiency. Notably, these methods have been instrumental in medical diagnostics, particularly in the early detection of chronic kidney disease [3] and the enhancement of heart disease prediction algorithms [4]. Beyond the medical field, numerical and computational methodologies have seen extensive utilization in addressing intricate mathematical challenges. This is exemplified in the resolution of convection-diffusion equations [20], the investigation of fractional Stokes problems [22], and the approach to solving time-dependent partial differential equations [21]. These applications highlight the broad scope and transformative impact of ML and computational techniques in both scientific research and practical problem-solving.

Deep learning (DL) methods have been used to detect fake videos effectively and efficiently [24]; the large scale and high dimensionality of deepfake video data are the main reasons DL methods achieve strong results. Social media has developed rapidly, and users rely on platforms such as WhatsApp, Twitter, Facebook, and YouTube for the latest updates, so these platforms must isolate fake videos and misleading information from massive volumes of user content [18]. There is a real risk that such manufactured videos will be shared and disseminated across social media platforms [11]. Work in this area faces multiple challenges, including (i) selecting the most important features, (ii) handling videos with high heterogeneity and dimensionality, and (iii) choosing the proper DL model [17].

One of the best-known DL methods is the convolutional neural network (CNN), which is widely used because it automatically extracts low- and high-level features from datasets; these methods have therefore attracted researchers' interest worldwide [24]. Li et al. [17] utilized different CNN structures: MesoInception4, InceptionV3, ResNet50, GoogLeNet, XceptionNet, Meso4, FWA-based Dual Spatial Pyramid, and VGG19-based CapsuleNet. These structures were trained on various deepfake datasets and tested on the Celeb-DF dataset. Kumar et al. [15] extracted facial regions from the video frames of the Celeb-DF dataset using a multitask CNN and then applied the XceptionNet architecture. Wodajo et al. [25] extracted facial regions using three DL face detection techniques: face recognition, multitask CNN, and BlazeFace; a set of convolution blocks was then used as a feature extractor, followed by a vision transformer relying on an attention mechanism to detect deepfake videos.

This research introduces a DL-based method for detecting deepfakes. The proposed system comprises three components: preprocessing, detection, and prediction. The preprocessing step includes frame extraction, face detection, face alignment, face cropping, eye cropping, and nose cropping. In the detection step, we use CNN-based architectures for eye and nose feature detection, and a CNN combined with a vision transformer for whole-face detection. In the prediction component, we apply a majority voting approach, merging the outputs of the three models applied to the three different features, which yields three individual predictions that are combined into one.

1.1 Motivations

The rapid evolution of deepfake technology presents a critical challenge in distinguishing authentic visual content from sophisticated fakes. As these manipulated videos infiltrate online platforms, the risk of misinformation dissemination and societal discord escalates. The pervasive nature of deepfakes, especially on social media, undermines trust and integrity in digital media, necessitating urgent intervention. The escalating threat of deepfake proliferation demands innovative solutions capable of combating misinformation. Leveraging advanced DL techniques like convolutional neural networks (CNNs) and convolutional vision transformers (CVTs), this research endeavors to devise an effective methodology to detect deepfakes. By integrating these cutting-edge technologies, organizations can enhance their capabilities in identifying and mitigating the risks posed by deepfake content. Investing in the development and implementation of deepfake detection systems utilizing CNNs and CVTs is crucial for staying ahead in the rapidly evolving digital landscape, particularly for sectors vulnerable to deepfake-related risks such as media, politics, and finance. The quest for reliable techniques is pivotal to safeguarding information integrity, bolstering trust in visual media, and strategically managing digital risks.

1.2 Contributions

The main contributions of our work are:

  • A combined framework of three models, each of which detects deepfakes from one of the following regions: the entire face, the eyes, and the nose.

  • Developing a customized data preprocessing stage for each model to detect deepfakes reliably, avoid the limitations of any single detection algorithm, and identify deepfakes produced in various settings, environments, and orientations.

  • We train our models on a wide variety of face images using the FaceForensics++ and DFDC datasets.

  • A comparison with different DL methods used in detecting deepfakes is presented in terms of accuracy, precision, recall, and F-measure.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the main concepts of the methods used, the CNN and the vision transformer. Section 4 describes the methodology, including data preprocessing and model development. Section 5 presents the experiments conducted to evaluate the proposed approach, and Sect. 6 discusses the experimental results. Section 7 outlines managerial implications. Finally, Sect. 8 concludes the paper and discusses future work.

2 Related work

The emergence of deepfake videos as a significant threat to online security and privacy has spurred considerable research into developing methods for detecting them. This section reviews some of the most relevant work in this field [35].

One of the notable studies in deepfake technology was conducted by Nguyen et al. [23], which provides a comprehensive overview of deepfake technology. They explored the risks posed by deepfakes and focused on surveying the algorithms used in their creation and the methods employed for detecting them. The paper examined the challenges, research trends, and future directions in deepfakes. By analyzing state-of-the-art deepfake detection methods and reviewing the background of deepfakes, the study offered valuable insights into the current landscape of deepfake technology. The paper’s primary objective was to facilitate the development of more robust methods for addressing the increasing sophistication and prevalence of deepfakes. By understanding the advancements in deepfake detection, researchers and practitioners can work toward effectively countering the threats associated with this technology.

Li et al. [16] discussed the challenges posed by AI-generated fake face videos and the need for effective detection methods. They highlighted the ease of creating and spreading manipulated videos due to advancements in camera technology and the popularity of social networks. They specifically focused on the emergence of deepfake, a technique that uses generative adversarial networks (GANs) to create realistic fake videos by replacing human faces. Traditional forensic methods face difficulties detecting AI-generated fake face videos, prompting them to propose a novel forensic method based on detecting the absence of physiological signals, such as eye blinking. They introduced a DL model combining a CNN with a recurrent neural network (RNN) to capture the temporal regularities of eye blinking. The long-term recurrent convolutional neural network (LRCN) method leverages previous temporal knowledge to predict eye states accurately. The evaluation of the method on benchmark eye-blinking detection datasets shows promising results: LRCN outperforms CNN and EAR (eye aspect ratio) methods, achieving a higher accuracy of 0.99 compared with CNN's 0.98 and EAR's 0.79. While CNN performs well within individual frames, it lacks temporal knowledge, making it sometimes less reliable. LRCN, with its consideration of long-term dynamics, provides smoother and more accurate predictions, even in challenging scenarios.

Karandikar et al. [12] addressed the significant issue of deepfakes, realistic but deceptive images and videos created using artificial intelligence. These deepfakes pose risks such as spreading false information, political bias, defamation, and piracy. The paper focused on detecting face manipulations, specifically the expression and identity swaps commonly used in deepfakes. The proposed method trains a classifier on video frames, which undergo face extraction and alignment to address faults introduced during deepfake creation: face extraction captures the relevant area, while face alignment adjusts for different head positions. The classifier uses a fine-tuned convolutional model based on the VGG-16 architecture with additional layers. Dataset preprocessing involved face alignment and extraction, enhancing the data for training, and transfer learning was employed to leverage learned features for temporal analysis and improve deepfake detection. The model achieves an accuracy of approximately 70% based on image analysis features. The paper also discussed training challenges, such as low-resolution images and compression artifacts: the model performs well with low-resolution images, though enhancing the dataset's resolution remains an area for improvement, and compression artifacts are addressed through temporal analysis techniques to mitigate errors during learning.

Karandikar et al. [13] discussed the prevalence of deepfake videos and the potential harm they can cause, including the spread of fake news and misinformation. They focused on detecting deepfake videos using residual neural network (ResNet50) and long short-term memory (LSTM) models. Deepfakes were created using GANs, where a generator network produces fake data, and a discriminator network distinguishes between real and fake data. Various techniques can detect flaws in deepfake videos, such as phoneme-viseme mismatches, appearance analysis, eye-blinking patterns, and facial artifact analysis. The proposed approach utilized a learning-based method, where a model was trained to learn features from natural and fake videos. The dataset was preprocessed to extract faces at the frame level, and ResNet50 was used for feature extraction. LSTM was then employed to handle the sequential nature of video frames. The Softmax function was used to classify videos as genuine or fake. The model architecture consists of ResNet50 for feature extraction, LSTM for sequence processing, and Softmax for video classification. The trained model achieves high accuracy on both the training and validation sets.

Wodajo and Atnafu [30] developed a deepfake detection framework employing a convolutional vision transformer (CViT) architecture, demonstrating significant efficacy in their approach. Their model, trained on a comprehensive dataset comprising both manipulated and authentic videos, achieved a noteworthy accuracy of 98.5%. While their results are impressive, it is important to note that their methodology primarily focuses on analyzing entire facial regions. This approach differs from ours, which extends beyond the whole face to include detailed examinations of the eye and nose regions. Additionally, the CViT architecture, while effective, necessitates substantial computational resources, including a high-performance GPU, for efficient model training. This requirement could potentially limit the applicability of their framework in resource-constrained environments.

Yang et al. [33] presented a novel view by formulating deepfake identification as a graph classification problem in which each facial region corresponds to a vertex. However, highly redundant relational information hinders the expressiveness of such graphs. Motivated by the success of masked modeling, they proposed masked relation learning, which reduces redundancy to learn informative relational features: a relation learning module masks partial correlations between regions and then propagates relational information across regions to detect abnormalities from a global graph view.

Unlike the studies above, this work seeks an optimal deepfake detection model that outperforms previous works in the literature. To that end, we deployed and compared several DL models.

3 Background

This section introduces the main concepts of the methods used, CNN and vision transformer.

3.1 Convolutional neural network

CNN is a DL algorithm commonly used in computer vision tasks, such as image recognition and object detection [10].

CNNs are designed to automatically learn and extract relevant features from input data, particularly images [29]; their design is inspired by the organization and functioning of the visual cortex in animals. The critical component of a CNN is the convolutional layer, which performs convolution operations on the input data using a set of learnable filters, or kernels [2].

The convolution layer applies these filters across the input data to detect patterns and features at different spatial locations. It captures local dependencies and spatial hierarchies, allowing the network to learn complex representations of the input images. Pooling layers are often used in CNNs to reduce the spatial dimensions and extract the most relevant information [5, 32]. Figure 1 shows the CNN architecture.

CNNs also consist of fully connected layers responsible for making predictions based on the learned features. These layers take the output of the convolutional layers, flatten it, and pass it through one or more fully connected layers, ultimately producing the final classification or regression output [7].
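For concreteness, the following minimal NumPy sketch (ours, for illustration only) shows the two core operations described above: a valid 2-D convolution with a hand-crafted edge filter, followed by a ReLU nonlinearity and non-overlapping max pooling. The input, filter, and sizes are arbitrary toy choices.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as used in CNNs)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling that halves each spatial dimension."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(8, 8)                              # toy grayscale input
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])                   # vertical-edge filter
feature_map = np.maximum(conv2d(image, edge_kernel), 0)   # convolution + ReLU
pooled = max_pool2d(feature_map)                          # (6, 6) -> (3, 3)
print(feature_map.shape, pooled.shape)
```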

Fig. 1 An overview of CNN architecture

3.2 Vision transformers

The vision transformer (ViT) model architecture, introduced in [6], extends the Transformer architecture of [28] to the image domain. Developed by the Google Research Brain Team, the ViT model adapts the Transformer implementation to handle image data.

In the ViT model, an input image is divided into fixed-size patches that serve as visual tokens. These visual tokens are embedded into fixed-dimensional encoded vectors, and the position information of each patch is likewise embedded and combined with the encoded vectors. The transformer encoder network then processes this combined representation much as it would a sequence of text tokens [31].

The ViT encoder consists of multiple blocks, each composed of layer normalization, multi-head self-attention, and multilayer perceptron (MLP) components. Layer normalization stabilizes the training process and allows the model to adapt to variations among training images. The multi-head self-attention network generates attention maps from the embedded visual tokens, helping the network focus on the most important regions in the image. The MLPs serve as a two-layer classification network, and the final MLP block, known as the MLP head, serves as the transformer's output. Applying softmax to this output provides classification labels, such as in image classification [19].

The architecture of a ViT involves the sequential processing of visual tokens through the ViT encoder blocks, ultimately leading to the final MLP head for classification. Figure 2 shows the ViT architecture.
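As an illustration of the encoder block just described, the following PyTorch sketch (our own, not the paper's implementation) wires layer normalization, multi-head self-attention, and an MLP into a pre-norm residual block. The dimensions match the Model C hyper-parameters given later (512-dimensional tokens, 8 heads, 2048-dimensional MLP), but any consistent values would do.

```python
import torch
import torch.nn as nn

class ViTEncoderBlock(nn.Module):
    """One ViT encoder block: LayerNorm -> multi-head self-attention ->
    LayerNorm -> MLP, each wrapped in a residual connection (pre-norm form)."""
    def __init__(self, dim=512, heads=8, mlp_dim=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, x):                                   # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        x = x + self.mlp(self.norm2(x))                     # MLP + residual
        return x

tokens = torch.randn(1, 65, 512)          # e.g. 64 patch tokens + 1 class token
print(ViTEncoderBlock()(tokens).shape)    # torch.Size([1, 65, 512])
```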

Fig. 2 ViT architecture [19]

4 Methodology

Our approach consists of three primary phases: preprocessing, detection, and prediction. These phases are illustrated in Fig. 3. Within the preprocessing phase, we extract frames from the video, improve each frame’s quality, distinguish the background from the foreground, and then align them accordingly. The subsequent stage is detection, during which the regions encompassing the face, nose, and eyes are identified and cropped from the frame. The cropped face then undergoes detection through three distinct pathways: the first focuses on eye detection, the second on nose detection, and the third on face detection.

Within both the eye and nose pathways, the eyes and nose are extracted from the face and, after cropping, passed to two models, A and B. Each model utilizes a different architecture and a unique layer configuration, which will be detailed in the subsequent sections. The outcomes of these models are integrated into the final prediction. In the face pathway, the face is directed to Model C, which employs yet another architecture and number of layers; the results of this model also contribute to the overall prediction.

To ensure reliability, despite the capability of the eye, nose, and face pathways to generate predictions individually, we implement a majority voting approach to consolidate all results into a single outcome. Consequently, predictions can be made independently for each pathway or using the majority voting approach.

Fig. 3 System architecture for preprocessing, detection, and prediction

4.1 The preprocessing component

The initial data preparation phase involves converting the raw dataset into suitable formats for training, validation, and testing purposes. Our model training and evaluation were conducted using the FaceForensics++ dataset, which comprises authentic and manipulated facial videos. The preprocessing stage employs four distinct subcomponents: extracting frames, improving each frame's quality, distinguishing the background from the foreground, and then aligning the frames accordingly.

Figure 4 shows an example of how the preprocessing steps work.

Fig. 4 Example of how our preprocessing steps work

4.2 The detection component

Frame extraction entails isolating individual frames from video files, while face detection leverages multitask cascaded convolutional networks (MTCNNs) to pinpoint faces within each frame. Face alignment corrects variations in head pose and facial expression by standardizing the alignment of each face.

Subsequently, face cropping trims the aligned face images to a consistent size. Meanwhile, the extraction and cropping of eyes and nose involve identifying and isolating corresponding regions from the aligned face images.
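As a rough sketch of these preprocessing steps, the code below extracts every n-th frame with OpenCV and crops aligned faces with an MTCNN. We use the facenet-pytorch package purely for illustration (the paper does not name a specific MTCNN implementation), and the video path, sampling stride, crop size, and margin are our assumptions.

```python
import cv2
from PIL import Image
from facenet_pytorch import MTCNN  # one common MTCNN implementation; others work too

mtcnn = MTCNN(image_size=224, margin=20, post_process=False)

def extract_faces(video_path, every_n=10):
    """Read every n-th frame from a video and return MTCNN-cropped face tensors."""
    faces, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            face = mtcnn(rgb)          # aligned, cropped face tensor, or None
            if face is not None:
                faces.append(face)
        idx += 1
    cap.release()
    return faces

faces = extract_faces("sample_video.mp4")  # hypothetical input file
print(len(faces), "face crops extracted")
```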

The proposed model for identifying deepfakes is composed of three primary models. These models encompass a CNN-based design tailored for extracting features related to the eyes and nose, an additional CNN-based structure serving the same purpose, and a fusion of a CNN module with a ViT module to analyze the entire face comprehensively. The assessment of machine learning model performance is carried out using K-fold cross-validation.

4.2.1 CNN-based architecture for eye and nose regions (Model A)

Model A is a DL architecture with 12 layers that follows a CNN-based approach. It consists of three blocks, each containing a Conv2D layer with ReLU activation to introduce nonlinearity, together with batch normalization, max pooling, and dropout layers to enhance performance and prevent overfitting. The model takes 50 × 50 input images and is trained on eye and nose features. The dataset is split into 80% for training and 20% for testing. The kernel size is (3, 3), the pool size is (2, 2), and the dropout rate is 0.3. The architecture ends with a fully connected dense layer of 512 units, a dropout layer, and an output layer of two dense units with the softmax activation function. The model uses the Adam optimizer with a learning rate of 0.0001 and is trained for 100 epochs with the sparse categorical cross-entropy loss. The model is shown in Fig. 5 and in the following pseudocode.

Algorithm 1 Model A (convolutional neural network with batch normalization and dropout)
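Since the pseudocode is reproduced only as a figure, the following Keras sketch reconstructs Model A from the hyper-parameters stated above (50 × 50 inputs, (3, 3) kernels, (2, 2) pooling, dropout 0.3, a 512-unit dense layer, softmax output, Adam at 0.0001, sparse categorical cross-entropy). The per-block filter counts (32, 64, 128) are not stated in the paper and are our assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model_a(input_shape=(50, 50, 3)):
    """Model A sketch: three conv blocks (Conv2D + batch norm + max pooling
    + dropout = 12 layers), then a 512-unit dense head with softmax output."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (32, 64, 128):  # per-block filter counts are assumed
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
        model.add(layers.Dropout(0.3))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(2, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

model_a = build_model_a()
model_a.summary()
# model_a.fit(x_train, y_train, epochs=100)  # with an 80/20 train/test split
```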

Fig. 5 CNN-based architecture (Model A)

4.2.2 CNN-based architecture for eye and nose regions (Model B)

Model B has a more streamlined architecture than Model A, comprising six layers in three blocks of Conv2D layers. These layers use the ReLU activation function and incorporate max pooling and dropout layers. Model B is trained on eye and nose features using the same 50 × 50 input size and the same dataset partition as Model A; the kernel size, pool size, activation function, dropout rate, and optimizer are likewise shared between the two models. Specifically, Model B is trained for 150 epochs on the eye region and 200 epochs on the nose region.

While both models adhere to a similar framework, Model A has additional layers and integrates batch normalization. The training process is otherwise consistent across both models, with minor epoch adjustments for the specific regions of interest (eye and nose). The architecture is visualized in Fig. 6 and the following pseudocode.

Algorithm 2 Model B (simple convolutional neural network)
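Analogously, a hedged Keras sketch of Model B under the same caveats: the six layers are read as three Conv2D + max-pooling pairs, and the filter counts and dropout placement are assumptions on our part.

```python
from tensorflow.keras import layers, models, optimizers

def build_model_b(input_shape=(50, 50, 3)):
    """Model B sketch: six layers read as three Conv2D + max-pooling pairs,
    with dropout in the head; no batch normalization."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (32, 64, 128):  # assumed filter counts
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(2, activation="softmax"))
    model.compile(optimizer=optimizers.Adam(learning_rate=0.0001),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# build_model_b().fit(..., epochs=150)  # eye region; 200 epochs for the nose region
```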

Fig. 6 CNN-based architecture (Model B)

4.2.3 CNN-based architecture combined with vision transformer (Model C)

Model C is a distinctive architecture for image classification, characterized by a larger input size of 224 × 224 and no data augmentation. It incorporates convolutional layers with a kernel size of (3, 3) and uses a patch size of 7. Designed for binary classification, the model uses 512 channels in its intermediate feature maps. Eight heads in the multi-head self-attention mechanism capture long-range dependencies, while a multilayer perceptron (MLP) component with a 2048-dimensional hidden space performs nonlinear transformations.

The network has a depth of 6 repeated encoder layers and applies a weight decay of 0.0000001 to prevent overfitting. Using CrossEntropyLoss as its loss function, the model is trained with a batch size of 32. Its architecture unfolds as follows: a feature learning (FL) component of 17 convolutional layers is followed by a ViT module, which splits the input feature map into patches and applies a transformer-based encoder.

In the final step, the outputs of the three models are combined through a majority voting mechanism, yielding a final prediction that distinguishes genuine from manipulated videos and thereby enhancing the precision of deepfake detection. The model's architecture is shown in Fig. 7 and the following pseudocode; the hyper-parameters are listed in Tables 1 and 2.

Algorithm 3 Model C (vision transformer)
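The PyTorch sketch below illustrates the CNN + ViT combination using the stated hyper-parameters (224 × 224 inputs, patch size 7, 512 channels, 8 heads, 2048-dimensional MLP, depth 6, weight decay 0.0000001, CrossEntropyLoss). For brevity, the 17-layer feature-learning component is abbreviated to a short convolutional stack, so this is an illustrative approximation rather than the authors' exact Model C.

```python
import torch
import torch.nn as nn

class CViTSketch(nn.Module):
    """Abbreviated CNN + ViT sketch for Model C: a conv feature extractor,
    patch embedding, class token, and a transformer encoder."""
    def __init__(self, num_classes=2, dim=512, depth=6, heads=8,
                 mlp_dim=2048, patch=7):
        super().__init__()
        self.features = nn.Sequential(                 # 224 -> 28 spatially
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.to_patches = nn.Conv2d(256, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 17, dim))  # 16 patches + cls
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               dim_feedforward=mlp_dim,
                                               batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)       # MLP head on the class token

    def forward(self, x):                              # x: (batch, 3, 224, 224)
        p = self.to_patches(self.features(x))          # (batch, dim, 4, 4)
        p = p.flatten(2).transpose(1, 2)               # (batch, 16, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, p], dim=1) + self.pos_embed
        return self.head(self.encoder(tokens)[:, 0])

model = CViTSketch()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-7)
print(model(torch.randn(2, 3, 224, 224)).shape)        # torch.Size([2, 2])
```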

Fig. 7 Convolutional vision transformer (Model C) [30]

Table 1 Hyper-parameters for both Model A & B (Eye & Nose)
Table 2 Hyper-parameters for Model C (Face)

4.3 The predicting component

To determine the authenticity of a video, we employed a majority voting approach by merging the results obtained from three models applied to three different features, which resulted in a total of three individual predictions. By considering the collective opinion of multiple models, our approach aims to enhance the accuracy and robustness of deepfake detection. This comprehensive method considers various aspects and characteristics of the video, increasing the likelihood of accurate classification. Using majority voting can improve the effectiveness of deepfake detection and contribute to more reliable identification of fake content.
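A minimal sketch of this voting rule, assuming each pathway emits a binary label (1 = fake, 0 = real); with three voters and two classes, a strict majority always exists:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-pathway labels into a final verdict by majority."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-video outputs of the eye, nose, and face pathways:
eye_pred, nose_pred, face_pred = 1, 0, 1
print("fake" if majority_vote([eye_pred, nose_pred, face_pred]) else "real")  # fake
```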

5 Experiment

These experiments were conducted to identify the key facial regions that effectively determine whether a video is fake. Three regions were targeted: the eyes, the nose, and the entire face. Features of these three regions were extracted and evaluated using three different models, A, B, and C, each in a separate experiment. The final experiment combined the three and leveraged the features from all three regions through an ensemble (majority voting) technique.

5.1 Dataset description

The proposed model is evaluated using two datasets: FaceForensics++ and the Deepfake Detection Challenge (DFDC).

FaceForensics++ is a comprehensive and diverse dataset curated specifically for deepfake detection research. It is an extended variant of the original FaceForensics dataset, developed to address the challenge of increasingly sophisticated deepfake creation methods. The dataset contains a comprehensive collection of manipulated face images and corresponding authentic face photos, generated using various advanced deepfake methods, including but not limited to GANs and deep neural networks.

The FaceForensics++ dataset is meticulously labeled, supplying ground-truth information for each image to enable supervised learning. It covers a variety of poses, facial expressions, backgrounds, and lighting conditions to guarantee the generalizability and diversity of deepfake detection methods.

The DFDC dataset is a widely recognized benchmark in deepfake detection. It comprises a large-scale collection of deepfake and real videos spanning different identities, scenarios, and individuals.

The dataset is meticulously labeled, indicating the authenticity of each video and enabling the assessment of deepfake detection methods through supervised learning.

5.2 Performance measurement

We employed several metrics and techniques to evaluate our models' performance. Firstly, we utilized the ROC curve to assess the models in terms of true-positive and false-positive rates; the ROC curve provided valuable insights into the models' ability to discriminate between classes. Additionally, we examined the confusion matrix, which provided a detailed breakdown of the models' classifications, including true positives, true negatives, false positives, and false negatives. We also monitored the accuracy and loss curves during training; these curves allowed us to track the models' learning progress and identify signs of overfitting or underfitting, and we aimed for high accuracy and low loss to ensure the models' effectiveness in detecting deepfakes. Furthermore, we utilized the classification report, which provided a comprehensive overview of performance across classes, including precision, recall, F1 score, and support for each class, enabling a more detailed assessment of the models' effectiveness. Together, these evaluation techniques gave us valuable insight into our models' performance and informed decisions about their deployment in detecting deepfakes.
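These metrics can be computed with scikit-learn, as in the following sketch on hypothetical labels and scores (the data shown are illustrative, not our experimental outputs):

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_curve, auc)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])           # illustrative labels (1 = fake)
y_score = np.array([.1, .4, .8, .7, .9, .3, .6, .2])  # illustrative model scores
y_pred = (y_score >= 0.5).astype(int)                 # threshold the scores

print(confusion_matrix(y_true, y_pred))               # TN/FP/FN/TP breakdown
print(classification_report(y_true, y_pred, target_names=["real", "fake"]))
fpr, tpr, _ = roc_curve(y_true, y_score)              # points on the ROC curve
print("AUC:", auc(fpr, tpr))
```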

5.3 Eye region experiment

This experiment aimed to build deepfake detection models targeting the eye region. We assessed the effectiveness of features extracted from the eye region using two distinct CNN-based models, Models A and B. Table 3 shows the results of the eye region experiments for both models.

Table 3 Result of eye region experiment

Also, the accuracy and loss curves, confusion matrices, and ROC curves for Models A and B are shown in Figs. 8 and 9, respectively.

Fig. 8 The accuracy and loss curves, confusion matrix, and ROC curve of Model A in (a–c), respectively, on eye region features

Fig. 9 The accuracy and loss curves, confusion matrix, and ROC curve of Model B in (a–c), respectively, on eye region features

5.4 Nose region experiment

These experiments evaluated and compared the performance of Model A and Model B in detecting deepfakes in the nose region, assessing their effectiveness in detecting manipulated facial features in this area. Model A was trained for 100 epochs, while Model B was trained for 200 epochs. The results are shown in Table 4.

Table 4 Result of nose region experiments

The experiments provide insights into the effectiveness and capabilities of the models, helping us determine their suitability for deepfake detection tasks in this specific region.

Also, the accuracy and loss curves, confusion matrices, and ROC curves for Models A and B are shown in Figs. 10 and 11, respectively.

Fig. 10 The accuracy and loss curves, confusion matrix, and ROC curve of Model A in (a–c), respectively, on nose region features

Fig. 11 The accuracy and loss curves, confusion matrix, and ROC curve of Model B in (a–c), respectively, on nose region features

5.5 Face region experiments

This experiment aimed to evaluate and compare the performance of deepfake detection models in detecting facial manipulations in the face region, and to assess the effectiveness of various preprocessing techniques, datasets, and training configurations in detecting and classifying deepfakes based on facial features. We conducted six experiments:

  • Experiment 1: trained on the DFDC dataset using the BlazeFace model for video preprocessing, with a learning rate of 0.0001 for 100 epochs.

  • Experiment 2: DFDC dataset with both the BlazeFace and Face Recognition models for video preprocessing, with a learning rate of 0.001 for 50 epochs.

  • Experiment 3: DFDC dataset with the MTCNN model for video preprocessing, with a learning rate of 0.001 for 100 epochs using the Adam optimizer.

  • Experiment 4: FaceForensics++ dataset with the MTCNN and Face Recognition models for video preprocessing, with a learning rate of 0.001 for 50 epochs using the SGD optimizer.

  • Experiment 5: FaceForensics++ dataset with the MTCNN model for video preprocessing and cross-validation, with a learning rate of 0.0001 for 25 epochs.

  • Experiment 6: FaceForensics++ dataset with the MTCNN model for video preprocessing, with a learning rate of 1e-7 for 100 epochs and a batch size of 32, using early stopping to prevent overfitting.

By comparing the results of these experiments, shown in Table 5, we can gain insight into the performance of our deepfake detection models in the face region and the effectiveness of different preprocessing techniques, datasets, and training configurations in detecting and classifying deepfakes based on facial features.

Table 5 Result of face region experiments

Also, the accuracy curve, loss curve, confusion matrix, and classification results of Model C on the face region are shown in Fig. 12.

Fig. 12 The accuracy curve, loss curve, confusion matrix, and classification results of Model C on the face region

5.6 Ensemble model

To enhance the performance of our models, we utilize an ensemble technique, majority voting. To attain the final results, we applied this technique to 100 videos. The confusion matrix and ROC curve results are shown in Fig. 13.

Fig. 13 Majority voting results

6 Discussion

6.1 Overview of the existing deepfake detection techniques

Deepfake detection has been a hot topic in the research community, and several techniques have been proposed. Some existing deepfake detection techniques are based on ML algorithms, such as CNNs, RNNs, and autoencoders. These techniques work by extracting features from the deepfake images or videos and comparing them with the features of the original images or videos. However, these techniques have several limitations, such as the need for large amounts of training data [26, 37], susceptibility to adversarial attacks [14], and the inability to detect unseen deepfakes [1, 36].

6.2 Experiments analysis

In the eye region experiments, Model A achieved a commendable accuracy of 96% after 100 epochs, while Model B reached approximately 97% after 150 epochs [8]. Comparatively, the approach presented in [32] yielded 90% accuracy after 50 epochs, rising to an impressive 98.3% after 200 epochs on the same dataset, FaceForensics++. This suggests that the proposed technique could surpass the 98.3% threshold with a longer training duration, given adequate computational resources.

In the domain of deepfake detection, our investigation revealed no previously described or implemented methodologies centered on the nose region; the proposed technique therefore offers a novel and distinctive approach.

In the broader scope of the face region, running six distinct experiments holds inherent value and should not be regarded as redundant: each experiment varies the datasets, preprocessing models, learning rates, epochs, optimizers, and other relevant parameters. Notably, the final experiment achieved a test accuracy of 85%, effectively curbing overfitting and underfitting while minimizing training loss.

Incorporating multiple subsets of the dataset during the training phase further contributed to the robustness of the outcome. It is essential to acknowledge that an increased allocation of computational resources could allow training on the entire dataset and an even greater variety of subsets, consequently augmenting accuracy. Worth noting is that the methodology employed in [30] attained accuracies of 69% on the FaceForensics++ FaceSwap and 93% on the FaceForensics++ DeepFake collections, underscoring its reliance on individual training for each collection, which hinders accurate detection across collections.

6.3 Advantages of the proposed methodology

Our paper proposes a novel technique for deepfake detection that combines three models based on different features: the entire face, the eyes, and the nose. While combining multiple models only slightly affected overall accuracy, it improves the robustness of deepfake detection, reducing the impact of weaknesses in any single algorithm. Additionally, we develop a customized data processing stage for each model to detect deepfakes with high reliability. Our proposed technique also benefits from the large amount of data used for training, including datasets such as FaceForensics++.

6.4 Comparison with the state-of-the-art methodologies

Our proposed technique has several advantages over existing deepfake detection techniques. For example, the technique proposed in [17] extracts features from the image and then applies a transformer-based model to classify it. Although this technique has shown promising results, it requires a large amount of training data and substantial computational resources. In contrast, our proposed technique combines multiple models that operate on different features, increasing detection accuracy to 85% while reducing dependence on any single algorithm.

Similarly, the technique proposed in [11] is based on a CNN trained specifically on the eye region of the face. Although this technique is effective in detecting deepfakes that involve changes in the eye region, it may be less effective in detecting deepfakes that involve changes in other parts of the face.

A comparison of recent deepfake detection techniques is shown in Table 6. Model C utilizes the CViT algorithm as implemented in [30]. In [30], the CViT algorithm was applied to the entire face region, achieving 69% accuracy on the FaceForensics++ database; when we applied the same algorithm to the entire face, eye, and nose regions instead of the face alone, we achieved 97% accuracy. Similarly, in [32], a CNN using eye-region features achieved 90% accuracy, while applying the same algorithm to the different regions (entire face, eyes, nose) yielded an acceptable level of accuracy. As shown in Table 6, the baseline for comparing our system with other references lies in using the same databases and algorithms; our approach differs by including multiple facial regions (face, eyes, nose) instead of the face alone, leading to higher accuracy.

6.5 Limitations and future work

Our proposed technique has certain limitations, such as the need for high computational resources for training and inference. Additionally, the technique may not be effective in detecting deepfakes that involve changes in parts of the face other than the eyes, nose, and entire face. Future research could focus on developing methods that require less data while maintaining high accuracy rates. We also plan to investigate the use of other features for deepfake detection.

Table 6 Comparison between recent deepfake detection techniques

7 Managerial implications

There are some key managerial implications for this study:

  1. Investment in Advanced AI Technologies: Organizations should consider investing in advanced AI technologies like convolutional vision transformers and convolutional neural networks. This investment is crucial for staying ahead in the rapidly evolving digital landscape, especially for sectors vulnerable to deepfake-related risks like media, politics, and finance.

  2. Training and Skill Development: The adoption of these technologies necessitates specialized skills. Therefore, managers should focus on training programs for their technical staff to handle and utilize these advanced systems effectively.

  3. Enhancing Digital Security Protocols: Integrating these deepfake detection methods into existing digital security protocols can significantly boost an organization's ability to combat misinformation and protect its digital assets.

  4. Ethical and Legal Compliance: Managers must ensure the ethical use of AI for deepfake detection, complying with privacy laws and avoiding biases in AI algorithms.

  5. Strategic Decision-Making: Leveraging insights from these technologies can aid in strategic decision-making, especially in content verification and public relations.

  6. Collaborative Efforts: Collaborating with technology providers, academic researchers, and industry peers can enhance the understanding and efficacy of these deepfake detection methods.

  7. Adaptation to Technological Advancements: Given the rapid advancement in deepfake technologies, organizations must stay updated with the latest developments to ensure their defensive measures remain effective.

These implications underscore the importance of a proactive approach in adopting and integrating advanced AI technologies for digital security and integrity.

8 Conclusion

In this study, we introduced a groundbreaking method for deepfake detection, leveraging a fusion of distinct facial features and a comprehensive dataset enhanced by meticulous preprocessing. Our strategy entailed the development of a composite model integrating three sub-models, each specializing in recognizing deepfakes by analyzing a specific facial element: the entire face, the eyes, or the nose. This multifaceted approach is further strengthened by tailored data processing for each sub-model, circumventing the constraints typically encountered in single-algorithm detection methods. Our training regimen utilized an expansive array of facial images from extensive datasets such as FaceForensics++, which was pivotal in refining our model's ability to discern the visual anomalies indicative of deepfakes.

The empirical evidence from our tests revealed a significant enhancement in accuracy and efficiency over existing deepfake detection methods, thereby establishing the strength of our approach. A standout feature of our method is its robust performance across diverse scenarios, encompassing various environmental conditions and facial orientations, illustrating its practical applicability in real-world settings. This adaptability underscores our model's ability to identify deepfakes of high visual fidelity, an essential attribute in the current digital era.

The implications of our work are far-reaching, addressing the pressing demand for reliable deepfake detection to thwart the proliferation of misinformation and other harmful digital content. Applying our approach has the potential to safeguard individuals, organizations, and society at large from the adverse impacts of deepfakes, thereby contributing significantly to digital security and integrity. Although our results are promising, we recognize the scope for further enhancement: future research could integrate additional facial features or employ alternative datasets to augment the accuracy and operational efficiency of deepfake detection. Such advancements will fortify our method's effectiveness and contribute to the broader field of digital media authenticity.