Introduction

The increasing availability of low-cost digital intelligent devices, such as smartphones, laptops, desktop computers, and digital cameras, has fueled the growth of multimedia content (photos and videos) on the Internet [1,2,3]. Furthermore, the rise of social media in recent decades has enabled individuals to share recorded multimedia material instantly, resulting in a huge increase in multimedia content creation and accessibility [4, 5]. Knowing the truth and trusting information have become increasingly difficult as the speed with which false data can be created and distributed has grown, potentially leading to devastating consequences. A deepfake is content generated by deep learning (DL) [6] that appears authentic to the human eye [7]. The term deepfake combines DL and fake, and it refers to content generated by a deep neural network (DNN), a subset of machine learning (ML) [8, 9]. Altering video and image files has been possible for anyone for some years thanks to a variety of user-friendly software packages for video, audio, and image editing [10]. Media manipulation has become even easier with the widespread use of smartphone apps that perform automated operations such as lip-syncing, audio instrumentation, and face swaps [11, 12]. In addition, DL-driven advancements have produced many artificial intelligence (AI)-driven technologies that make manipulations highly plausible and credible [13]. Each of these approaches can be a beneficial addition to a digital artist’s toolkit [14]. However, when they are used deliberately to produce fake media, they may have serious societal and personal consequences, and deepfakes are a well-known example of intentionally altered media that have raised major concerns. AI-driven deepfake technology substitutes one person’s identity in a video with another [15]. It has frequently been used to disseminate fake information and spread reputation-harming content by impersonating politicians [16].

However, deepfake also refers to a variety of face alteration techniques that use cutting-edge technologies such as computer vision and DL [17]. Full-face synthesis, attribute manipulation, expression swap, and identity swap are the four kinds of face manipulation. One of the most popular forms of deepfake video is the identity swap, often known as a face swap, in which the face of a source person is replaced with the face of a target person [18, 19]. Although certain deepfakes can be created using traditional visual effects or computer graphics, DL techniques, such as auto-encoders and generative adversarial networks (GANs), have been the most common underlying mechanism for deepfake production in recent years [20, 21]. Such models synthesize facial pictures of people with comparable expressions and activities and characterize their facial emotions and motions [22, 23]. These techniques generally require large amounts of audio/video data to train systems to produce realistic images and videos [24]. Celebrities and politicians were the first targets of deepfakes since many of their videos and photos are available on the Internet. Deepfakes are frequently used to replace the faces of characters in pornographic pictures and films with those of celebrities and politicians [25]. When deepfake techniques are used to create fake videos of world leaders with fabricated utterances, they threaten global security [26]. Deepfakes can, therefore, be used to instigate political or religious conflicts between nations, mislead the public and sway election outcomes, or disrupt financial markets by disseminating false information. They may also be employed to create phony satellite pictures of the earth with features that do not match reality, such as a synthetic bridge over a river where none exists, in order to fool military analysts; a force could be deceived when attempting to cross such a bridge during a battle [27, 28].

The rapid rise of multimedia sharing, particularly on social media, raises concerns about data privacy and reliability [29,30,31,32]. With the growth of AI and DL, convincing multimedia manipulations have become easier to produce, leading to concerns about their malicious use. Deepfakes, which cover a range of techniques such as facial synthesis and identity swaps, complicate effective detection [33]. Beyond entertainment, deepfakes pose serious security threats, enabling political impersonation and misinformation dissemination. Federated learning (FL) and blockchain [34] offer promising solutions: FL allows ML models to be trained while preserving data privacy [35,36,37], and blockchain ensures data trustworthiness by authenticating source data and enhancing deepfake detection. FL is a computing paradigm that trains ML models on data from many devices while maintaining data privacy and security; it does not require transmitting data from the user’s device to a central storage location. In other words, FL is an ML paradigm in which several clients collaborate with a central server to solve an ML problem. Each client’s locally stored data is not shared with other clients; instead, the system updates the global model by aggregating the local models trained by each client. FL therefore lets many organizations collaborate and build ML models without having to share data. Blockchain, in turn, provides several trust mechanisms to solve the issue of trust in a decentralized network, and the notion of private data exchange serves as a foundation for source data authentication. Conventional data-sharing tactics that disregard individual privacy raise the danger of data leakage.

In this work, we create a blockchain-based DL technique for aggregating local DL models, which assures data integrity. The suggested blockchain-based FL system learns collaboratively from various sections and video types. First, we use a data normalization approach to standardize data from diverse sources. To detect deepfakes in visual patterns, we utilize DL models: a convolutional neural network (CNN) and the SegCaps method are used to enhance feature extraction from images in the local model [38], and a capsule network (CN) is then trained to enhance generalization. We then introduce a global model and address the privacy challenge via an FL technique. The suggested system gathers data, trains a smart model collectively, and disseminates this model over the public network in a decentralized manner. The weights from many local models are pooled using FL while the sections’ data privacy is preserved; resources/clients only exchange gradients with the blockchain network to maintain privacy. The gradients are pooled via blockchain-based FL, and the updated model is then disseminated to approved clients. The decentralized blockchain architecture for data sharing across many sources securely transmits the data without endangering the sources’ privacy. As a result, we propose a blockchain-based FL with DL technique utilizing transfer learning (TL) for deepfake detection (BFLDL) to combat and detect deepfake videos. In terms of overall performance, the BFLDL system outperforms numerous cutting-edge methods. The major contributions, which underline the key findings and significance of the proposed work, are as follows:

  • Proposing a framework named BFLDL for integrating edge and blockchain technologies into deepfake detection methods to guarantee data protection and accuracy

  • Presenting a data normalization approach to train the FL model using data from various sources correctly

  • Suggesting a lightweight method for deepfake detection that takes advantage of the strong performance of the employed DL models

  • Utilizing TL to improve accuracy and the area under the curve (AUC) while reducing training time

The remainder of this work is organized as follows: the “Related Work” section provides a synopsis of previous studies relevant to the present investigation. The system model is defined in the “System Model” section, and the BFLDL method is described in the “BFLDL System” section. The “Results of Experiments” section presents the findings of the BFLDL approach and a full comparison with current methodologies. Finally, the “Conclusion and Future Work” section summarizes the entire work and concludes.

Related Work

Fake videos spread swiftly, putting national security, the social standing of prominent celebrities, and the safety of eminent political figures in jeopardy. Creating and manipulating fake videos has become a relatively low-cost task due to the wide availability of open-source datasets, significant research advancements in fields such as GANs [39], and significant technological advancements in high-speed computing [40, 41]. A deepfake video might be used to defame a politician and spread false political propaganda by superimposing the politician’s face over the face of a target actor. Kohli and Gupta [18] presented a strategy for exploiting the frequency domain characteristics of face forgery. A frequency CNN (FCNN) is used in their technique to evaluate and categorize clean and counterfeit faces. To assess the efficacy of the FCNN, the FaceForensics++ dataset was employed. The study showed that the FCNN detects forgeries efficiently in realistic circumstances, including high and low video quality. Among all facial modification techniques, the FCNN detects deepfakes with the greatest recall of 0.9256, 0.8639, and 0.8399 for raw, c23, and c40 samples, respectively. Their technique was also tested on the CelebDF (v2) dataset as well as the automated FaceForensics benchmark. The findings demonstrated the usefulness of the suggested approach for detecting face manipulation. Chen and Tan [42] introduced feature transfer, a two-stage deepfake detection approach that relies on unsupervised domain adaptation. The feature vectors derived from a CNN are utilized in backpropagation based on domain-adversarial neural networks (BP-DANN) for adversarial TL, which leads to higher efficiency than end-to-end adversarial learning. A face detection network is first utilized in the preprocessing stage to extract the face region of each video frame, which is then enlarged by 1.2 times to crop and save the face picture. The facial images are then fed into the CNN to obtain 2048-dimensional feature vectors. Furthermore, a feature extraction CNN pre-trained on a big deepfake dataset can be used to extract more transferrable feature vectors, reducing the gap between the source and target domains during unsupervised domain-adaptive training.

Hu et al. [43] worked on compressed deepfake videos with a low quality factor to cater to scenarios commonly seen on social media. In reality, compressed videos are widespread on social media platforms like Instagram, WeChat, and TikTok, so determining ways to detect compressed deepfake videos is a critical challenge. A temporality-level stream is employed to extract the inconsistency between frames and discover the temporal characteristics of compressed videos; the two streams extract the compressed video’s frame-level and temporal-level information. They tested their two-stream technique on the FaceSwap, NeuralTextures, Face2Face, DeepFakes, and CelebDF datasets, and the results outperformed previous work. The accuracy of the cross-compression detection results demonstrated that their approach is robust to compression factors. Also, Caldelli et al. [44] suggested that optical flow field dissimilarities differentiate between deepfake and real videos using a CNN. Their research is based on CNNs trained to detect potential motion dissimilarities in the temporal structure of a video clip using optical flow fields. The test results produced on the FaceForensics++ dataset are intriguing and demonstrate that the method is well suited to extracting distinctive characteristics between fake and real instances, particularly when dealing with the tough cross-forgery scenario. Moreover, they demonstrated how their technique leverages discrepancies on the temporal axis, which improves efficiency when integrated with well-known cutting-edge frame-based approaches.

Liu et al. [45] proposed a lightweight 3D CNN method. Their proposed module extracted higher-level features with fewer parameters. The 3D CNNs were used as a spatial–temporal module to merge spatial information along the time dimension, and their module sought to extract deep-level features from the output of the spatial–temporal module using as few parameters as possible. Results demonstrated that their network has fewer parameters than other networks and outperforms existing cutting-edge deepfake detection techniques on major deepfake datasets. Also, Mitra et al. [46] demonstrated a DL-based method for detecting deepfake videos on social media with excellent accuracy. They classified manipulated footage using a neural network–based technique. A model was developed that consists of a CNN and a classifier network. The CNN modules were chosen from three different structures, InceptionV3, ResNet50, and XceptionNet, and a comparison study was conducted. They analyzed the three existing CNN modules and selected XceptionNet as the feature extractor for the most accurate model, along with the proposed classifier. They used intermediate compression to train the network and achieved good accuracy even in a high-loss environment. Even without training on extremely compressed videos, their approach achieved high accuracy.

Suratkar et al. [47] developed a system for detecting fake videos that used a CNN architecture and TL. Their approach used a CNN to collect features from each video frame to build a binary classifier that can efficiently distinguish between real and altered videos. Their approach was tested on many deepfake videos culled from diverse datasets. The findings demonstrated that using TL to develop a relatively robust model for deepfake detection is feasible. With TL at its core, the employed technique allows models to be trained significantly faster, reducing training time. Their paper aimed for maximal generality by combining datasets collected using various methodologies from various sources. Moreover, Heidari et al. [48] used blockchain-based FL to train a global DL model with data from multiple hospitals, ensuring data integrity and privacy. They addressed data variability, classified lung cancer patients using CapsNets, and developed an anonymous global model training approach with blockchain and FL. Real-world lung cancer data were tested on multiple datasets, achieving an impressive 99.69% accuracy with minimal errors, validating the method’s effectiveness.

In our paper, we address the critical challenge of identifying and mitigating deepfake content. This involves collecting diverse datasets, extensive preprocessing, and leveraging DL models, particularly SegCaps, tailored to the task of multimedia forensics. In contrast, Heidari et al. focus on lung cancer detection, utilizing the FBCLC-Rad technique, with an emphasis on image preprocessing, spectral analysis–based feature extraction, and CapsNets for classification. One notable distinction between the proposed method and Heidari et al.’s approach is the choice of DL architecture: we employ a CNN with a different architecture together with SegCaps to enhance deepfake detection, while Heidari et al. utilized CapsNets for lung cancer diagnosis. These differences underscore how the selection of DL architectures can be domain-specific and tailored to distinct challenges and objectives.

In this study, we propose a novel, comprehensive approach addressing critical gaps in current deepfake detection methods. Unlike prior methods focusing on specific aspects, our approach combines blockchain-based FL for data anonymity, SegCaps, and CNN fusion for robust feature extraction and addresses data heterogeneity using a proposed normalization technique. Additionally, by leveraging TL and preprocessing methods, our approach further enhances DL performance. This approach, supported by blockchain and FL, ensures adaptability to varied video qualities while preserving data privacy. Table 1 summarizes DL-based deepfake video applications.

Table 1 A comparison of the discussed video-deepfake methods and their characteristics

System Model

To address the shortcomings of previous methods, such as high delay, low accuracy, and weak security, we present the BFLDL method (a blockchain-based FL method with a DL technique) for recognizing deepfake videos, which builds an accurate and robust collaboration based on data from multiple resources. The BFLDL system, which is based on blockchain and FL, draws insights collectively from multiple data segments, each containing diverse types of videos with varying quality. To commence, we propose a data normalization process aimed at standardizing videos sourced from numerous origins into a uniform format. Within this standardized data, we employ DL models to identify patterns indicative of deepfakes. Moreover, we leverage the FL methodology to train a global model, addressing security concerns along the way. The core of the BFLDL approach lies in its capacity to compile valuable insights from an array of data segments, collaborate in training an intelligent model, and subsequently distribute this model across a decentralized public network. During this collaborative effort, weights from various local models are amalgamated while safeguarding the data sources’ privacy. It is important to note that clients participating in this process only exchange gradients with the blockchain network to ensure the preservation of their anonymity. The FL process gathers these gradients and then communicates the updated model to trusted sources via blockchain technology. This decentralized blockchain architecture facilitates secure data sharing among multiple sources without compromising the security of the providers’ data.

Because it effectively analyzes data utilizing blockchain and DL models, the BFLDL method is helpful for big data research. Consider a scenario in which sources continually supply updated deepfake samples in real time: the data must be stored on a decentralized network without endangering the privacy of the contents, and knowledge must be securely exchanged to uncover new deepfakes or new information about pictures or videos. FL uses a decentralized network to protect data and spread the training effort so that a better model can be trained on the most up-to-date video data. The BFLDL architecture gathers data from a variety of sources and trains a collaborative DL model, and FL uses the blockchain to integrate all of the independently trained models. Because it has the most up-to-date knowledge of deepfake content, the jointly trained global model offers better and more accurate predictions. We used several sources to share the data in this paper. Each source provides data to train the global model illustrated in Fig. 1. The primary objective of this article is to pool data from many clients and train a DL model cooperatively. Because the data is acquired from many sources, we create a normalization approach to cope with diverse types of video material. After normalizing the data, we segment the images and use the CN to train the model to recognize suspected deepfakes. We use a blockchain-based FL framework to train and distribute a collaborative model, and FL is used to merge the weights of the locally trained models. The global model is returned to the resources or organizations once the locally learned model weights are aggregated.

Fig. 1 The BFLDL scheme’s global system

Ensuring the privacy of our data sources is paramount in our approach. Blockchain technology serves as a protective shield, preventing the leakage of sensitive information. Transactions on the blockchain ledger come in two categories: data sharing and retrieval transactions. We implement a permissioned blockchain system to safeguard data privacy, which diligently records all transactions and enables data access within a secure global framework. The second objective is to foster data collaboration. In pursuit of this, we utilize local model weights to enhance the accuracy of the local model iteratively. Additionally, we employ spatial normalization techniques to handle diverse or imbalanced data, developing a more precise local model. The BFLDL model comprises two integral components: a local model and a blockchain-based FL system. The discussion commences with the challenge of identifying fake videos, and ultimately, in the process of training the global model, we facilitate the transfer of local model weights via the blockchain network.

Normalization of Data

FL presents a considerable challenge in dealing with input data from several sources and devices with varied configurations, and the majority of current FL techniques are ineffective in dealing with this problem. To address this issue, we provide a normalization method that takes into account various types of video quality and brings them all to the same level [49]. Thus, BFLDL can handle the dataset’s diversity and train a better learning model as a consequence of this normalizing process. The normalizing method has two aspects: spatial normalization and signal normalization. Spatial normalization deals with the dimensions and resolution of the videos, while signal normalization deals with the intensity of each pixel in the images. As previously stated, high-resolution and low-resolution videos have different qualities, and FL is utilized in the method to combine data from multiple sources. We use a standardized size of \(299 \times 299 \times 3\) and resample videos to this resolution using Lanczos interpolation. Because the material collected from multiple sources varies in character, we also apply the signal normalization approach. This strategy aids the model’s convergence by reducing bias and establishing an equitable distribution across the dataset’s contents. The pixel intensity is denoted by \(q\); in Eq. (1), \({q}_{{\text{min}}}\) and \({q}_{{\text{max}}}\) are the input file’s minimum and maximum intensity values.

$$q_{\text{norm}}=\frac{q-q_{\text{min}}}{q_{\text{max}}-q_{\text{min}}}$$
(1)

Furthermore, videos differ from images. A 4-s video dataset is created using original and edited video clips. Frames are then extracted directly from each compressed video. The faces are then identified and extracted from each frame. Finally, all frames are normalized and scaled as BFLDL input, and the image size is kept constant.
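As a minimal illustration of the normalization pipeline described above (not the exact implementation used in this work), the following Python sketch performs spatial normalization with Lanczos interpolation followed by the min-max signal normalization of Eq. (1); it assumes OpenCV is available, and the function names are illustrative.

```python
import cv2
import numpy as np

TARGET_SIZE = (299, 299)  # standardized spatial resolution used by BFLDL

def spatial_normalize(frame: np.ndarray) -> np.ndarray:
    """Resize a BGR frame to 299x299x3 using Lanczos interpolation."""
    return cv2.resize(frame, TARGET_SIZE, interpolation=cv2.INTER_LANCZOS4)

def signal_normalize(frame: np.ndarray) -> np.ndarray:
    """Min-max scale pixel intensities to [0, 1] as in Eq. (1)."""
    frame = frame.astype(np.float32)
    q_min, q_max = frame.min(), frame.max()
    if q_max == q_min:              # guard against constant frames
        return np.zeros_like(frame)
    return (frame - q_min) / (q_max - q_min)

def normalize_frame(frame: np.ndarray) -> np.ndarray:
    """Full normalization pipeline: spatial first, then signal."""
    return signal_normalize(spatial_normalize(frame))
```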

Classification Model and Segmentation

Segmentation and classification are discussed in this section. The segmented images are used to classify deepfakes using the CN. For segmentation, the BFLDL approach uses 2D slices as input and employs a standardized volume. Each 3D volume has three planes: XY, XZ, and YZ. To recognize deepfake content, we establish the XZ or YX planes. This form of segmentation could also be effective for detecting manipulated medical radiography images. In a DL architecture, a feature extraction pipeline is typically used to estimate and extract prominent properties. The retrieved features are used to train a multi-layer perceptron (MLP) to learn the correct class. This training begins with preparing a labeled dataset containing input data (features) and corresponding target labels, typically denoting whether an image is a deepfake or not. The data is passed through the MLP during training, which calculates an error by comparing its predictions to the true labels. This error is then propagated backward through the network via backpropagation, allowing the network’s internal weights and biases to be adjusted using optimization algorithms. This iterative training, spanning multiple epochs, refines the MLP’s accuracy over time. Subsequently, the MLP’s performance is assessed on separate validation and test datasets, ensuring that it generalizes effectively to new, unseen data. This training process equips the MLP with the knowledge and parameters required to make accurate predictions on novel input data, with its efficacy hinging on meticulous data preparation and the continual refinement of model parameters. The structure of the modified CN is comparable to Hinton’s CN. The convolutional layer, the hidden layer, the PrimaryCaps layer, and the DigitCaps layer are the four layers that make up the CN [50].
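The following Keras sketch illustrates how an MLP classifier of the kind described above could be trained with backpropagation on extracted feature vectors; the layer sizes, optimizer settings, and the feature dimension used in the usage comment are illustrative assumptions rather than the paper’s exact configuration.

```python
import tensorflow as tf

def build_mlp(feature_dim: int) -> tf.keras.Model:
    """Small MLP that maps an extracted feature vector to a real/fake score."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(feature_dim,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # deepfake probability
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
    return model

# Backpropagation over multiple epochs with a held-out validation split
# (feature_dim=518 is only an illustrative value):
# mlp = build_mlp(feature_dim=518)
# mlp.fit(train_features, train_labels, validation_split=0.1, epochs=20)
```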

A capsule is created when the input characteristics reach the lowest layer. There are several capsules in each layer of the CN. The activation layer computes the length of the capsule outputs to recompute the feature component scores during training and to represent the instantiation parameters of entities. The capsule takes on the role of a neuron in this situation. CNs describe an input at the component level and assign a vector to each component. This vector’s length indicates the probability of a component’s existence, and capsules replace max pooling with “routing by agreement.” Because capsules are self-contained, the likelihood of correct classification improves when numerous capsules agree on the same criterion. Each component can be represented by a pose vector \(v_{i}\) that is rotated and translated by a weight matrix \({W}_{i,j}\) into a prediction vector \(v_{i|j}\). The prediction vector is calculated using Eq. (2):

$$v_{i|j}=W_{i,j}\cdot v_{i}$$
(2)

Algorithm 1 The BFLDL method

With \({c}_{i,j}\) as a coupling coefficient, the next higher-level capsule, \({s}_{u}\), processes the sum of predictions from all lower-level capsules, so the capsule output \({s}_{u}\) can be written as

$$s_{u}=\sum_{i} c_{i,j}\, v_{i|j}$$
(3)

where \({c}_{i,j}\) is a coupling coefficient given by the softmax routing function:

$$c_{i,j}=\frac{q_{i,j}}{\sum_{k} q_{i,k}}$$
(4)

The coupling parameter \(c\) is an important one. The output vector lengths are scaled between 0 and 1 using the squashing function applied to the input \(z\), which can be written as

$$S(z)=\frac{\left\|z\right\|^{2}}{1+\left\|z\right\|^{2}}\,\frac{z}{\left\|z\right\|}$$
(5)

Similar to a softmax, the coupling coefficients \({c}_{i,j}\) form an array that is determined by agreement via dynamic routing [51]. The basic principle behind this method is that the distribution of each low-level capsule’s output to the high-level capsules is progressively modified based on the high-level capsules’ outputs over several rounds until an optimal distribution is reached [52]. Algorithm 1 summarizes the BFLDL model.
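As an illustration of Eqs. (2)–(5), the following NumPy sketch implements the squashing function and the routing-by-agreement loop; the prediction vectors are assumed to be precomputed via Eq. (2), the coupling coefficients are obtained with a standard softmax as the text describes, and the number of routing iterations is an assumption.

```python
import numpy as np

def squash(z: np.ndarray, axis: int = -1, eps: float = 1e-8) -> np.ndarray:
    """Eq. (5): scale a capsule vector so its length lies in (0, 1)."""
    norm2 = np.sum(z ** 2, axis=axis, keepdims=True)
    norm = np.sqrt(norm2 + eps)
    return (norm2 / (1.0 + norm2)) * (z / norm)

def dynamic_routing(predictions: np.ndarray, iterations: int = 3) -> np.ndarray:
    """Routing by agreement.

    predictions: prediction vectors v_{i|j} of shape (n_lower, n_upper, dim),
                 i.e. Eq. (2) already applied to the lower-level capsules.
    """
    n_lower, n_upper, _ = predictions.shape
    logits = np.zeros((n_lower, n_upper))                 # q_{i,j} in Eq. (4)
    for _ in range(iterations):
        coupling = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        # Eq. (3): weighted sum of lower-level predictions per upper capsule
        s = np.einsum("ij,ijd->jd", coupling, predictions)
        v = squash(s)                                     # Eq. (5)
        # agreement between predictions and outputs updates the routing logits
        logits = logits + np.einsum("ijd,jd->ij", predictions, v)
    return v
```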

Each source provides data to train the global model illustrated in Fig. 1. The primary objective of this article is to pool data from many clients in a secure way and train a DL model cooperatively. In FL, multiple clients, each with its own data, train DL models. Instead of sharing their data with a central server, the clients train local models on their data and submit model updates to the server. The server collects all client model updates and uses them to update a global model. This is often an iterative process: the server transmits the most recent version of the global model to the clients, the clients train local models on their data and send model updates back to the server, and the server aggregates the updates and refreshes the global model. This process is repeated until the global model converges or another stopping criterion is reached. Typically, the server is in charge of managing the FL process and gathering model changes from clients. The clients, in turn, play a crucial role because they are responsible for training local models on their data and communicating model changes to the server. Overall, FL enables decentralized training of DL models, eliminating the need for all data to be centralized or shared across clients. This is beneficial when data is dispersed across numerous parties, such as a chain of hospitals or other institutions.
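The sketch below illustrates one aggregation round of the process just described; the sample-count weighting (FedAvg-style averaging) is an assumption about the exact aggregation rule, since the text only states that local model weights are pooled.

```python
from typing import List
import numpy as np

def aggregate_weights(client_weights: List[List[np.ndarray]],
                      client_sizes: List[int]) -> List[np.ndarray]:
    """Aggregate per-layer weight arrays from several clients into a global model.

    client_weights[k] is the list of layer arrays from client k,
    client_sizes[k] is the number of local training samples at client k.
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    global_weights = []
    for layer in range(n_layers):
        layer_sum = sum(w[layer] * (n / total)
                        for w, n in zip(client_weights, client_sizes))
        global_weights.append(layer_sum)
    return global_weights

# One federated round: the server distributes the global weights, clients train
# locally and return updated weights, the server aggregates them, and the new
# global model is redistributed until convergence.
```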

In general, a model trained on a large dataset often outperforms a model trained on a small dataset because it has more instances to learn from and can capture more patterns and nuances in the data [53, 54]. However, there are a few ways in which FL can still be beneficial, even when the training data is small. First, FL lets numerous clients contribute their data and computing resources to the training process, which helps scale up the amount of data on which the model is trained; this is extremely helpful when the data is spread across different parties. Second, because the data remains decentralized and is not shared among clients or with a central server, FL can help increase the privacy and security of the training process. This is useful when the data is sensitive or restricted or when clients are unwilling or unable to provide their data. Finally, FL can help increase the model’s generalization capacity by allowing it to learn from a wide variety of data sources and adapt to different distributions and biases in the data. This can be particularly useful when the data is not very diverse or is biased in some way. Overall, while the performance of an ML model is typically related to the amount of data it is trained on, FL can still provide benefits in terms of scalability, privacy, and generalization, even when the training data is not particularly large.

Feature Extraction

Two separate strategies were used to extract features from the faces detected by YOLO: a texture-based analysis tool and a SegCaps-CNN-based method. The integration of both texture-based analysis and the SegCaps-CNN-based method for feature extraction in the YOLO-based face recognition system stems from their complementary strengths. Texture-based analysis excels at capturing fine facial details, including wrinkles and pores, which are essential for recognizing unique characteristics, while the SegCaps-CNN-based method captures structural information and spatial relationships in facial images, which is crucial for handling pose and expression variations. This combined approach ensures a comprehensive feature set, and empirical evidence supports this choice, demonstrating superior performance compared to using either method alone. Additionally, we introduced preprocessing to optimize CNN learning and applied data augmentation techniques, such as random rotation, horizontal and vertical flipping, and color changes, to enhance robustness to diverse conditions and appearances, resulting in a more accurate and resilient face recognition system. Incoming faces must be preprocessed to train the high-resolution net (HRNet) and provide varied examples to the CNN, which speeds up the learning process. The detected faces are normalized and scaled before being randomly cropped to \(299\times 299\), allowing the model to detect altered faces even if the alteration affects only a small portion of the face. Second, each image is augmented at random with rotation, horizontal and vertical flipping, and color changes.
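A sketch of the preprocessing and augmentation steps listed above (random \(299\times 299\) crop, horizontal/vertical flips, rotation, and color changes) using TensorFlow image ops is given below; the parameter ranges and the 90-degree rotation stand-in are illustrative assumptions.

```python
import tensorflow as tf

def augment_face(face: tf.Tensor) -> tf.Tensor:
    """face: float32 image tensor in [0, 1], slightly larger than 299x299."""
    face = tf.image.random_crop(face, size=(299, 299, 3))
    face = tf.image.random_flip_left_right(face)
    face = tf.image.random_flip_up_down(face)
    # random 0/90/180/270 degree rotation as a simple stand-in for random rotation
    face = tf.image.rot90(face, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    face = tf.image.random_brightness(face, max_delta=0.1)      # color changes
    face = tf.image.random_saturation(face, lower=0.9, upper=1.1)
    return tf.clip_by_value(face, 0.0, 1.0)
```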

The original HRNet features a pooling layer at the end; since the basic idea of the capsule system is to avoid losing spatial information, the HRNet’s pooling layer is eliminated and the raw feature maps produced by the HRNet are used, so that the two ideas fit together best. By linking high-to-low-resolution convolutions in parallel and constructing stable, high-resolution representations while executing frequent fusions across parallel convolutions, HRNet maintains high-resolution representations throughout the feature extraction process. The upsampled representations of all parallel convolutions are mixed to create the output feature maps. The HRNet transforms a \(299\times 299\times 3\) face into a \(64\times 56\times 56\) output shape by passing it through two \(3\times 3\) convolutions with a stride of 2. It is divided into phases, each of which is a subnetwork with several parallel branches. Every branch has half the resolution and twice the number of channels of the one before it; for example, if the last branch’s resolution is \(C\), the branch resolutions are \(8C\), \(4C\), \(2C\), and \(C\). Two residual blocks, each with two \(3\times 3\) convolutions, make up one branch.

The first phase generates a \(16\times 56\times 56\) feature vector with a high-resolution subnet comprising one residual block with four distinct-size convolutions. The next three stages add high-to-low-resolution subnetworks one at a time. Stage 2 has branches representing two different resolutions: the first propagates the high-resolution feature vector from stage 1 to the end of the stages, whereas the second applies a \(3\times 3\) convolution with stride 2 to obtain a downsampled feature vector of size \(32\times 28 \times 28\) that also propagates to the end. A fusion layer merges the distinct feature vectors from the parallel branches between each successive stage. Because all feature maps should have the same size, the different resolutions are downsampled using strided convolution or upsampled using simple nearest-neighbor sampling. The output of the first three branches is fed through the residual block that was used to limit the number of channels in stage 1 in order to add the feature maps and obtain the \(512\times 14\times 14\) output of the network in stage 4.

Texture analysis is performed by extracting the local binary patterns (LBPs) for each channel of two distinct color spaces (HSV and YCbCr) and creating a histogram for each LBP. The histograms are then scaled and combined with the HRNet feature maps. The processes for creating LBP histograms are shown in Fig. 2. For color texture analysis, the luminance and chrominance elements of the detected faces are obtained by converting them to the HSV and YCbCr color spaces. Due to its inadequate separation of chrominance and luminance information and the significant correlation between its color components, the RGB color space was not used in this investigation.

Fig. 2 Block diagram for texture analysis

In the concatenation step of the feature extraction process, the HRNet characteristics are merged with those created by the LBP. The \(512\times 14\times 14\) feature vector is combined with the \(6\times 14\times 14\) feature vector obtained from the six histograms, so a \(518\times 14\times 14\) feature vector is created during the feature extraction stage, which is used for training and adjusting the capsule network’s weights. Our proposed method is designed with efficiency in mind at various levels. It utilizes CN-based segmentation and classification for deepfake detection, leveraging the efficiency of CNNs, which are known for their low computational overhead [55]. The use of 2D slices for segmentation minimizes computational demands compared with 3D approaches. Selective plane establishment focuses on the relevant planes (\(XZ\) or \(YX\)), avoiding unnecessary computations. Meticulous data preparation enhances training efficiency, while the CN reduces computational costs. The “routing by agreement” mechanism in capsule networks and the adoption of FL ensure efficient feature extraction and decentralized data handling, respectively, reducing overhead.
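The following sketch outlines the color-texture branch and the concatenation step: per-channel LBP histograms in the HSV and YCbCr spaces are scaled and fused with the HRNet maps. The LBP radius, number of points, histogram length, and the assumption that the six histograms have already been mapped to \(6\times 14\times 14\) maps are illustrative.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histograms(face_bgr: np.ndarray, n_points: int = 8, radius: int = 1) -> np.ndarray:
    """Return one scaled LBP histogram per channel of HSV and YCbCr (6 in total)."""
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    ycbcr = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2YCrCb)
    histograms = []
    for space in (hsv, ycbcr):
        for c in range(3):
            lbp = local_binary_pattern(space[:, :, c], n_points, radius, method="uniform")
            hist, _ = np.histogram(lbp, bins=n_points + 2, range=(0, n_points + 2))
            hist = hist.astype(np.float32)
            histograms.append(hist / (hist.sum() + 1e-8))   # scale each histogram
    return np.stack(histograms)                             # shape (6, n_bins)

def fuse_features(hrnet_maps: np.ndarray, lbp_maps: np.ndarray) -> np.ndarray:
    """Concatenate 512x14x14 HRNet maps with 6x14x14 texture maps -> 518x14x14.

    Assumes the six histograms have already been mapped/tiled to 14x14 maps,
    matching the shapes stated in the text.
    """
    return np.concatenate([hrnet_maps, lbp_maps], axis=0)
```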

BFLDL System

In this section, we consider a decentralized data-sharing platform with many resources. The BFLDL technique helps conceal user data and distributes the model over a decentralized network, since each source is ready to contribute its locally learned model weights. FL may also be used to integrate the net effect of several distinct systems from various sources. The primary objective is to use FL to transmit data between sources while securely protecting resource privacy.

Fast and Effective Federated Learning on the Blockchain

Deepfake content requires a large amount of storage. Because of its limited storage space, storing data directly on the blockchain is monetarily and computationally costly. Consequently, the original input videos are kept by the sources, while the blockchain aids in retrieving the trained model. When a new source transmits data, the block records a transaction to validate the data’s owner. The sources include details such as the type of data and its volume. The BFLDL idea was created to fulfill the need to retrieve data shared by many parties: multiple data sources can work together to share information and train a collaborative model that can produce the best predictions, and the retrieval approach does not violate their privacy. We provide a multi-organization blockchain architecture. All sources in \(S\) are partitioned, and they share data for various categories. Each category has its own community, which is in charge of keeping a \(\log(n)\) routing table up to date. The blockchain stores the unique ID of each source. Equation (6) expresses the retrieval of data from physically existing nodes; it is used to determine the distance between two nodes, where \(S\) indicates the data categories used to collect data from the sources. The weight term for the nodes \({s}_{i}\) and \({s}_{j}\) is \(\left(x\cdot \frac{{s}_{i}}{pq}+\frac{{s}_{j}}{pq}\right)\), and data is retrieved by measuring the distance \({d}_{i}({s}_{i},{s}_{j})\) between two nodes. Each source creates its unique ID based on this logic and the distance between the nodes [56].

$$d_{i}\left(s_{i},s_{j}\right)=\frac{\sum_{p,q\in \left(s_{i}\cup s_{j}\right)-\left(s_{i}\cap s_{j}\right)}\left(x\cdot \frac{s_{i}}{pq}+\frac{s_{j}}{pq}\right)}{\sum_{p,q\in s_{i}\cup s_{j}}\left(x\cdot \frac{s_{i}}{pq}+\frac{s_{j}}{pq}\right)}\cdot \log\left(d_{p}(s_{i},s_{j})\right)$$
(6)

The nodes \({s}_{i}\) and \({s}_{j}\) in Eq. (7) have unique IDs \({s}_{i}(\mathrm{id})\) and \({s}_{j}(\mathrm{id})\).

$$d\left(s_{i},s_{j}\right)=s_{i}(\mathrm{id})\oplus s_{j}(\mathrm{id})$$
(7)
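As a small illustration of Eq. (7), the sketch below derives node IDs from hashed source identifiers and computes their XOR distance; the SHA-1-based hashing scheme and the function names are assumptions, not part of the original design.

```python
import hashlib

def node_id(source_name: str, bits: int = 160) -> int:
    """Derive a fixed-length unique ID for a source (assumption: SHA-1 based)."""
    digest = hashlib.sha1(source_name.encode("utf-8")).digest()   # 160-bit digest
    return int.from_bytes(digest, "big") >> (160 - bits)

def xor_distance(id_i: int, id_j: int) -> int:
    """Eq. (7): the distance between two nodes is the XOR of their IDs."""
    return id_i ^ id_j

# Example: nodes with a smaller XOR distance fall into the same retrieval community.
# d = xor_distance(node_id("source_A"), node_id("source_B"))
```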

The randomized mechanism for two source nodes is presented in Eq. (8) to ensure information privacy in a decentralized way, where \(Y\) and \(Y'\) are neighboring data records and \(O\) denotes the set of outcomes. The data is kept private because the probability that \(A(Y)\) falls in \(O\) is bounded relative to that of \(A(Y')\).

$$s_{r}\left[A(Y)\in O\right]\le \exp(\varepsilon)\cdot s_{r}\left[A(Y')\in O\right]$$
(8)

Laplace noise is then added during local model training (\({m}_{i}\)) to achieve data privacy for the many sources:

$$M_{i}=m_{i}+\mathrm{Laplace}(k/\varepsilon)$$
(9)

where \(k\) denotes the sensitivity, as defined by Eq. (10):

$$k=\max_{s,s'}\left|f(s)-f(s')\right|$$
(10)
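A sketch of the Laplace mechanism of Eqs. (9) and (10) applied to local model weights before sharing is shown below; the choice of \(\varepsilon\) and the way the sensitivity \(k\) is estimated in practice are assumptions.

```python
import numpy as np

def laplace_perturb(weights, sensitivity: float, epsilon: float):
    """Add Laplace(k / epsilon) noise to each weight array (Eq. (9))."""
    scale = sensitivity / epsilon
    return [w + np.random.laplace(loc=0.0, scale=scale, size=w.shape)
            for w in weights]

# The sensitivity k (Eq. (10)) bounds how much one record can change the model;
# in practice it is often enforced by clipping the norm of the local update first.
```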

With the aid of the local models, the consensus method is utilized to train the global model. We adopt proof of work (PoW) to allow data to be exchanged among nodes, since all nodes cooperate in training the model. Throughout the training stage, the consensus method analyzes the quality of the local models, and accuracy is quantified using the mean absolute error (MAE). The prediction is denoted by \(f({x}_{i})\), the ground-truth data by \({w}_{i}\), and the locally trained model by \({m}_{i}\); a low MAE for \({m}_{i}\) indicates high accuracy. Equations (11) and (12) reflect the consensus mechanism, which acts as a voting method among the sources: Eq. (11) gives the MAE of the locally trained model \({m}_{i}\), and Eq. (12) gives the quality score used for the global model weights.

$$\mathrm{MAE}\left(m_{i}\right)=\frac{1}{n}\sum \left|w_{i}-f(x_{i})\right|$$
(11)
$$\mathrm{MAE}\left(S_{j}\right)=\gamma \cdot \mathrm{MAE}\left(m_{j}\right)\cdot \frac{1}{n}\sum \mathrm{MAE}\left(m_{i}\right)$$
(12)

All information is encrypted and signed using public and private keys (\({{\text{PK}}}_{i}\), \({{\text{SK}}}_{i}\)) to safeguard the confidentiality of the data sources. For each model transaction, the quality score \(\mathrm{MAE}(S_{j})\) is computed and broadcast. A record is added to the distributed ledger if all transactions are approved. The training steps of the consensus algorithm are as follows: first, the node \({S}_{i}\) transmits the local model \({m}_{i}\) transaction to the node \({S}_{j}\). Second, the node \({S}_{j}\) delivers the local model \({m}_{i}\) to the leader. Third, the leader broadcasts the block to \({S}_{i}\) and \({S}_{j}\). Fourth, \({S}_{i}\) and \({S}_{j}\) verify the block and wait for approval. Finally, the blocks are saved in the blockchain retrieval database.
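The following sketch illustrates the MAE-based quality check used in the consensus step (Eqs. (11) and (12)); the acceptance rule hinted at in the closing comment and the value of \(\gamma\) are assumptions.

```python
import numpy as np

def mae(model_predict, x: np.ndarray, w: np.ndarray) -> float:
    """Eq. (11): mean absolute error of a locally trained model on validation data."""
    return float(np.mean(np.abs(w - model_predict(x))))

def consensus_score(local_mae: float, peer_maes: list, gamma: float = 0.5) -> float:
    """Eq. (12): combine a node's own MAE with the average MAE reported by peers."""
    return gamma * local_mae * (sum(peer_maes) / len(peer_maes))

# A local model update is voted into the block only if its score is competitive
# with those of the other sources; otherwise it is rejected in the consensus round.
```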

Data-Sharing Process

Encryption is central to today’s data privacy and security domains [57]. Because of specific security concerns, it is risky for data providers to offer personal data. In today’s data privacy and security landscape, concerns arise from the need to safeguard sensitive personal information against data breaches and privacy violations, comply with strict regulatory standards, and maintain trust with data owners. Sharing personal data directly can expose it to malicious actors, making encryption a vital protective measure. Additionally, techniques such as sharing model weights or summaries help minimize data exposure, reduce the attack surface, and enhance control over shared information, addressing these concerns while still providing valuable insights [58]. A straightforward approach is to give the requester sufficient details while respecting the data owners’ privacy. Data providers, such as resources/clients, communicate only the locally learned model weights to the requester instead of providing the real data. The consensus mechanism learns from the FL data as the nodes interact with one another. The provider and requester search for and store data on blockchain nodes. Data from several global resources is safely downloaded to integrate the blockchain with FL, enabling accurate prediction. We provide the trained model rather than the raw image data to safeguard the privacy of the data. The BFLDL architecture aims to use locally learned models to train the global model: we select the training data in the first stage and then utilize the private FL method for collaborative multi-source learning in the second stage. In other words, each source uploads its locally trained model weights to the blockchain system, and FL combines the local and global models.

The proposed deepfake detection model strikes a crucial balance between security and efficiency: robust protection against malicious manipulations combined with practical applicability. On the security front, the framework leverages CN-based segmentation and classification techniques, ensuring that the subtle anomalies inherent in deepfake videos are accurately identified. At the same time, the importance of efficiency is recognized. The model’s computational overhead is evaluated extensively, with information about hardware and software configurations, execution times, and resource utilization metrics. This evaluation reflects our commitment to delivering an efficient solution that can be integrated seamlessly into real-world applications, such as video hosting platforms and authentication systems. We also examine the trade-offs between security and efficiency, offering insights into parameter tuning and optimization strategies for specific use cases. Furthermore, potential avenues for future research are discussed, including lightweight models, hardware acceleration techniques, and adaptive strategies that dynamically adjust processing intensity to meet varying security requirements. In sum, the proposed model delivers a versatile and practical solution for deepfake detection, embodying a holistic approach that carefully weighs the twin considerations of security and efficiency.

Node Selection Using an FL Model

The dataset, compiled from various sources, is diverse. As a result, the global aggregation model requires a quick and efficient aggregation approach. We pick trained models from a subset of nodes, i.e., \({S}_{p}\subseteq S\), to optimize the accuracy of the aggregated global model. We define a selection vector \(\uppsi ={[\uppsi }_{t}]\) over time slots for source state selection and discuss the node selection problem: a source is picked in slot \(t\) if \(\uppsi_{t}=1\); otherwise, it is not. We define cost metrics for the node selection operation. The computation cost of training the local capsule network model of source \(i\) in time slot \(t\) is

$$c(i)=f\left(\omega_{i},q_{i}\right)=q_{i}\cdot \frac{S_{m}}{\omega_{i}(t)}$$
(13)

where \({q}_{i}\) is the training data from source \(i\) and \({S}_{m}\) is the number of CPU cycles required to train the model \(m\). The communication cost is calculated as Eq. (14), where the trained model’s size is \(\mu_{i}\) and the period is \(t\). The overall time cost is described by Eq. (15).

$$c_{c}(i)=f\left(\beta_{i},\mu_{i},t\right)=\frac{\mu_{i}}{t}$$
(14)
$$c_{\mathrm{time}}(t)=\max_{i}\left(c(i)+c_{c}(i)\right)$$
(15)

Alternatively, the average time cost over the selected nodes is defined as follows:

$$c_{\mathrm{time}}(t)=\frac{1}{|S_{p}|}\sum_{i\in S_{p}}\left(c(i)+c_{c}(i)\right)$$
(16)

For time slot \(t\), the accuracy loss of the learned model is determined as

$$c_{q}(t)=\sum_{i}\sigma\left(\mu_{t},d_{i}\right)$$
(17)

Here, the aggregated model is \({\mu }_{t}\), and \(\sigma(\cdot)\) is the loss function. The training data from source \(i\) consists of image-label pairs \({d}_{i}=({x}_{j},{y}_{j})\). The quality of the trained model is evaluated for each source in our system. The overall cost of FL in a time slot is then calculated as follows:

$$c\left(\psi(t)\right)=c_{q}(t)+c_{\mathrm{time}}(t)$$
(18)

We have thus carefully defined a set of cost metrics. These metrics include \(c\left(i\right)\) and \({c}_{c}\left(i\right)\), quantifying the resource usage for local model training and the communication cost of the participating nodes. To derive these metrics, we account for training data quality, model size, CPU cycle requirements, and communication expenses. Additionally, \({c}_{{\text{time}}}(t)\) is introduced to measure the overall time cost, considering the slowest node’s impact, and Eq. (16) offers an alternative (average) perspective on \({c}_{{\text{time}}}\left(t\right)\). Model accuracy is a priority and is assessed through \({c}_{q}\left(t\right)\), which gauges the accuracy loss over time. These diverse cost metrics are integrated into \(c\left(\uppsi \left({\text{t}}\right)\right)\), ultimately providing a comprehensive cost assessment for node selection in FL; the selection vector \(\uppsi \left({\text{t}}\right)\) plays a pivotal role in this process. This approach offers valuable insights for optimizing FL systems across various datasets and computational resources.
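As an illustration of the cost model in Eqs. (13)-(18), the sketch below computes the per-node computation and communication costs and the overall selection cost; the dictionary-based node description and the separately supplied accuracy-loss value are illustrative assumptions.

```python
def computation_cost(q_i: float, s_m: float, omega_i: float) -> float:
    """Eq. (13): CPU cycles needed for the local data divided by available capacity."""
    return q_i * s_m / omega_i

def communication_cost(mu_i: float, t: float) -> float:
    """Eq. (14): cost of transmitting a model of size mu_i within time slot t."""
    return mu_i / t

def total_cost(selected, accuracy_loss: float, t: float) -> float:
    """Eqs. (16)-(18): average per-node cost plus the accuracy loss of the round.

    selected: list of dicts with keys 'q', 's_m', 'omega', and 'mu' per chosen node.
    """
    time_cost = sum(computation_cost(n["q"], n["s_m"], n["omega"]) +
                    communication_cost(n["mu"], t) for n in selected) / len(selected)
    return accuracy_loss + time_cost
```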

Transfer Learning

TL approaches employ complex pre-trained models that are learned from a large number of data sources, such as the 1000 categories of ImageNet, and are then “transferred” to a relatively simple task with a small volume of data. Assuming that the source data \({G}_{x}\) denotes ImageNet, the source labels \({L}_{x}\) denote the 1000-category labeling, and \({S}_{o}\) denotes the source objective-predictive function (including the classifier), the source domain knowledge is the triple \(T=\{{G}_{x}, {L}_{x}, {S}_{o} \}\). The target data \({G}_{s}\) represents the training set, \({L}_{s}\) represents the two-class labeling (deepfake, non-deepfake), and \({S}_{s}\) denotes the classifier to be produced, so the target domain is \(D_{s}=\{{G}_{s}, {L}_{s}, {S}_{s} \}\). With TL, the classifier to be built can be described as \({S}_{s}({G}_{s}, {L}_{s}\mid T)\); without TL, the classifier is described as \({S}_{s}({G}_{s}, {L}_{s})\) [25].

$$S_s=\begin{cases}S_s\,(G_s,L_s\mid T)=S_s\,(G_s,L_s\mid G_x,L_x,S_o) & \text{using TL}\\ S_s\,(G_s,L_s) & \text{not using TL}\end{cases}$$
(19)

We may therefore conclude that \({S}_{s}({G}_{s}, {L}_{s}\mid T)\) is expected to be considerably closer to the ideal classifier than the classifier that uses only the target domain, \({S}_{s}({G}_{s}, {L}_{s})\), assuming a significant number of samples \(G\) and labels \(L\). Thus, \({\text{err}}[{S}_{s}({G}_{s}, {L}_{s}\mid T)(G),L] < {\text{err}}[{S}_{s}({G}_{s}, {L}_{s})(G),L]\). The error function \({\text{err}}(a,b)\) computes the difference between its two inputs, \(a\) and \(b\). In the context of TL, this error function plays a crucial role in quantifying how far a classifier is from correct behavior: here, “\(a\)” represents a classifier’s predictions on the data \(G\), while “\(b\)” represents the true labels \(L\), so the error measures the discrepancy between predictions and ground truth. The comparison \({\text{err}}[{S}_{s}({G}_{s}, {L}_{s}\mid T)(G),L] < {\text{err}}[{S}_{s}({G}_{s}, {L}_{s})(G),L]\) illustrates this concept: a classifier built using transfer learning, denoted \({S}_{s}({G}_{s}, {L}_{s}\mid T)\), is expected to yield a lower error on data \(G\) with labels \(L\) in the target domain than a classifier \({S}_{s}({G}_{s}, {L}_{s})\) trained without transfer learning. This finding underscores the performance enhancement that transfer learning can offer by leveraging pre-trained models and adapting them to specific tasks, and it avoids the cost of building and training a network from scratch. The pre-trained model (PTM) also helps the user avoid extensive hyper-parameter tuning. Likewise, the PTM’s initial layers can be viewed as feature descriptors that extract low-level features, including shades, blobs, edges, textures, and tints, so the target model may only need to re-train the PTM’s last few layers, since we believe the last few layers carry out the more complicated, task-specific work.
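A Keras sketch of the TL setup described above is given below: a backbone pre-trained on ImageNet is frozen so that its early layers act as feature descriptors, and only the final layers are re-trained for the two-class (deepfake/non-deepfake) task. The choice of Xception as the backbone and the hyper-parameters are illustrative assumptions.

```python
import tensorflow as tf

def build_tl_classifier() -> tf.keras.Model:
    """Re-use ImageNet features (source domain) for the deepfake task (target domain)."""
    base = tf.keras.applications.Xception(
        include_top=False, weights="imagenet",
        input_shape=(299, 299, 3), pooling="avg")
    base.trainable = False                       # early layers act as feature descriptors
    x = tf.keras.layers.Dense(128, activation="relu")(base.output)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)   # deepfake probability
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
    return model
```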

Results of Experiments

Experimental studies and additional evaluations are discussed in this section. First, we present the datasets and the implementation details; the experimental results are then provided, followed by additional analysis.

Deepfake Dataset

FaceForensics++ (FF++) [59], DeepFakeTIMIT [60], DeepFake Detection Challenge Preview (DFDCpre) [61], and CelebDF [62] were the datasets used in this research. FaceForensics++ is a large face manipulation dataset widely used for deepfake detection. FaceSwap, DeepFakes, Face2Face, and NeuralTextures are the deepfake techniques used to construct its artificial faces, and there are more than 1000 deepfake videos in each category. Furthermore, the videos are provided at three different compression rates: the original (c0), lightly compressed (c23), and low-quality (c40). C23 videos are close to lossless, with a compression quantization value of 23, while c40 videos have a compression quantization value of 40. Face swapping is used to construct DeepFakeTIMIT, which is based on the VidTIMIT dataset; its fake videos are divided into low-quality (LQ) and high-quality (HQ) categories, each with a distinct resolution of fabricated faces and numerous fake videos. The DFDC Preview dataset, which contains around 5000 videos, is a sample of the DFDC data collection. The real videos include performers of varied skin tones, genders, and ages, as well as varied lighting, head positions, and visual backdrops. Two deepfake techniques are applied to create the artificial faces, each with its own quality settings. CelebDF is a challenging deepfake video dataset; the celebrities’ genders, ages, and races vary, and the original videos were acquired from YouTube. The visual quality of the fake videos has increased thanks to an upgraded deepfake synthesis algorithm. All videos in the datasets are first split into frames, and DLIB is then used to obtain the facial feature points in each frame, which aids in locating and cropping the face region. The experimental evaluation uses accuracy (ACC) and the area under the curve (AUC); artifacts are detected at the frame level, i.e., at the image level. The greater the AUC and ACC values, the more effective the method. Table 2 summarizes the most commonly employed databases in this area, in which the visual manipulations are face replacements.

Table 2 Dataset summary

Implementation Details

The Adam optimizer was picked as one of the best stochastic gradient descent (SGD) methods because it combines the best features of AdaGrad and RMSProp and deals easily with noise and sparse gradients during training. Clipping the videos is done with FFmpeg. To create and use the DL models, we used TensorFlow on the backend and Python 3.6 on the frontend. scikit-learn was used to implement the Linear SVC ML model and to calculate performance measurements. PySyft is used to implement the blockchain system; it is a Python library for secure and private ML that can be used in blockchain and decentralized systems, and it can implement FL algorithms, in which multiple parties train an ML model on their data while keeping the data private. The computations were carried out on an Intel(R) Core(TM) i9 processor running Windows 11, with an NVIDIA Quadro RTX 6000 graphics card and 64 GB of RAM. For a handful of the complex calculations, Google Colab Notebook was employed.

Results and Analysis of Experiments

The experimental data and analyses are presented in this section. We employed TL to improve accuracy and cut down on training time. The ImageNet dataset is used to pre-train all DL modules. After employing the TL technique, the detection accuracy on low-quality videos increased by varying amounts on the different datasets, as shown in Table 3, and the detection accuracy on HQ videos increased by 0.4% for the DFDC dataset. The success of the TL approach is demonstrated by its ability to enhance detection accuracy and AUC; enabling TL also increases the AUC on LQ videos by 0.8 for the CelebDF dataset. Extensive experiments show that the BFLDL method delivers excellent results on a variety of datasets. In general, a model trained with TL will outperform a model trained from scratch (without TL), because TL enables the model to apply knowledge learned from a big, general-purpose dataset, such as ImageNet, to the new task. When a model is trained from scratch, it begins with random weights and must learn everything about the task from the provided data; this can be difficult if the dataset is small or not very diverse, as the model may not have enough data from which to learn. When a model is trained using TL, the weights have already been learned on a large and diverse dataset, which provides a strong starting point and can help the model learn more quickly and accurately on the new, smaller, and more specialized dataset. As a result, a model trained with TL generally outperforms a model trained from scratch on a new dataset, although the task, the dataset, and the model design determine the precise performance difference. The BFLDL method’s accuracy and AUC score are greater than 96% on all datasets examined. The problem of missing face texture features has been handled in these four datasets thanks to the employment of a more powerful deepfake synthesis method. In general, the BFLDL method produces the best outcomes.

Table 3 Accuracy of the method on different datasets
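
A minimal transfer-learning sketch is given below; since the exact backbone is not restated here, Xception pretrained on ImageNet is used purely as an example of freezing ImageNet weights and training a new classification head on the (smaller) deepfake data.

# Hedged TL sketch: reuse ImageNet features, then train a small head for real/fake.
import tensorflow as tf

base = tf.keras.applications.Xception(weights="imagenet",
                                      include_top=False,
                                      input_shape=(299, 299, 3))
base.trainable = False                     # freeze the pretrained ImageNet weights

inputs = tf.keras.Input(shape=(299, 299, 3))
x = tf.keras.applications.xception.preprocess_input(inputs)
x = base(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC()])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
# Optionally unfreeze the top layers of `base` and fine-tune with a lower learning rate.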

The influence of the number of input frames was also explored. The criteria for selecting input frames for deepfake detection differ depending on the method used; frame consistency, frame quality, and relevant information are common considerations. Using more input frames can improve performance, but every additional frame also increases the number of parameters. Two frames are not considered, since the 3D model cannot properly exploit the time dimension with only two frames; the input is therefore analyzed for three, four, five, and six frames. Table 4 shows the experimental results for the number of input frames. The network with four frames as input achieves the best accuracy, while five input frames remain a reasonable choice for providing significant temporal information without requiring too many parameters. We restricted the experiments to this narrow range of frame counts for simplicity and to reduce the number of variables tested. Moreover, we found that using more than six input frames did not improve the performance of the deepfake detection method significantly, and the gain did not justify the additional computational cost.

Table 4 ACC of systems with various numbers of input frames on the CelebDF dataset
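
The following sketch illustrates one simple way to sample a fixed number of evenly spaced frames per clip so that the input-frame count can be swept over the values reported in Table 4; the sampling strategy and output size are illustrative assumptions, not the authors' exact sampler.

# Sample `num_frames` evenly spaced frames from a clip (e.g., num_frames in 3..6).
import cv2
import numpy as np

def sample_frames(video_path, num_frames=5, out_size=224):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced indices give consistent temporal coverage of the clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, (out_size, out_size)))
    cap.release()
    return np.stack(frames) if frames else np.empty((0, out_size, out_size, 3))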

Comparison with Other Methods

Deepfake detection techniques use different datasets as input, depending on the settings of their experiments. The tests here are performed on c40 samples from FF++, HQ and LQ samples from DeepFakeTIMIT, manipulated samples from DFDCpre, and all samples in CelebDF. We reimplemented the competing methods and trained all techniques with the same settings. Table 5 shows the outcomes. According to the experimental observations, the proposed method outperforms existing cutting-edge deepfake detection techniques. BFLDL achieves an accuracy above 97% on FF++ and DeepFakeTIMIT, a detection accuracy of 98.9% on CelebDF, and 98.1% on DFDCpre, so the variations between samples from different datasets are accounted for. The tolerable performance on the heavily compressed c40 samples of FF++ demonstrates strong detection ability even at a high compression rate, and the good performance on DFDCpre demonstrates detection capability when a variety of post-processing operations are applied. Furthermore, the BFLDL system's success on CelebDF demonstrates its ability to handle high-quality manipulation examples. Overall, the comparison with other techniques confirms the effectiveness of the BFLDL approach.

Table 5 Accuracy comparison of the BFLDL system with various deepfake detection algorithms on different datasets

The BFLDL system also offers a high level of security and privacy. As shown in Fig. 3, the running cost rises with the number of clients or transactions because of the increased communication overhead; the reported numbers are normalized. In the context of FL, normalization refers to scaling and shifting the data on each client. Normalization is often applied before the data are used to train an ML model, since it helps the model converge faster and improves its generalization ability. In the case of the FF++ dataset, for example, the data are split and distributed among the clients participating in the FL process. Each client then normalizes its portion of the data before training a local model, and the local models are aggregated to form a global model that is used to make predictions [67]. The added value of FL can be appreciated by comparing the performance of the global model trained with FL to that of a model trained with a traditional, centralized approach. We gathered data from five different resources or clients to assess FL's effectiveness; in this paradigm, multiple clients can share data and learn collaboratively. As illustrated in Fig. 4, the total cost of the distributed BFLDL model increases as the number of sources or clients grows, but using multiple clients yields better outcomes.
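
The workflow described above can be summarized by the simplified federated-averaging sketch below, in which each client normalizes its own data, trains a local copy of the model, and the global model averages the clients' weights. The model builder, client data, and single local epoch are placeholders; the actual system additionally uses PySyft and a blockchain layer.

# Simplified FedAvg round with per-client normalization (illustrative only).
import numpy as np
import tensorflow as tf

def normalize(x):
    # Each client scales and shifts its own data before local training.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def federated_round(global_model, clients):
    """One round: local training on normalized client data, then weight averaging."""
    client_weights = []
    for x, y in clients:                       # clients = [(features, labels), ...]
        local = tf.keras.models.clone_model(global_model)
        local.set_weights(global_model.get_weights())
        local.compile(optimizer="adam", loss="binary_crossentropy")
        local.fit(normalize(x), y, epochs=1, verbose=0)
        client_weights.append(local.get_weights())
    # Aggregate: simple (unweighted) average of each layer's weights across clients.
    averaged = [np.mean(layer, axis=0) for layer in zip(*client_weights)]
    global_model.set_weights(averaged)
    return global_model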

Fig. 3

The running cost grows with the number of transactions; the communication load is 1, 0.7, 0.5, 0.3, and 0.1 for resources/clients 1 to 5, respectively

Fig. 4

Comparison of performance between the local model (CN) and BFLDL methods

The model loss convergence is shown in Fig. 5. Because the selection from the different resources is not the same, the accuracy does not change smoothly, as seen in Fig. 6; the precision depends on the number of deepfakes used, and the model loss behaves in the same way. The effect of increasing the number of providers is also evident. Each resource trains a local model on normalized data, and the global model integrates all local models, so the number of providers affects the performance of the collaborative model. Furthermore, Fig. 7 illustrates the run time; the number of iterations varies across the different sub-datasets.

Fig. 5

The loss of deepfake detection for different providers

Fig. 6

The accuracy of deepfake detection for various providers

Fig. 7

The time of deepfake detection for various providers

In summary, the BFLDL approach, which combines FL, blockchain technology, and DL, detects deepfake videos. It uses FL to train a model on data from multiple devices without transmitting the data to a central server, and it uses DL techniques to extract features from images and improve the model's generalization. The BFLDL system also includes a data-normalization step to standardize data from different sources and a TL method to improve accuracy and efficiency.

Conclusion and Future Work

This study offered a framework for enhancing fake-video detection and exchanging data among clients while preserving privacy by employing up-to-date methods. Data normalization is used to cope with the diversity of the information, whereas prior techniques require a large amount of data to train an accurate model. We utilize a CN to detect deepfakes in images and videos because of its high detection rate: deepfake videos are detected through CN-based segmentation and classification, combined with a method for jointly building a global model using blockchain technology and FL, and the CN helps the DL models perform better in their interior layers. Extensive experiments were conducted to train and evaluate multiple DL models, and a TL approach was employed to boost DL performance. In addition, we deployed the deep global model on the edge platform to reduce communication costs between the client and global models. Because it can learn from shared sources or data from several sources, the BFLDL model is intelligent. Finally, a comparison of BFLDL with benchmarks showed that it exceeded cutting-edge works in terms of accuracy and AUC.

Future studies will involve real-time deepfake detection of online intrusions, examining both video and audio-visual data in 1-s intervals. In addition, combining the proposed method with other techniques such as matrix algebra [68], Kalman filtering [69], multitask learning-enabled graph convolutional networks [70], active subspace random optimization [71], and deep neural network-based logical and activity learning models [72] is an exciting line for future research.