1 Introduction

Credit card transactions have increased significantly in recent years due to the rapid development of electronic services, including e-commerce, electronic banking, and mobile payments, and the widespread use of credit cards. Without strict verification and oversight, this widespread card use across many transaction scenarios results in billions of dollars in losses from credit card fraud. The exact loss is difficult to calculate; however, according to the Nilson Report [1], fraud losses totaled $18.39 billion in 2018, compared with $14.99 billion in 2017. Total payment card volume worldwide is expected to reach $57.080 trillion in 2023, with gross card fraud reaching $35.67 billion, and this figure is expected to increase significantly in the coming years, with global gross losses from card fraud projected to reach $40 billion by 2027.

Fraudulent transactions may be carried out using either a card stolen from internal or external sources or falsified credit card information [2]. Credit card fraud detection has been widely discussed by multiple researchers [3,4,5,6,7]. Most of the proposed algorithms use supervised machine learning models to recognize whether a transaction is fraudulent or legitimate. Detecting credit card fraud is an important step in stopping fraud incidents. However, there are major challenges in developing an ideal fraud detection system for banks, such as dataset insufficiency and skewed class distribution.

Dataset insufficiency: The lack of available public datasets is the main issue associated with FDS. Data security and privacy concerns impose barriers to data sharing among banks. Therefore, in this study, a federated learning approach is deployed to allow different banks to collaboratively construct an efficient fraud detection model without disclosing the private data of each bank’s clients. The federated learning strategy aims to build an integral global model by aggregating locally computed updates of the shared fraud detection model on distributed datasets, without sharing raw data and while preserving data privacy [8, 9].

Skewed distribution (class imbalance): Credit card transaction data at all banks are highly imbalanced; only a small percentage of transactions involve fraud, while the majority are legitimate purchases. In most cases, about 98% of transactions are normal, while less than 2% are fraudulent. In this situation, it is particularly challenging for predictive modeling algorithms to find patterns in the data of the minority class. As a result, classifier performance is significantly impacted by the skewed class distribution. The class imbalance problem arises in several domains and has been addressed in several ways [10,11,12,13]. Figure 1 depicts a block diagram of the FDS with an unbalanced dataset.

Fig. 1 Block diagram of the CCFD model with an unbalanced dataset

1.1 Motivation and contributions

The following resampling methods are applied as a preliminary step in processing the unbalanced credit card transaction dataset: oversampling techniques, namely the synthetic minority oversampling technique (Smote), adaptive synthetic sampling (AdaSyn), and random oversampling (ROS), and an undersampling technique, random undersampling (RUS).

According to previous studies, different class balancing approaches cause classification algorithms to perform with varying degrees of accuracy. This prompts us to select several classification algorithms and compare their performance after applying data balancing strategies. After the imbalanced class distribution is balanced using the resampling approaches outlined above, the generated dataset is used for training and testing various conventional machine learning and deep learning algorithms. The machine learning and deep learning algorithms used in this study are: RF, DT, NB, KNN, LR, and a convolutional neural network (CNN).

Next, a comparative study of the effect of resampling techniques on the effectiveness of classification algorithms has been conducted, and appropriate techniques for handling the data imbalance problem are proposed. Finally, a federated learning model has been built over multiple frameworks to address data security and privacy challenges, as shown in Fig. 2.

Fig. 2 Federated learning model for FDS

Our contributions are summarized as follows:

Firstly, we applied the individual and hybrid resampling techniques with common machine learning classifiers; then, to verify the performance of the hybrid resampling techniques with machine learning, we compared the proposed hybrid approach with six state-of-the-art methods.

Secondly, we applied the individual and hybrid resampling techniques with the CNN classifier; then, to verify the performance of the individual resampling techniques with CNN, we compared the proposed approach with two state-of-the-art methods.

Thirdly, after handling the unbalanced data, we built a federated learning model to address the major issue in credit card fraud detection by training the model on data distributed across the banks’ local databases. With this approach, financial institutions can collectively reap the benefits of a shared global model, which has seen more fraud than each bank alone, without sharing their datasets.

Finally, we executed the proposed federated learning model with different optimization techniques and several batch sizes over different platforms (PyTorch and TensorFlow Federated) to determine the best platform in terms of accuracy and computation time.

The remainder of this paper is organized as follows: Section 2 reviews previous work on federated learning models for identifying fraudulent transactions, the imbalanced classification problem of CCFD, and the integration between them. Section 3 provides background on all of the resampling methods used. Section 4 details the proposed hybrid resampling approaches with the pseudocode of each approach. Section 5 presents the main steps of the proposed federated learning model. Section 6 reports the experimental results of all proposed hybrid resampling approaches using machine learning classifiers and the CNN classifier on a benchmark dataset, as well as the effectiveness of the federated learning model on different platforms. Finally, Sect. 7 concludes the paper and briefly suggests our future work.

2 Related work

Fraud detection algorithms use machine learning to efficiently identify fraudulent transactions. Most proposed CCFDSs are built using centralized learning models, and only a handful of researchers have built federated learning models to tackle fraud detection. Supervised, unsupervised, and semi-supervised learning models use centralized learning strategies [14,15,16,17,18]. In data mining tasks, fraud detection is viewed as a classification problem over a set of card transactions. A comparative study on CCFD [19] has been conducted using supervised approaches such as extreme gradient boosting (XGB), DT, RF, LR, KNN, and SVM, and unsupervised approaches such as generative adversarial networks (GAN), the auto-encoder (AE), the restricted Boltzmann machine (RBM), and the one-class SVM (OCSVM). The authors of [20] evaluated the performance of various ML techniques, such as SVM, KNN, DT, and NB, for CCFD.

The federated learning (FL) concept plays an important role in the banking industry, especially in credit card fraud detection. With the development of credit card fraud detection systems comes the problem of data security and privacy protection, which FL can solve [21]. That paper proposed a federated neural network model as a suitable deep-learning model for identifying credit card fraud; however, it does not address the issue of privacy. In [22], a federated learning model is applied to detect credit card fraud on a real-time dataset; compared to centralized deep learning models, it improves the average test AUC by 10%.

In [23], the authors proposed two unsupervised deep learning models (AE and RBM) to identify credit card fraud using only a small number of parameters; the accuracy of the federated deep learning models is 88% and 94% for AE and RBM, respectively, while that of the centralized deep learning models is 99% and 92%. The work in [24] introduces a new protocol providing an efficient and privacy-preserving strategy based on FL with stochastic gradient descent, combining differential privacy with homomorphic encryption. The authors of [25] surveyed various fraud types, including behavioral fraud, application fraud, counterfeit fraud, theft fraud, and bankruptcy fraud; furthermore, fraud is predicted using decision trees, clustering algorithms, pairwise matching, neural networks, and genetic algorithms, and the corresponding performance metrics are reported. In order to combine the features of local and global models and obtain high performance with minimal communication expense, a feature fusion technique is developed in [26].

Suvasini et al. [27] presented comparative research on credit card fraud detection utilizing seven widely used classification techniques. The experimental findings demonstrated that, for real-time datasets, the decision tree classifier outperforms the other classifiers at predicting credit card fraud; however, the SVM model still detects fewer fraudulent transactions than the decision tree model. Mohd [28] developed genetic algorithm and scatter search techniques in which the credit card limit is assumed and used as the cost of misclassification; the proposed technique determines the card’s available limit based on how fraudsters use it. Kundu et al. [29] suggested a method to model the transaction sequence using a hidden Markov model and a K-clustering model. Based on the cardholder’s spending behavior, the proposed model created clusters for low, medium, and high spending amounts, and its outcomes have been shown to speed up fraud detection.

The class imbalance problem, which has drawn considerable interest across the various application fields of machine learning-based classification, is the primary obstacle to developing a prediction model for CCFD [30, 31]. Many approaches have been developed in different domains to address unbalanced data. Huang et al. [32] used deep learning to handle imbalanced data in face analysis, addressing the problem with a cost-sensitive approach and a class resampling technique. In [33], Ouyang et al. proposed a framework for oil spill problems and showed that the imbalanced dataset problem decreases the learning model’s performance. Yang et al. [34] introduced a sample subset optimization technique that handles the class imbalance distribution problem in bioinformatics applications using ensemble learning. Sun et al. [35] created EUS-Bag, an evolutionary under-sampling method with a dedicated fitness function based on a bagging ensemble framework. The PSOAANN approach, a hybrid of particle swarm optimization and auto-associative neural networks, was proposed by Kamaruddin and Ravi [36]. Wei et al. [37] presented an efficient solution to the unbalanced data issues in online credit card fraud detection.

3 Materials and methods

3.1 Resampling techniques

The resampling approach is a popular method for handling highly imbalanced datasets. Resampling approaches come in two varieties: the undersampling technique removes certain samples from the majority class (blue data points), while the oversampling technique adds more examples from the minority class (orange data points), as demonstrated in Fig. 3.

Fig. 3 Main resampling techniques [38]

3.1.1 Oversampling techniques

3.1.1.1 Random oversampling technique (ROS)

To address the issue of class imbalance, ROS [39] is a simple and widely used oversampling method: samples from the minority class are chosen at random and duplicated, and the new samples are combined with the original data when training the machine learning models. Because this technique partially replicates the original minority samples, it increases the likelihood that the model will overfit.

3.1.1.2 Synthetic minority oversampling technique (smote)

Smote [40] is a common technique for class imbalance problems. This resampling technique creates synthetic data points by interpolating new instances between existing data points of the minority class. The k-nearest neighbors (KNN) algorithm is used to construct the interpolation of the synthetic data instances: KNN selects new minority class data points according to the required number of synthetic instances, which are then added to the original dataset. Smote performs efficiently when the dataset is small; when the dataset is large, however, the process does not function as effectively, and creating more synthetic data points requires more computation time.
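As an illustration, a minimal sketch of applying Smote with the imbalanced-learn library used later in the experiments is shown below; the file name, 80:20 split, and k = 5 are assumptions for illustration only.

```python
# Minimal sketch: Smote oversampling of the minority (fraud) class.
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df = pd.read_csv("creditcard.csv")                # assumed local copy of the Kaggle dataset
X, y = df.drop(columns=["Class"]), df["Class"]    # "Class" = 1 for fraud, 0 for legitimate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# k_neighbors is the KNN parameter used for the interpolation described above.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X_train, y_train)
print(y_res.value_counts())                       # both classes are now equally represented
```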

3.1.1.3 Adaptive synthetic sampling (AdaSyn)

AdaSyn [41] is an adaptive oversampling technique proposed to avoid the limitations of Smote that arise while creating synthetic data samples, since Smote may increase the likelihood of overlapping data points. In the AdaSyn oversampling approach, synthetic data instances are produced according to the density distribution of the minority class. AdaSyn rebalances the imbalanced dataset and reduces the learning bias.

3.1.2 Undersampling techniques

3.1.2.1 Random undersampling (RUS)

RUS is the most popular and simple resampling technique for class-imbalanced datasets [42]. Although RUS is quicker than other resampling methods, it discards valuable data; consequently, it can decrease the performance of the classification algorithm during learning.
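A corresponding sketch for RUS is given below (X_train and y_train as in the Smote sketch above); RandomOverSampler and ADASYN from imbalanced-learn expose the same fit_resample interface, so the other individual techniques in this section can be applied in the same way.

```python
# Minimal sketch: random undersampling of the majority (legitimate) class.
from imblearn.under_sampling import RandomUnderSampler

# sampling_strategy=1.0 keeps one majority sample per minority sample (a 1:1 ratio).
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
```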

4 The proposed hybrid resampling techniques

4.1 Oversampling followed by undersampling

4.1.1 ROS followed by RUS

To balance the class distribution of a dataset, this approach employs a hybrid resampling strategy that combines random oversampling (ROS) and random undersampling (RUS). The algorithm takes four parameters: X, PRUS, PROS, and Nmin. X is the original dataset, which includes samples from both the majority and minority classes. PRUS is the RUS percentage, which defines how many samples are deleted from the majority class. PROS is the ROS percentage, which decides how many minority class samples are replicated. Nmin denotes the number of minority class samples in X.

The algorithm is divided into two steps: oversampling and undersampling. In the oversampling step, the technique creates NROS new minority class samples by randomly picking and duplicating minority samples, where NROS is calculated by multiplying Nmin by PROS. The additional samples are saved in an array SR before being added to X to create a new dataset SROS. In the undersampling step, the approach removes NRUS majority class samples from SROS by randomly picking and deleting them, where NRUS is obtained by multiplying Nmaj, the number of majority class samples in SROS, by PRUS. The result is the array S(ROS+RUS), which is the algorithm’s final output.

Algorithm 1: ROS followed by RUS (pseudocode)
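A minimal NumPy sketch of this procedure, using the paper's PROS and PRUS parameters, is shown below; the function name, label convention (1 = fraud), and default values are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of Algorithm 1: random oversampling of the minority class followed by
# random undersampling of the majority class.
import numpy as np

def ros_then_rus(X, y, p_ros=1.0, p_rus=0.5, minority=1, majority=0, seed=42):
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)

    # Oversampling step: duplicate N_ROS = N_min * P_ROS randomly chosen minority samples.
    min_idx = np.flatnonzero(y == minority)
    n_ros = int(len(min_idx) * p_ros)
    dup = rng.choice(min_idx, size=n_ros, replace=True)
    X, y = np.vstack([X, X[dup]]), np.concatenate([y, y[dup]])

    # Undersampling step: delete N_RUS = N_maj * P_RUS randomly chosen majority samples.
    maj_idx = np.flatnonzero(y == majority)
    n_rus = int(len(maj_idx) * p_rus)
    drop = rng.choice(maj_idx, size=n_rus, replace=False)
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return X[keep], y[keep]

# Example call with assumed percentages (duplicate the minority once, drop half the majority):
# X_bal, y_bal = ros_then_rus(X_train, y_train, p_ros=1.0, p_rus=0.5)
```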

4.1.2 Smote followed by RUS

This algorithm is a hybrid sampling technique that combines oversampling (SMOTE) and undersampling (RUS) to deal with imbalanced data. It aims to improve classification accuracy by generating a more representative dataset for both classes: SMOTE increases the diversity and density of the minority class, and RUS reduces the majority class’s noise and redundancy. The algorithm creates a new dataset called SSmote, which contains more instances of the minority class than the original dataset X, by combining X with the synthetic set S. After that, Ns = Nmaj * PR instances from the majority class in SSmote are chosen at random and removed from the dataset. The remaining instances form the final resampled dataset, a new array S(Smote+RUS), which has a more evenly balanced class distribution than X.

Algorithm 2: Smote followed by RUS (pseudocode)
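Alternatively, the oversample-then-undersample hybrids can be expressed with imbalanced-learn samplers, as in the sketch below for Smote followed by RUS (X_train and y_train as in the earlier sketches); swapping SMOTE for ADASYN gives the AdaSyn-followed-by-RUS variant of Sect. 4.1.3, and the sampling ratios are assumptions, not the paper's exact PS and PR percentages.

```python
# Sketch: Smote then RUS inside an imbalanced-learn Pipeline, so the resampling
# is only ever fitted on the training portion of the data.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

smote_rus_rf = Pipeline(steps=[
    ("smote", SMOTE(sampling_strategy=0.1, random_state=42)),             # grow the minority class
    ("rus", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),  # shrink the majority class
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
smote_rus_rf.fit(X_train, y_train)
```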

4.1.3 AdaSyn followed by RUS

This approach is a hybrid sampling technique that combines the AdaSyn and RUS algorithms to deal with imbalanced data. The pseudocode begins by calculating the number of synthetic data samples required for the minority class. Then, for each minority class instance Xi, it locates its K nearest neighbors of the same class and computes a ratio Ri that measures how difficult it is to learn Xi based on how many of its neighbors belong to the majority class. AdaSyn is applied first, creating synthetic data instances by selecting one of the K neighbors Xsi at random and interpolating between Xi and Xsi with a uniform random factor; these instances are combined with the original dataset X to form the new dataset SAdaSyn, which contains more instances of the minority class than before. Following that, using RUS, Ns = Nm * P instances are selected at random from the majority class in SAdaSyn and deleted from the dataset.

Algorithm 3: AdaSyn followed by RUS (pseudocode)

4.2 Undersampling followed by oversampling

4.2.1 RUS followed by ROS

This approach is a hybrid of the undersampling technique (RUS) followed by the oversampling technique (ROS). First, the number of instances to be deleted from the majority class is calculated as NRUS = Nmaj * PRUS. The algorithm chooses NRUS instances at random from the majority class in X and deletes them from the dataset, saving the remaining data in a new array SRUS, which contains fewer instances of the majority class than before. Then, the number of instances to be duplicated from the minority class is calculated as NROS = Nmin * PROS. The algorithm chooses NROS instances at random from the minority class in SRUS and duplicates them, saving the duplicated instances in a new array SROS, which contains more instances of the minority class than before. Finally, it combines SRUS and SROS to create the final resampled dataset S(RUS+ROS).

Algorithm 4: RUS followed by ROS (pseudocode)
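The reversed order can be sketched in the same way; in the snippet below, RUS is applied first and ROS second, and replacing RandomOverSampler with SMOTE or ADASYN yields the variants of Sects. 4.2.2 and 4.2.3 (the sampling ratios are again illustrative assumptions).

```python
# Sketch: RUS then ROS with imbalanced-learn samplers.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

rus_ros = Pipeline(steps=[
    ("rus", RandomUnderSampler(sampling_strategy=0.05, random_state=42)),  # delete part of the majority class
    ("ros", RandomOverSampler(sampling_strategy=1.0, random_state=42)),    # duplicate minority samples up to 1:1
])
X_bal, y_bal = rus_ros.fit_resample(X_train, y_train)
```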

4.2.2 RUS followed by smote

This is a hybrid method for dealing with imbalanced datasets in machine learning. To balance the class distribution, it combines random undersampling (RUS) and the synthetic minority oversampling technique (SMOTE). RUS: the algorithm first reduces the size of the majority class by a percentage determined by PR, creating a subset SRUS by randomly selecting samples from the majority class until the number of samples reaches Ns, which is calculated as a percentage of the original majority class size. SMOTE: for each minority class sample in SRUS, the algorithm finds its K nearest neighbors from the minority class and generates new synthetic samples by interpolating between the minority sample and its neighbors; the SMOTE percentage PS determines the number of new samples to be created. Combining RUS and SMOTE: merging the undersampled majority class dataset SRUS with the oversampled minority class dataset SSmote yields the final resampled dataset S(RUS+Smote).

Algorithm 5: RUS followed by Smote (pseudocode)

4.2.3 RUS followed by AdaSyn

Algorithm 6 is another hybrid method for dealing with imbalanced datasets in machine learning. To balance the class distribution, it employs a combination of random undersampling (RUS) and adaptive synthetic sampling (AdaSyn).

RUS: The algorithm begins by reducing the size of the majority class by a percentage defined by P. It creates a subset SRUS by randomly selecting samples from the majority class until the number of samples reaches Ns, which is calculated as a percentage of the original majority class size.

AdaSyn: based on a given parameter β, which represents the intended balance level, the algorithm determines how many synthetic samples must be created for the minority class. It locates the K nearest neighbors of each minority class sample in SRUS and calculates a ratio Ri, which indicates how many of the neighbors belong to the majority class. After normalization, Ri is used to calculate the number of synthetic samples needed for each minority sample, and the synthetic samples are created by interpolating between the minority sample and one of its randomly selected minority class neighbors. Integrating RUS and AdaSyn: the undersampled majority class dataset SRUS and the oversampled minority class dataset SAdaSyn are combined to create the final resampled dataset S(RUS+AdaSyn).

Algorithm 6: RUS followed by AdaSyn (pseudocode)

4.3 Oversampling followed by oversampling

4.3.1 ROS followed by smote

This approach helps to improve classifier performance on imbalanced datasets by increasing the diversity and representation of the minority class. To balance the class distribution, it combines Random Over-Sampling (ROS) and Synthetic Minority Over-Sampling Technique (SMOTE). ROS: The algorithm begins by increasing the size of the minority class by a percentage determined by PROS. It draws samples from the minority class at random and duplicates them to form a subset SR until the number of samples reaches NROS, which is calculated as a percentage of the original minority class size.

SMOTE: the algorithm finds K nearest neighbors from the minority class for each minority class sample in SROS, which is the union of the original dataset X and the oversampled subset SR. It then generates new synthetic samples by interpolating between the minority sample and its neighbors; the SMOTE percentage PS determines the number of new samples to be created. Combining the ROS-oversampled dataset SROS with the SMOTE-oversampled dataset SSmote yields the final resampled dataset S(ROS+Smote).

Algorithm 7: ROS followed by Smote (pseudocode)

4.3.2 ROS followed by AdaSyn

The following algorithm implements the ROS + AdaSyn combination, combining random oversampling (ROS) and adaptive synthetic sampling (AdaSyn) to balance the class distribution. First, ROS increases the minority class size by a percentage determined by PROS: it draws samples from the minority class at random and duplicates them to form a subset SR until the number of samples reaches NROS, which is calculated as a percentage of the original minority class size. Then, using AdaSyn and based on a given parameter β, which represents the intended balance level, the algorithm determines how many synthetic samples must be created for the minority class. It locates K nearest neighbors for each minority class sample in SROS, which is the union of the original dataset X and the oversampled subset SR, and computes a ratio Ri, which indicates the proportion of neighbors that belong to the majority class. After normalization, Ri is used to calculate the number of synthetic samples needed for each minority sample. The minority sample and one of its randomly selected minority class neighbors are interpolated to create the synthetic data instances, based on the minority class density distribution.

Combining ROS and AdaSyn: The over-sampled minority class dataset S, which includes both duplicated and synthetic samples, is combined with the original dataset X to create the final resampled dataset S(ROS+ AdaSyn).

Algorithm 8: ROS followed by AdaSyn (pseudocode)

4.3.3 Smote followed by AdaSyn

This is a hybrid method for dealing with imbalanced datasets in machine learning. Each phase is described below:

4.3.3.1 Synthetic minority over-sampling technique (SMOTE)

The algorithm determines K nearest neighbors from the minority class for each minority class sample in the original dataset X. Next, by interpolating between the minority sample and its neighbors, it creates new synthetic samples. PS percent of SMOTE determines how many new samples need to be created. SSmote is created by combining the original dataset X with the oversampled minority class dataset S.

4.3.3.2 Adaptive synthetic sampling (AdaSyn)

Based on a given parameter β, which represents the intended balance level, the algorithm determines how many synthetic samples must be generated for the minority class. The algorithm locates K nearest neighbors for each minority class sample in SSmote and calculates a ratio Ri, which indicates how many of the neighbors are members of the majority class. After normalization, the ratio Ri is used to calculate the number of synthetic samples needed for each minority sample. The minority sample and one of its randomly selected neighbors from the minority class are interpolated to create the synthetic samples.

4.3.3.3 Combining SMOTE and AdaSyn

The over-sampled minority class dataset S, which is made up of both synthetic samples produced by SMOTE and AdaSyn, is combined with the original dataset X to create the final resampled dataset S(Smote+AdaSyn).

Algorithm 9: Smote followed by AdaSyn (pseudocode)
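A short sketch of this two-stage oversampling, chaining SMOTE and ADASYN from imbalanced-learn, is shown below; the intermediate 0.5 ratio and the final 1:1 target stand in for the PS and β parameters and are assumptions.

```python
# Sketch: Smote followed by AdaSyn (two consecutive oversampling steps).
from imblearn.over_sampling import SMOTE, ADASYN

X_mid, y_mid = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X_train, y_train)
X_bal, y_bal = ADASYN(sampling_strategy=1.0, random_state=42).fit_resample(X_mid, y_mid)
```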

5 The proposed federated learning model

All banks first agree on a standard global fraud detection model (the model’s architecture, the activation function in each hidden layer, the loss function, etc.). The existence of heterogeneity may lead to misconvergence of the global model; therefore, the proposed model requires handling the skewed data. The unbalanced data problem leads the learned classifier to identify most fraud transactions as genuine ones. As a result, solving the unbalanced data issue is a necessary step before developing a global fraud detection model. In this way, the federated learning performance, which is affected by statistical heterogeneity in real-world scenarios, is improved.

In the proposed model, each client (bank) trains the fraud detection model on the training data in its local database. Firstly, it handles the skewed data and normalizes the features to an appropriate interval. It then runs a neural network classification technique with an optimizer to obtain the optimal learning model parameters (gradients). Finally, it sends the gradients to the server. The combined global model detects more fraud than each bank independently, even though the datasets are never shared. Figure 4 illustrates the main steps of the client–server process.

Fig. 4 The main steps of the client–server process
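A framework-agnostic sketch of one round of the client–server loop in Fig. 4 is given below, using FedAvg-style weighted averaging of the locally trained weights; local_update is a hypothetical helper standing in for each bank's local resampling, normalization, and CNN training, and is not the paper's exact code.

```python
# Minimal sketch of one federated round: each client trains locally, then the server
# aggregates the returned weights using a data-size-weighted average (FedAvg style).
import numpy as np

def federated_round(global_weights, client_datasets, local_update):
    """global_weights: list of NumPy arrays; client_datasets: list of (X, y) per bank."""
    local_weights, sizes = [], []
    for X_local, y_local in client_datasets:
        # Each bank starts from the shared global model and trains on its own data;
        # only the resulting weights (not the raw transactions) are sent back.
        local_weights.append(local_update(global_weights, X_local, y_local))
        sizes.append(len(y_local))
    total = float(sum(sizes))
    return [
        sum(w[i] * (n / total) for w, n in zip(local_weights, sizes))
        for i in range(len(global_weights))
    ]
```

In the experiments reported below, the corresponding loop is executed for 100 iterations on the TensorFlow Federated and PyTorch-PySyft platforms.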

6 Experimental results

In this section, the impact of individual and hybrid resampling strategies on the credit card fraud detection dataset is compared. Different classification methods are evaluated, including DT, Gaussian NB, RF, KNN, and logistic regression. A federated learning model has also been built over multiple frameworks to address data security and privacy challenges.

The experiments in this work were conducted using the Python programming language (Python 3). We utilized the open-source libraries scikit-learn (1.1.3), pandas (1.4.4), NumPy (1.22.3), Matplotlib (3.5.3), TensorFlow Federated (0.17.0), PyTorch (1.2.0), and imbalanced-learn (0.9.1). The experiments were carried out on a desktop computer with an Intel Core i7 1.80 GHz CPU, 16 GB of RAM, and a Windows 10 64-bit operating system.

6.1 Dataset

The Kaggle dataset [43] used in this research contains actual but anonymized credit card transactions performed by European cardholders. The dataset includes 284,807 credit card transactions made over two days in September 2013. There is no missing data, and only 492 of the 284,807 transactions are fraudulent, resulting in a heavily skewed dataset. Furthermore, it has 30 features, of which only two are known in their original form, namely the transaction amount and time. See Table 1.

Table 1 Overview of the dataset obtained from Kaggle

6.2 Results and discussion

In this section, performance metrics including precision, recall, accuracy, loss, F1-measure, and total computational time are discussed to assess the effectiveness of all of the classifiers used in conjunction with the resampling techniques. For accuracy evaluation, the machine learning classification techniques and the CNN classifier have been adopted for comparison, and the comparative results were obtained using an (80:20) training–testing split.

6.2.1 Machine learning classifier with data balancing techniques

This section shows the experimental results of the individual and hybrid resampling techniques in conjunction with the common machine learning classifiers.

Table 2 shows that Smote with RF achieves the best results in terms of accuracy, F1-score, and loss, but the worst in terms of computation time. In contrast, ROS with DT is the best in terms of computation time and most performance parameters. Hybrid resampling techniques were then applied to obtain more effective results, as shown in Table 3. This table shows that oversampling followed by oversampling, and oversampling followed by undersampling, in combination with the RF classifier, are the best resampling strategies for the class imbalance issue among all hybrid resampling methods.

Table 2 Comparison results of common machine learning classifiers after applying resampling techniques on credit card dataset
Table 3 Comparison results of common machine learning classifiers after applying hybrid resampling techniques on the credit card dataset

Regarding each classifier’s precision for the various resampling methods (Table 3), ROS + RUS, ROS + Smote, and ROS + AdaSyn routinely outperform the other methods. Among the machine learning classifiers, RF and DT attain the highest precision values.

The triple-equal symbol used in Table 3 indicates that the results of applying Smote followed by ROS are the same as those of ROS followed by Smote.

For further validation, the proposed hybrid approach is compared with several previous works, as shown in Tables 4 and 5; the proposed hybrid resampling technique outperforms the previous works according to the performance measures.

Table 4 Comparison between previous work [3, 44,45,46,47] and our work
Table 5 Comparison between Ata`s work [47] and our work

The triple-equal symbol used in Table 5 indicates that the results of applying AdaSyn followed by ROS are the same as those of ROS followed by AdaSyn.

In Table 4, with the same (80:20) data distribution, our approach shows better performance for all common classifiers. Table 5 shows that the proposed hybrid oversampling technique (AdaSyn + ROS) outperforms the individual undersampling techniques for the different classifiers.

Cross-validation is a useful statistical approach for evaluating machine learning models multiple times and detecting overfitting. In this study, k-fold cross-validation has been used, with k = 10 selected via the grid search cross-validation method; at this setting, our study achieves the highest accuracy, especially for RF with most resampling techniques. The cross-validation mean values of our study are compared with the previous work [48], which reports the cross-validation mean values obtained by each classifier for different resampling techniques; the outcomes of the Tomek links (TMLK), ROS, and SMOTE techniques there are 0.964, 0.970, and 0.973, respectively. In our study, among all resampling techniques, ROS and ROS + AdaSyn achieve excellent results, and among the classifiers, RF achieves the highest mean value (0.9999). Table 6 presents the cross-validation mean values obtained by each classifier for the various resampling approaches.

Table 6 Cross-validation mean values for our study
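As an illustration of how such cross-validation means can be computed, a minimal sketch is given below; it pairs Smote with RF inside an imbalanced-learn Pipeline so that resampling is refit on each of the 10 training folds, and the scoring metric and estimator settings are illustrative assumptions (X and y as loaded in the earlier sketches).

```python
# Minimal sketch: 10-fold cross-validation of a Smote + RF pipeline.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

pipe = Pipeline(steps=[
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy", n_jobs=-1)
print(scores.mean())   # comparable to the mean values reported in Table 6
```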

Each classifier’s overall evaluation time includes both the training and testing phases. Table 7 shows the total time of our study. The total time of the previous work [48] is better than that of our study because that work used PCA as a preprocessing step. Due to the large amount of data, RF takes longer when combined with oversampling techniques. Although undersampling techniques save time, they may result in underfitting.

Table 7 Total time values for imbalanced methods of our study

Finally, the proposed hybrid resampling techniques combined with machine learning classifiers outperformed the individual resampling techniques.

6.2.2 CNN classifier with data balancing techniques

The following is the structure of a CNN used to detect credit card fraud:

The input layer takes a vector of attributes that characterize each transaction. The first convolutional layer applies 28 filters, followed by a rectified linear unit (ReLU) activation function; this layer extracts low-level information from the data, such as transaction frequency or spending trends. The second convolutional layer applies 32 filters, followed by another ReLU activation function.

The third convolutional layer applies 64 filters, followed by another ReLU activation function; this layer extracts higher-level traits from the data, such as anomalies and outliers. The max pooling layer reduces the dimensionality of the preceding convolutional layer’s output, which helps reduce overfitting and improve generalization. The flatten layer transforms the max pooling layer’s output into a one-dimensional vector that can be fed into a fully connected layer.

The output of the flatten layer is passed to the fully connected layer, which performs a linear transformation followed by a dropout operation. The dropout operation randomly sets some of the units to zero with a probability of 0.5, which helps prevent overfitting and improve robustness. The output layer takes the fully connected layer’s output and applies the sigmoid function to produce a probability over the two classes: fraud or not fraud.
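A Keras sketch of this architecture is given below, treating the 30 transaction attributes as a length-30 sequence for Conv1D layers; the kernel sizes, dense-layer width, and compile settings are assumptions where the text above does not specify them.

```python
# A hedged sketch of the described CNN, assuming Conv1D over the 30-feature vector.
import tensorflow as tf

def build_cnn(n_features=30):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_features, 1)),                           # transaction attribute vector
        tf.keras.layers.Conv1D(28, kernel_size=2, activation="relu"),    # low-level patterns
        tf.keras.layers.Conv1D(32, kernel_size=2, activation="relu"),
        tf.keras.layers.Conv1D(64, kernel_size=2, activation="relu"),    # higher-level traits
        tf.keras.layers.MaxPooling1D(pool_size=2),                       # dimensionality reduction
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),                    # fully connected layer (width assumed)
        tf.keras.layers.Dropout(0.5),                                    # dropout with probability 0.5
        tf.keras.layers.Dense(1, activation="sigmoid"),                  # P(fraud)
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```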

This section shows the experimental results of the individual and hybrid resampling techniques in conjunction with the CNN classifier. The hybrid resampling techniques performed very well with the machine learning classifiers; on the other hand, the individual resampling techniques performed better with the CNN classifier than the hybrid resampling techniques, as shown in Table 8. Among all resampling techniques, ROS and Smote achieved the highest accuracy values (99.93%).

Table 8 Comparison results of CNN classifier

For a fair comparison, the same hyper-parameters as in the previous works are used; no existing work provides the same level of efficiency. For class imbalance problems, the Smote resampling strategy works best with CNN. To confirm this result, Smote + CNN is compared with two prior studies [48, 49], and the results are presented in Tables 9 and 10.

Table 9 Comparison between previous work [49] and our work
Table 10 Comparison between previous work [48] and our work

Table 9 compares the proposed model (Smote + CNN) to the baseline state-of-the-art models [49], in which LSTM, GRU, and an ensemble model are used as ensemble techniques. The proposed model (Smote + CNN) outperforms the state-of-the-art ensemble on the majority of the criteria, particularly the F1 measure. The model also reached the highest AUC-ROC score, demonstrating its ability to distinguish between fraudulent and normal transactions as well as its ability to work with extremely unbalanced data.

The Smote + CNN algorithm outperformed all compared models, as shown in Table 11. In Table 12, the CNN model has been built with the same framework architecture as the previous work, with the addition of the Smote resampling step. The simulation results show that the proposed Smote + CNN model outperforms the traditional CNN model.

Table 11 Results of different batch sizes of FL_SMOTE + CNN
Table 12 Comparison results between the traditional model and federated model over TensorFlow environment

6.2.3 Federated learning model with different batch sizes over several frameworks

A federated learning model has been built with different batch sizes to select the optimal batch size for the CCFD problem, as shown in Table 11; this model is then run in different environments, and the results are presented in Tables 12 and 13.

Table 13 Comparison results between the traditional model and federated model over the PyTorch environment

The boxplots in Figs. 5, 6, 7, 8 and 9 provide a graphical representation of the different classification techniques (machine learning classifiers and the CNN classifier) in combination with the different individual and hybrid resampling techniques; for each resampling technique, the performance of all of the classifiers used is presented.

Fig. 5 Boxplot of accuracy

Fig. 6 Boxplot of precision

Fig. 7 Boxplot of recall

Fig. 8 Boxplot of F1-score

Fig. 9 Boxplot of loss

Tables 12 and 13 present the performance of the traditional Smote + CNN model on the TensorFlow and PyTorch environments, respectively. The accuracy of the traditional model on TensorFlow is better than on PyTorch for all optimizers; however, it costs more computational time. Likewise, the two tables present the performance of the federated Smote + CNN model on the TensorFlow Federated and PyTorch-PySyft environments, respectively. The accuracy of the federated model on PyTorch-PySyft is better than on TensorFlow Federated for most optimizers, but it requires more computational time.

The federated learning results are obtained using 100 iterations; the Adam optimizer uses a learning rate of 0.1, SGD uses 0.1, and MSGD uses 0.1 with a momentum value of 0.2.
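Written out for PyTorch, the quoted settings look roughly as follows; the placeholder model is hypothetical and stands in for the federated CNN.

```python
# The optimizer settings quoted above: learning rate 0.1 for all, momentum 0.2 for MSGD.
import torch

model = torch.nn.Linear(30, 1)  # placeholder standing in for the federated CNN
adam = torch.optim.Adam(model.parameters(), lr=0.1)
sgd = torch.optim.SGD(model.parameters(), lr=0.1)
msgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.2)  # SGD with momentum (MSGD)
```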

Figures 10, 11 and 12 demonstrate the performance of the single CNN model with different optimization techniques. Figure 10 shows that the Adam optimizer outperforms the other optimizers, especially on the TensorFlow platform. The TensorFlow platform is faster than the PyTorch platform with all optimization techniques, as shown in Fig. 11. In Fig. 12, the Adam optimizer achieves the minimum loss value on the PyTorch platform.

Fig. 10 Accuracy of the single model across the TensorFlow and PyTorch platforms

Fig. 11 Time of the single model across the TensorFlow and PyTorch platforms

Fig. 12 Loss of the single model across the TensorFlow and PyTorch platforms

Fig. 13 Accuracy of the federated model with different optimizers

Figures 13, 14 and 15 demonstrate the performance of the federated model with different optimization techniques. Figure 13 shows that the MSGD optimizer outperforms the other optimizers. The TensorFlow Federated platform is faster than the PyTorch-PySyft platform with all optimization techniques, as shown in Fig. 14. In Fig. 15, the MSGD optimizer achieves the minimum loss value on the TensorFlow Federated platform.

Fig. 14 Time of the federated model with different optimizers

Fig. 15 Loss of the federated model with different optimizers

7 Conclusion

A federated learning approach for CCFD is presented in this research to address data privacy concerns. Additionally, hybrid resampling methods were proposed to address the imbalanced class issue and enhance classification efficacy. The experimental outcomes demonstrated the effectiveness of these resampling methods when combined with the proposed federated learning approach. Notably, the Smote resampling technique performs best with the proposed CNN model, and AdaSyn + ROS performs best with the DT model according to all performance parameters and computational time. The accuracy of the federated model on PyTorch-PySyft (93%, 92%, and 90% for the Adam, SGD, and MSGD optimizers, respectively) is better than that on TensorFlow Federated (92.15%, 91.97%, and 92.93%) for most optimizers; however, it costs more computational time. Because of the dataset’s limitations, this result should be interpreted with caution. According to the experimental data, the best accuracies for the RF, LR, KNN, DT, and Gaussian NB classifiers are 99.99%, 94.61%, 99.96%, 99.98%, and 91.47%, respectively. The comparative results reveal that RF outperforms NB, LR, DT, and KNN with high performance characteristics (accuracy, recall, precision, and F-score). With all resampling approaches, RF achieves the lowest loss values.

In future work, the performance of the proposed federated learning model will be improved by integrating more advanced optimization techniques. In addition, privacy protection of the gradients (learning parameters) will be addressed, since their exposure may enable model poisoning through the injection of malicious data. Finally, the federated model’s communication and aggregation of updates will be optimized in a secure and scalable way.