Digital transactional fraud detection (DTFD) is a key success factor in the banking sector and business environment for detecting fraud committed through digital payment modes such as debit, credit, and prepaid cards or other electronic payment methods. As digital payment systems of all kinds have grown, fraud losses are expected to increase as well. E-commerce has grown, mainly through online platforms and online shops, by making purchases easier, and as it has grown in popularity, so has the prevalence of fraud. Fraud of this kind can damage the reputation of a company, which in turn affects the wider economic sector. Fraud detection systems aim to detect online transactions that are high-risk and unlawful. Numerous systems continuously monitor real-time consumer behaviour and assign risk scores to flag potentially fraudulent transactions. E-commerce companies such as Amazon and Flipkart, digital banking systems, and compliance departments employ such solutions to continuously monitor their users for potentially fraudulent actions.

Financial institutions across the world are attempting to improve their capabilities for preventing and detecting fraud in digital transactions. Fraud detection is a proactive strategy, often a costly one, that safeguards the system from damaging activity while transactions are being conducted. Fraud prevention is a challenging task, and researchers from all over the world are suggesting new strategies to increase deterrent effects [1].

Most credit card fraud problems are addressed with data mining and machine learning techniques, which have been commonly used to counter fraud over the last few years and have been discussed by many authors [2,3,4,5,6,7]. The primary means of fraud detection is the ability to identify unwanted behaviour by observing activity in massive amounts of customer data. The machine learning and neural network methods used in fraud detection systems include KNN (K nearest neighbour), SVM (support vector machine), DT (decision trees), fuzzy logic, ANN (artificial neural networks), and many more [8]. Deep learning offers a promising answer to the problem of credit card and digital fraud detection by allowing businesses or individuals to make the best use of both previous customer behaviour and the real-time details captured at the moment of the transaction. Various deep learning algorithms, including CNN (convolutional neural network), SOM (self-organizing maps), DNN (deep neural networks), and LSTM (long short-term memory), are used to identify credit card fraud [9,10,11,12]. A few authors have worked with reinforcement learning in the fraud detection area, but they explained reinforcement learning theoretically and did not provide insight into its implementation.

This article contributes the RDQN model, which consists of two modules: feature selection and classification based on an expert reinforcement learning method that works with deep Q-learning. The primary motivation for using a feature selection algorithm is to improve the classification rate: well-chosen features tend to produce better results and reduce the computational time needed to obtain the decision value. Deep learning models yield high accuracy because they train a network by repeatedly updating its weights. The proposed DQN, however, uses agents and an ensemble of the agents’ actions in an environment to identify whether a transaction is fraudulent or not by majority voting. Instead of a single reinforcement learning agent, we adopted a multi-agent system, which enhances the overall system performance.

Feature selection: For a classification problem, collecting the data is an essential step. In this paper, Rough Set Theory (RST), a methodology that addresses ambiguity and uncertainty in data analysis, is used to identify the dataset’s key features. This yields reduced feature sets, called reducts, which are supplied to our updated DQN architecture.

Classification: The proposed renovated RDQN uses deep reinforcement learning to blend a DNN with the reinforcement learning framework (i.e., Q-learning), in which agents act to accomplish their objectives. Two activation functions are used to train the deep neural network: ReLU in the hidden layer and the newer MISH (non-monotonic) activation function. These activation functions allow the neural network to carry out complex tasks such as data classification. The network was trained by backpropagation using a mini-batch approach. For a higher classification rate, we use Q-learning to validate the results. Q-learning is an off-policy reinforcement learning method that uses the current transaction data to decide, as its course of action, whether the transaction is fraudulent. By calculating the reward function, we determine the reward for each individual transaction, and these rewards, rather than a direct fraud prediction, lead to the final conclusion, which results in better classification.

The following points are made to highlight the contributions of this study:

  • RST is used to extract the most important features from the dataset, and models are trained on the different reduced (minimal) feature sets.

  • A novel hybrid classification approach is created by combining a DNN with reinforcement learning (Q-learning).

  • Q-learning is integrated into the DNN; this DQN classifier is new to the field of digital transactional fraud detection.

  • The statistical results indicate that the DQN outperforms other classifiers.

The rest of this paper is organised as follows. The mechanisms for detecting digital transactional fraud are reviewed in the next section. The subsequent section provides an overview of the key terms relevant to deep neural networks and reinforcement learning, followed by the specifics of the proposed system. Next, the experimental frameworks are described. Findings are provided in the penultimate section. The paper concludes with findings and ideas that can be extended to further research.

Review of related works

Customers face a major problem if their digital transactions are not secure, and identifying fraud before significant damage has been done is a difficult process. Many scientists and researchers are nevertheless proposing new ideas and approaches to identify fraud in digital transactions. This survey section discusses several contributions to fraud detection and prevention. Feature selection is a major concern of the proposed model. Vijaya et al. [13] extracted vital features using RST for predicting customer churn in the telecom sector. RST is a mathematical approach for dealing with vague or imperfect information. It works on the notion that some information is linked to each element in the universe (U) of discourse. RST computes upper and lower approximations, with the vagueness captured by the boundary region. RST was recommended by Minz et al. [14] to make the search for dominating attributes in information systems easier. A rough set-based decision tree (RDT) was designed to address high computational time by combining rough set tools with traditional DT capabilities, and the RDT was more accurate than the separate decision trees. A novel rough set approximation employing probabilistic values was introduced by Zhou [15] for information tables with more than two decision classes. To create the information table, the author discussed Pawlak’s rough set theory and the decision-theoretic rough set model, which is built using a particular choice of alpha and beta threshold values, both set to 0.5. To solve classification problems, a rough set-based model with probabilistic values was employed, which results in a three-way decision-making process as an extension of the commonly used binary decision-making. For other goals, such as dimensionality reduction and feature extraction, authors have also created a large number of additional feature selection algorithms and heuristics [16, 17]. Despite these alternatives, RST performs effectively and is a good choice for feature selection.

The next module of our model is classification. The features generated by RST are fed to training models such as machine learning (ML), deep learning (DL), and reinforcement learning (RL) models. Some of the best training models created by various authors are reviewed here. Carrasco et al. [18] applied advanced fraud detection techniques based on complex rules, statistical modelling, and ML. They tested several deep neural networks, such as MLP, CNN, and DAE, to quantify their potential with respect to the false-positive rate. Machine learning and deep learning are currently the most popular technologies, and different authors have used them to solve binary or multi-class classification problems. For example, Dighe et al. [19] investigated both machine learning algorithms (LR, NB, SVM, KNN, DT) and deep learning algorithms (Chebyshev Functional Link ANN and MLP), applying them to training and testing data to identify fraud in credit card transactions. MLP produced the better result in terms of accuracy.

In the same way, Forough et al. [20] suggested a novel credit card fraud identification model. They developed models based on both deep neural networks (LSTM) and probabilistic graphical models, i.e., conditional random fields (CRFs), that use sequence labelling. These methods take into account both previous instances and previously predicted labels. The authors compared the model with HMMs, GRUs, and single LSTMs, and the LSTM-CRF model gave good results on the credit card data set. Mbunge et al. [21] combined the deep learning method MLP with the hidden Markov model (HMM) to analyse observational data. The HMM uses probability distributions to predict the likelihood of an event occurring, namely whether a transaction is fraudulent or legitimate. The probability of changing from one state to another is represented by the transition probabilities in a matrix. These probabilities are provided to a multilayer perceptron that classifies the transaction as suspicious or non-suspicious.

Deep learning techniques have been thoroughly examined by Nguyen et al. [12], who used them to solve challenging problems such as detecting credit card fraud. They considered LSTM and CNN models, the latter in one-dimensional (1DCNN) and two-dimensional (2DCNN) forms, where the internal representation is learned from the input data via the feature mapping process. On various financial data sets, they compared the performance of these models with that of different machine learning models. Pillai et al. [22] focused on training the neural network for improved performance. Their study compared the performance of the neural network while varying the number of nodes in the hidden layer and the type of ANN activation function (Sigmoid, ReLU, and Tanh). MLP with the hyperbolic tangent (Tanh) gave the best result, and they identified Tanh as the activation function that gives the minimum sensitivity value due to its operation and nature. Misra [23] conducted a thorough study of the activation function, which is important for both the training and the performance of neural networks. An activation function is what introduces non-linearity into a neural network. Many activation functions, including sigmoid, hyperbolic tangent, ReLU, Leaky ReLU, and Swish, have been used in the past. In [23], a new activation function called MISH is proposed, which provides robust performance; in extensive tests and experiments reported by Diganta Misra, MISH produced results that were preferred to those of Swish and ReLU.

Many authors have explored the deep Q-learning algorithm and exploited it in a wide variety of complex applications, such as collaborative business processes with cloud services, manufacturing assembly programs, many robotic applications, automated trading in equity stock markets, and many more [24,25,26,27,28]. Chatterjee et al. [29] discussed deep reinforcement learning for detecting malicious URLs on websites, where most phishing activities take place. By calculating the reward function, they captured dynamic behaviour on websites to identify phishing activities and learned the attributes using Q-learning. For the purpose of detecting credit card fraud, Zhinin-Vera et al. [30] addressed the Q-CCFD model, which divides transactions into valid and fraudulent ones. They combined deep learning, an autoencoder, and AI agents to build the model. With the aid of reinforcement learning, this framework treats the task as a classification problem and rewards the AI agents based on the expected variable: it assigns a “positive” reward if it determines that a transaction is authentic; otherwise, it assigns a “negative” reward. Similarly, Gopchandani et al. [32] put a Q-learning algorithm into practice for detecting credit card fraud and discussed the Q (quality) learning function. The Q-learning model learns rules from a set of actions, and if an action differs from the learned rules, the model chooses a random action. Tortolero et al. [31] discussed and explained various aspects of deep reinforcement learning. They proposed the Q-CCFD model, which employs three components for efficient fraud detection: a deep autoencoder, a mediator network, and an agent. They reported results from a training phase with 1000 episodes and predicted the class label in a testing phase. Mead et al. [32] extended their work on the reinforcement learning paradigm with the Markov decision process (MDP). In their experiments, an agent was focused on predicting the optimal set of e-transactions, for example, stealing as much money as possible by defrauding the bank. For each transaction, rewards are allocated based on the authorization activity of the cardholder. The card was charged −50 for the first transaction; +5 denotes a successful low-amount transaction, +50 denotes a successful higher-amount transaction, and 0 denotes a declined transaction. This technique has shown better results in increasing the true positive rate in credit card fraud detection systems.

Canese et al. [33] reviewed actor-critic algorithms and the taxonomy of reinforcement learning algorithms. They discussed non-stationarity, changing learning rates, and scalability as some of the drawbacks of multi-agent reinforcement learning methods. Their paper covers many techniques for agents operating in an environment and gives a thorough exposition of deep reinforcement learning algorithms, including partial observability, centralised learning of decentralised policies, COMA (counterfactual multi-agent policy gradient), which addresses the issue of multi-agent credit assignment, and agent-to-agent communication.

To increase the robustness of models, Huang et al. [34] presented work on quantitative and qualitative trading that predicts price changes in the financial market. By monitoring the learning environment, they developed a novel sampling strategy for determining which data are worth learning from. The experimental findings showed that the adaptive sampling approach outperformed the random learning approach and also increased the computational efficiency of the model. Carta et al. [35] exploited the behaviour of a deep Q-learning agent that was trained repeatedly on relevant training data from the real-world financial market sector. The experimental results for this intraday trading show that it outperforms traditional methods such as the buy-and-hold strategy. The authors used the four-tuple of state, action, probability, and reward to express the MDP, and the DQN model is suggested as a method for learning the Q-table. This demonstrates that the DQN is a useful approach for handling classification-related problems.

Existing relevant works

Rough set theory

Rough sets constitute a sound basis for KDD (knowledge discovery in databases). Classical rough sets are generally used as a framework for designing data mining and machine learning algorithms. Rough sets offer a mathematical tool for discovering hidden patterns in data and for inducing (learning) approximations of sets that would otherwise be hard to compute. Classical rough set theory is based on equivalence relations, which can be thought of as a partition of the universe, and the partition is viewed as a type of knowledge composed of definable sets. In this theory, uncertain sets are approximated by definable ones, so RST handles the uncertainty of the data. Rough sets are used for several purposes, such as feature selection (FS), feature extraction (FE), data reduction (DR), core identification (CI), generation of decision rules from data, straightforward interpretation of extracted patterns, and obtaining results. RST is therefore regarded as a predictive modelling approach. In our approach, we used the RST algorithm to reduce the number of attributes to ‘N’. Beforehand, we have to discuss the information table, or decision system table, which is constructed in tabular form from categorical data.

Information system (I.S.)

Information systems are also called “decision tables.” Knowledge is a rough collection of facts expressed as domain values of the attributes describing an object. All facts are expressed in the form of an information or data table, and every row in the data table is treated as an object. An information table is expressed as a four-tuple, as follows:

$$\begin{aligned} \mathrm{I.S.} = \langle U, A, D, f \rangle . \end{aligned}$$

In the preceding Eq. (1), I.S. denotes an information system or database, and U is a closed universe of ‘n’ instances \( \{ x_1,x_2,x_3, \ldots , x_n \} \), that is, a non-empty set of objects. The attribute or feature set is a collection of n attributes represented as \( A= \{ a_1,a_2,\ldots , a_n \} \). If \( A= S \cup Q \), then the tuple \(\textrm{DES}= (U,S \cup Q,D,f)\) is said to be a decision table or information table, where S is the set of conditional attributes and Q is the decision attribute. Every attribute has a domain of values, and these domains are collected in the set D, so that \(D=\bigcup _{a \in A}D_a\), where \(D_a\) is the set of values associated with the domain of the attribute a. The information function is written as

$$\begin{aligned} f: U \times A \rightarrow D, \end{aligned}$$

where, in Eq. (2), ‘f’ is the total decision function such that

$$\begin{aligned} {f(x, a) \in D_a \quad {\text {for every}} \quad x \in U, a \in A}, \end{aligned}$$

that is, in Eq. (3), the function must assign to each object x in the universe U a value from the domain of the respective attribute a, so that it never violates the domain values. Functions of this kind have to be defined on the attributes and the objects. The decision table defines the data types that are fully or partially reliable.


For a subset \(P \subseteq A\) of attributes in I.S., the indiscernibility relation on the universe U, denoted by IND(P), relates pairs of objects \(x_j, y_k \in U\) and is defined in the following Eq. (4).

$$\begin{aligned} \mathrm{IND(P)} = \Big \{ (x_j,y_k) \in U^2 : \forall a\in P,\; f(x_j,a)=f(y_k,a) \Big \} \quad {\text {where }}1 \le j,k \le n, \end{aligned}$$

where \(f(x_j,a)\) and \(f(y_k,a)\) are the values of attribute a for the objects \(x_j\) and \(y_k\), respectively. IND(P) is an equivalence relation. The collection of all equivalence classes of IND(P) is the quotient set of the universe U, written U/IND(P) or, in short form, U/P, and is given by U/P = \(\{[x]_P \mid x \in U\}\). If \((x_j, y_k) \in \mathrm{IND(P)} \), then \(x_j\) and \(y_k\) are said to be indiscernible (or identical) w.r.t. P, written P-indiscernible. The equivalence classes of the P-indiscernibility relation are denoted by \([x]_P\); accordingly, the elements in \([x]_P\) are indiscernible by attributes from P. The equivalence classes of IND(P) are also called P-elementary sets.
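As an illustration of the indiscernibility relation, the following minimal Python sketch groups the objects of a small decision table into the equivalence classes U/P. The table, its attribute names, and the function name `partition` are hypothetical and chosen only for illustration; they are not taken from the dataset used in this paper.

```python
from collections import defaultdict

# Hypothetical toy decision table: keys are objects of U, values map attributes to values.
TABLE = {
    "x1": {"amount": "high", "channel": "web", "fraud": "yes"},
    "x2": {"amount": "high", "channel": "web", "fraud": "yes"},
    "x3": {"amount": "low", "channel": "pos", "fraud": "no"},
    "x4": {"amount": "low", "channel": "web", "fraud": "no"},
}


def partition(table, attrs):
    """Return U/P, the equivalence classes of IND(P) for the attribute subset P."""
    classes = defaultdict(set)
    for obj, row in table.items():
        # Two objects are P-indiscernible when they agree on every attribute in P,
        # so the tuple of their values on P identifies their equivalence class.
        key = tuple(row[a] for a in attrs)
        classes[key].add(obj)
    return list(classes.values())
```

With P = {amount}, the objects x1 and x2 fall into one class and x3 and x4 into another; adding the attribute channel to P separates x3 from x4, which shows how a larger attribute subset yields a finer partition.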

Approximation of sets

Let us consider \(P \subseteq A\) as a subset of attributes and \(X \subseteq U\) as a subset of the universe. We can approximate X using only the information contained in P by constructing the P-lower and P-upper approximations of X. These approximations can be used to determine the membership status of each object x in U w.r.t. X. The rough membership function specifies the degree of overlap between the set X and \([x]_P\). Equation (5) gives the P-lower approximation of a concept X.

$$\begin{aligned} \underline{\textrm{IND}_P}(X) = \{ x \in U \mid [x]_{P} \subseteq X \}. \end{aligned}$$

Equation (6) gives the P-upper approximation of X, which contains the objects that certainly or possibly belong to the concept X.

$$\begin{aligned} \overline{\textrm{IND}_P}(X) = \{ x \in U \mid [x]_{P} \cap X \ne \emptyset \}. \end{aligned}$$

The universe U is divided into the positive, negative, and boundary regions of a set X. The boundary region (B), or doubtful region, of X is described by Eq. (7).

$$\begin{aligned} \textrm{BIND}(X) = \overline{\textrm{IND}_P}(X) - \underline{\textrm{IND}_P}(X). \end{aligned}$$

From these equations, if an object \(x_i \in \textrm{PIND}(X) \), the positive region \([\textrm{PIND}(X) = \underline{\textrm{IND}_P}(X)] \), then \(x_i\) certainly belongs to the target set X. If an object \(x_i \in \textrm{NIND}(X) \), the negative region \([\textrm{NIND}(X) = U - \overline{\textrm{IND}_P}(X)]\), then it certainly does not belong to X. Objects in the boundary region are in a state of uncertainty. If \(\textrm{BIND}(X) = \phi \), the set X is called crisp w.r.t. P; otherwise, the set X is called rough w.r.t. P. The pair \(\{\underline{\textrm{IND}_P}(X),\overline{\textrm{IND}_P}(X)\}\) is known as Pawlak’s rough set of X w.r.t. the attribute set P.
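The lower and upper approximations and the resulting regions can be sketched in a few lines of Python. This is an illustrative sketch under assumptions: the toy table, the attribute names, and the function names `partition` and `approximations` are hypothetical, and the boundary region is computed as the upper approximation minus the lower approximation.

```python
from collections import defaultdict

# Hypothetical toy decision table (not the paper's dataset).
TABLE = {
    "x1": {"amount": "high", "fraud": "yes"},
    "x2": {"amount": "high", "fraud": "no"},
    "x3": {"amount": "low", "fraud": "no"},
    "x4": {"amount": "medium", "fraud": "yes"},
}


def partition(table, attrs):
    """Equivalence classes of IND(P) for the attribute subset P."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def approximations(table, attrs, target):
    """Return the P-lower, P-upper, boundary, and negative regions of the set X."""
    lower, upper = set(), set()
    for block in partition(table, attrs):
        if block <= target:   # [x]_P is contained in X: certainly in X
            lower |= block
        if block & target:    # [x]_P overlaps X: possibly in X
            upper |= block
    return {
        "lower": lower,
        "upper": upper,
        "boundary": upper - lower,
        "negative": set(table) - upper,
    }


# Target concept X: the objects labelled as fraud.
FRAUD = {obj for obj, row in TABLE.items() if row["fraud"] == "yes"}
REGIONS = approximations(TABLE, ["amount"], FRAUD)
```

In this toy table, x1 and x2 share the same amount but carry different labels, so both fall into the boundary region and the fraud concept is rough w.r.t. P = {amount}.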


The concepts primarily used for rule discovery are reducts and core attributes. In an I.S., some attributes may be redundant for classification with respect to a given subset of attributes. Such redundant attributes can be removed without affecting the classification power of the reduced information system. A reduct (RED) is a minimal set of attributes from A that preserves the same partition of the objects in the universe U as the whole set of attributes. Removing a redundant attribute b from P leaves the indiscernibility relation, and hence the partition, unchanged:

$$\begin{aligned} \mathrm{IND(P)} = \textrm{IND}(\textrm{P}-\{b\}). \end{aligned}$$

Let \(T \subseteq P \subseteq A \) and \(b \in P\). The following definitions apply:

  • If IND(P) = IND(P − {b}), then b is redundant in P; otherwise b is indispensable in P.

  • The set P is said to be independent if all of its attributes are indispensable.

  • Given \(T \subseteq P\): if T is independent and IND(T) = IND(P), then the set T is a reduct of P.

Equation (8) states that the retained attributes preserve the indiscernibility relation and, therefore, the set approximations. Indiscernibility is essentially the situation in which objects cannot be told apart. After completing the entire process on the data, we can obtain several such attribute subsets, and the minimal subsets are called reducts. A reduct allows the same decisions to be made from fewer attributes. As a result, we try to find the reduct with the fewest attributes. Genetic algorithms are also applied to the simultaneous computation of many reducts [1, 8, 10].


The core is the set of attributes shared by all reducts of the attribute set; that is, a core attribute is common to every reduct. If removing an attribute introduces inconsistency, then that attribute belongs to the core; otherwise, it does not. The set of all indispensable features of P is referred to as the core of P. The relation between the reducts and the core is specified by the following Eq. (9), where RED(P) represents the set of all reducts of P.

$$\begin{aligned} \mathrm{CORE(P)} = \cap \mathrm{RED(P)}. \end{aligned}$$
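For small attribute sets, the reducts and the core can be found by exhaustive search. The sketch below is illustrative only: the toy rows and the function names are hypothetical, and a brute-force search over attribute subsets is swapped in for whatever reduct-generation procedure (for example, a genetic algorithm) is used in practice, since exhaustive search grows exponentially with the number of attributes.

```python
from itertools import combinations

# Hypothetical toy table in which attribute "c" duplicates attribute "b".
ROWS = {
    "x1": {"a": 0, "b": 0, "c": 0},
    "x2": {"a": 0, "b": 1, "c": 1},
    "x3": {"a": 1, "b": 0, "c": 0},
    "x4": {"a": 1, "b": 1, "c": 1},
}


def partition(rows, attrs):
    """Partition U/P as a hashable set of equivalence classes."""
    classes = {}
    for name, row in rows.items():
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(name)
    return frozenset(frozenset(c) for c in classes.values())


def reducts(rows, attrs):
    """All minimal subsets T of attrs with IND(T) = IND(attrs), by brute force."""
    full = partition(rows, attrs)
    found = []
    for size in range(len(attrs) + 1):
        for subset in combinations(attrs, size):
            candidate = set(subset)
            # Skip supersets of a known reduct: they cannot be minimal.
            if any(r <= candidate for r in found):
                continue
            if partition(rows, subset) == full:
                found.append(candidate)
    return found


def core(rows, attrs):
    """CORE(P) as the intersection of all reducts of P."""
    reds = reducts(rows, attrs)
    return set.intersection(*reds) if reds else set()
```

For these rows the reducts are {a, b} and {a, c}, because b and c carry the same information, and the core is {a}, the only attribute that appears in every reduct.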

Fig. 1 Renovated DQN architecture

Multilayer perceptron (MLP)

The MLP is a class of ANN and is among the most basic deep neural networks. It is composed of a series of fully connected layers [36]: an input layer, hidden layers, and an output layer. Here, the MLP works as the agent, and its inputs are the state and action that occur in the reinforcement environment. The inputs are multiplied by weights and added to the bias shared by all the nodes of the hidden unit, as shown in Eq. (10). The activation function is then applied at every unit of the hidden layer, and the result is passed to the next layer. This process repeats, and the weights of the model are adjusted by backpropagation to minimise the error until the desired output is obtained: 0 for a non-fraud transaction or 1 for a fraud transaction.

$$\begin{aligned} \sum _{i=1}^{N} (W_i x_i) + \textrm{bias}. \end{aligned}$$
  • ReLU activation function: ReLU is one kind of activation function. An activation function is applied to the weighted sum plus bias to decide whether a neuron should be activated or not. The main goal of ReLU is to introduce non-linearity into a neuron’s output. It takes the maximum of 0 and its input; it is not differentiable at 0, but a subgradient can be used there. The derivative of ReLU is either 0 or 1. ReLU is defined by the formula in Eq. (11).

    $$\begin{aligned} y(x)=\textrm{max}(0, x) \end{aligned}$$
  • MISH activation function: In a neural network, non-linearity is provided by the activation function, which plays an important part in the network’s training and performance. Mish, a neural activation function proposed by Misra [23], is used in this study. Like Swish, Mish is a smooth and non-monotonic activation function, and it is defined in Eq. (12).

    $$\begin{aligned} f(x) = x \cdot \tanh (\textrm{softplus}(x)) = x \cdot \tanh (\ln (1 + e^x )). \end{aligned}$$
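Equations (10) to (12) can be checked numerically with a short Python sketch. The function names and the example inputs are hypothetical; the softplus term is written in a numerically stable form, which is an implementation choice made here and not something specified in the text.

```python
import math


def relu(x):
    """ReLU, Eq. (11): max(0, x)."""
    return max(0.0, x)


def softplus(x):
    """ln(1 + e^x), rearranged so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish, Eq. (12): x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def neuron(inputs, weights, bias, activation):
    """Weighted sum plus bias, Eq. (10), followed by an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)
```

Evaluating `mish` at −5, −1, and 0 illustrates the non-monotonic shape: the value first decreases to roughly −0.30 near x = −1 and then rises back to 0, whereas `relu` is 0 for all negative inputs.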

Deep Q networks (DQN)

Reinforcement learning (RL) is a learning methodology used to achieve proficiency in optimal behaviour. This adaptive learning process is characterised as the problem of an agent performing actions by “trial and error” through interactions with an unknown environment that gives feedback in the form of numerical values, i.e., “rewards.” The agent employed here works on the basis of RL. Its network uses the ReLU activation function in the hidden layer and MISH at the output layer, and the algorithm is a novel variant of deep learning. This framework solves classification problems using RL with an agent that receives rewards depending on the state and action. If it marks a transaction as true, a positive reward is assigned; otherwise, a negative reward is assigned.

Figure 1 illustrates the terminology used in reinforcement learning (RL):

  • Agent: An agent learns the model state \(S_t\) by taking an input \(X_t\), where t denotes the time step of the transaction. An agent performs its task in three steps: (1) execute an action, (2) observe the state, and (3) receive a reward. In the proposed model, the agent’s input is the feature vector representation of the given transactional data. Through actions \(U_t\), the agent interacts with the learning framework and receives rewards \(R(t+1)\), which can be used to improve the policy function. A reward is received for each transaction processed, and the Q-table is updated accordingly. The Q-table is a reference table in the form of a matrix that stores the Q-values for state-action pairs (state S and action U). Initially, all entries are set to 0; they are updated after each episode of learning, and the agent thereby learns to execute the appropriate action for a given state. By observing the rewards, the agent is able to choose the best action, that is, to decide whether the transaction is true or false.

  • Action (U): The set of moves the agent is capable of making in the environment. The environment is updated as a result of the actions. The number of actions varied depending on the number of layers in the neural network and the number of feature vectors.

  • State (S): The state of the environment at each time stamp ’t’ is what the agent is attempting to interact with or perceive, with some modifications affecting the action that an agent handles. The input transaction data \(x_t\) determines the state ’S’ in this model.

  • Policy(\(\pi (S)\)): The policy \(\pi \) describes the mapping between states and the best action to take for that state in the current environment. That means an action referring to that state can maximize the value of a reward. The policy set is important to the RL algorithm because it specifies the best decision to make.

  • Strategy: The policy should choose an action that maximises the future reward.

  • Reward (R): A reward (R) is immediate feedback from the environment that measures the fraud or non-fraud of the agent’s action. So, this is treated as the optimum action for that concerned state.

  • Discount factor (\(\gamma \)): Balances the agent’s behaviour so that the agent makes the best decisions for both immediate and long-term benefit. The discount factor has a value between 0 and 1.

  • Probability of state transition (Prob): The conditional probability Prob(\(s_{t+1}\) | \(s_t\), \(u_t\)) of moving from state \(s_t\) to state \(s_{t+1}\) after taking action \(u_t\).

  • Episodes: The number of iterations the agent requires to find the optimal Q-values for all state-action (S, U) pairs.
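The terms above come together in the Q-table update. The following sketch uses the standard tabular Q-learning rule, Q(s, u) ← Q(s, u) + α [r + γ max Q(s′, ·) − Q(s, u)]; the text does not state the update rule explicitly, so the rule, the learning rate α, the state names, and the two action labels are assumptions made for illustration.

```python
from collections import defaultdict

# Assumed action set: label a transaction as legitimate or as fraud.
ACTIONS = ("legit", "fraud")


def make_q_table():
    """Q-table with every state-action value initialised to 0."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One standard Q-learning step; alpha and gamma are illustrative values."""
    best_next = max(q[next_state].values())   # value of the greedy next action
    td_target = reward + gamma * best_next    # discounted target
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]
```

Starting from an all-zero table, a reward of +1 for an action in state s0 moves its Q-value to 0.1, and later updates to other states then see this value through the discounted max over next-state actions.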

The proposed RDQN model for digital transaction fraud detection (DTFD)

The overall framework of the proposed architecture, named intelligent agent system for DTFD, is illustrated in Fig. 2. This figure shows the overall process of identifying fraud in digital transactions. The phases of the proposed model are explained in the subsections below. Our dataset is first processed with rough set theory (RST). RST derives ‘M’ reducts RD1, RD2,...,RD\(_M\) from the dataset, each a minimal set of attributes that determines the target class based on the indiscernibility of the class label. A deep reinforcement learning network, i.e. DQN, is an intelligent system divided into ‘N’ agents, DQN1, DQN2,..., DQN\(_N\).

Fig. 2 Intelligent agent system for DTFD

The renovated DQN architecture is then applied to each reduced set (RDQN), which improves accuracy. Every DQN produces an output with a predicted label indicating whether or not fraud exists. The results of RD1 \(\times \) DQN1, RD2 \(\times \) DQN2 ...RD\(_M\) \(\times \) DQN\(_N\) are then combined with a weighted average or with majority voting. Here, each DQN is built on an MLP neural network to train the agent to predict fraud cases in the digital transactional reinforcement environment.
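The combination step can be sketched as follows, assuming each agent outputs a label of 1 for fraud and 0 for non-fraud. The function names, the tie-breaking rule (ties are resolved as fraud), and the 0.5 threshold for the weighted average are assumptions for illustration; the text does not specify them.

```python
def majority_vote(labels, tie_label=1):
    """Combine 0/1 agent predictions by majority; ties go to tie_label (assumed)."""
    fraud_votes = sum(labels)
    legit_votes = len(labels) - fraud_votes
    if fraud_votes == legit_votes:
        return tie_label
    return 1 if fraud_votes > legit_votes else 0


def weighted_vote(labels, weights, threshold=0.5):
    """Combine 0/1 predictions by a weighted average against an assumed threshold."""
    score = sum(w * l for w, l in zip(weights, labels)) / sum(weights)
    return 1 if score >= threshold else 0
```

For example, three agents predicting 1, 0, and 1 yield a fraud decision under majority voting, while the weighted variant lets a more reliable agent (for instance, one weighted by its validation accuracy) outweigh the others.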

Dataset pre-processing

To assess the performance of the model, we used a real-world dataset and conducted several experiments. The credit card fraud data was accessed from the UCI machine learning repository. The dataset is highly imbalanced, so a classifier trained on it directly tends to classify the minority (fraud) class poorly. We therefore balance the data with the oversampling technique SMOTE. Developing a model can require processing several types of data, such as numerical and categorical data. The proposed model requires categorical data, which is fed as input to the RST. RST in turn improves the efficiency of the model, because handling a large number of variables is difficult and it is preferable to reduce the number of variables.
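The oversampling idea behind SMOTE can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function name `smote`, the toy minority samples, and the parameters `n_new`, `k`, and `seed` are assumptions made for illustration only. Each synthetic sample is placed on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority (fraud) class with two hypothetical features
fraud = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(fraud, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by the fraud class.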

Renovated DQN architecture

Our proposed architecture introduces a renovated DQN. It combines a deep neural network, a multi-layer perceptron, with Q-learning from reinforcement learning. The renovated DQN uses different activation functions in different layers: ReLU in the hidden layers and Mish in the output layer, both to speed up training and to obtain accurate results in identifying fraud cases.
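For reference, the two activation functions can be written in a few lines. This is a plain-Python sketch of the standard definitions, \(\text {ReLU}(x) = \max (0,x)\) and \(\text {Mish}(x) = x \cdot \tanh (\ln (1+e^{x}))\); the helper name `softplus` and the numerically stable form used for it are illustrative choices, not taken from the paper.

```python
import math


def relu(x):
    """ReLU: passes positive inputs through and clips negatives to zero."""
    return max(0.0, x)


def softplus(x):
    """ln(1 + e^x), written so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish: x * tanh(softplus(x)), a smooth, non-monotonic activation."""
    return x * math.tanh(softplus(x))
```

Unlike ReLU, Mish is smooth everywhere and lets small negative values through, for example `mish(-1.0)` is slightly below zero rather than exactly zero.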

DQN as classifier

In this paper, we present a novel renovated deep Q-network model, together with a custom OpenAI Gym environment for deep RL agents, an experience replay memory, and an approximate value function. The DQN agent employs an epsilon-greedy policy to perform classification actions on batches of input. The environment receives the agent’s action and computes the corresponding reward. The agent’s memory stores each such experience. At the end of each batch, the DQN agent samples a minibatch from its experience replay buffer, updates the Q-values with the Q-network, evaluates the loss, and performs back-propagation to revise the weights. Our updated model successfully classified fraudulent and non-fraudulent digital transactions, achieving state-of-the-art performance.
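The interaction loop described above can be sketched as follows. This is an illustrative stand-in, not the authors' environment: it mimics the `reset`/`step` pattern of a Gym environment without importing the library, the class names `FraudEnv` and `ReplayBuffer` are hypothetical, and the reward of +1 for a correct label and -1 for an incorrect one is an assumption, since the reward values are not stated here.

```python
import random
from collections import deque


class FraudEnv:
    """Toy classification environment: one transaction is presented per step."""

    def __init__(self, features, labels):
        self.features, self.labels = features, labels
        self.idx = 0

    def reset(self):
        self.idx = 0
        return self.features[0]

    def step(self, action):
        # action: 1 = flag as fraud, 0 = accept as genuine
        reward = 1.0 if action == self.labels[self.idx] else -1.0  # assumed reward scheme
        self.idx += 1
        done = self.idx >= len(self.features)
        next_state = None if done else self.features[self.idx]
        return next_state, reward, done


class ReplayBuffer:
    """Fixed-capacity experience memory; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.memory), min(batch_size, len(self.memory)))


env = FraudEnv(features=[[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]], labels=[0, 1, 0])
buffer = ReplayBuffer(capacity=100)
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = 0  # placeholder policy: always predict "genuine"
    next_state, reward, done = env.step(action)
    buffer.store(state, action, reward, next_state, done)
    state, total_reward = next_state, total_reward + reward
```

In a full agent the placeholder policy would be replaced by an epsilon-greedy choice over the Q-network outputs, and minibatches drawn with `buffer.sample` would drive the weight updates.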

The classic MDP model describes single-agent RL. A finite MDP is defined as a tuple \(\langle X, U, f, \rho \rangle \), where X is a finite set of environment states, U is a finite set of agent actions, \(f: X \times U \times X \rightarrow [0,1]\) is the state transition probability function, and \(\rho : X \times U \times X \rightarrow R\) is the reward function [37].

At each discrete time step t, the state \(x_t \in X\) describes the environment. The agent observes the state and takes an action \(u_t \in U\). As a result, the banking environment moves to a new state \(x_{t+1} \in X\) according to the transition probabilities given by f: the probability of ending up in \(x_{t+1}\) after \(u_t\) is executed in \(x_t\) is \(f(x_t,u_t, x_{t+1})\). The agent then receives a scalar reward \(r_{t+1} \in R\) given by the reward function, i.e., \(r_{t+1} = \rho (x_t,u_t, x_{t+1})\). The reward evaluates the immediate effect of action \(u_t\), i.e., the transition from \(x_t\) to \(x_{t+1}\); it says nothing about the long-term consequences of that action. The reward function is assumed to be bounded.

In a deterministic environment, the next state is fully determined by the present state and the action carried out by the agent, so the transition probability function f is replaced by a transition function \(f: X \times U \rightarrow X\), and the reward depends only on the current state and action: \(r_{t+1} = \rho (x_t,u_t)\), with \(\rho : X \times U \rightarrow R\). The agent observes rewards in each episode accordingly. Each episode in which actions are taken has its own learning process, and the learning process does not operate on a single feature vector. The MDP also has terminal states, from which rewards can be received.

The policy defines the agent’s behaviour, i.e., how the agent chooses its actions given the present state. A policy may be either deterministic or stochastic: a stochastic policy is a mapping \(h: X \times U \rightarrow [0,1]\) and a deterministic policy is a mapping \(h: X \rightarrow U\). A policy that does not change over time is called stationary. The agent’s goal is to find a policy (written \(\pi \) in the equations below) that maximizes, for each state x, the expected discounted return in Eq. (13).

$$\begin{aligned} R^\pi (x)= E \left\{ \sum _{t=0}^\infty \gamma ^t r_{t+1} \mid x_0 = x, \pi \right\} , \end{aligned}$$

where \(\gamma \) is the discount factor, with \(0< \gamma <1\) so that the infinite sum stays finite for bounded rewards, and the expectation is taken over the probabilistic state transitions under the policy \(\pi \). The discount factor can be interpreted as encoding the increasing uncertainty about rewards that will be received in the future. The return R expresses the long-term benefit gained by the agent. Other ways of defining the return are discussed in [43].

The optimal Q-function is defined as \(Q^*(x,u) = \textrm{max}_h Q^h (x,u)\). It satisfies the Bellman optimality equation, Eq. (14).

$$\begin{aligned} Q^*(x,u) = \sum _{x'} f(x,u,x') \left[ \rho (x,u,x') + \gamma \max _{u'} Q^*(x',u') \right] , \quad \forall x \in X,\; u \in U. \end{aligned}$$

Equation (14) states that the optimal value of taking action u in state x is the expected immediate reward plus the discounted optimal value attainable from the next state. Once \(Q^*\) is available, an optimal policy, i.e., one that maximizes the return, can be computed by choosing in every state an action with the largest optimal Q-value. This optimal policy is expressed in Eq. (15) as follows:

$$\begin{aligned} \pi ^*(x) = \textrm{argmax}_{u} Q^*(x,u). \end{aligned}$$

When several actions attain the highest Q-value, any of them can be chosen and the policy remains optimal. In that case, the argmax operator is interpreted as returning just one of the equally good actions. As shown in Eq. (15), a policy that maximises a Q-function in this way is said to be greedy in that Q-function. An optimal policy can therefore be found in two steps: first compute \(Q^*\), then take a greedy policy in \(Q^*\).

Q-learning is an iterative approximation procedure. It starts from an arbitrary Q-function, observes transitions (\(x_t\), \(u_t\), \(x_{t+1}\), \(r_{t+1}\)), and updates the Q-function after each observed transition:

$$\begin{aligned} Q_{t+1}(x_t,u_t) = Q_{t}(x_t,u_t) + \alpha _t \left[ r_{t+1} + \gamma \max _{u'} Q_t(x_{t+1},u') - Q_t(x_t,u_t) \right] . \end{aligned}$$

Equation (16) adjusts the current estimate by the temporal difference, i.e., the difference between the updated estimate \(r_{t+1} + \gamma \max _{u'} Q_t (x_{t+1},u')\) and the current estimate \(Q_t(x_t,u_t)\) of the optimal Q-value of \((x_t,u_t)\). The updated estimate is a sample of the right-hand side of the Bellman Eq. (14), applied to \(Q_t\) at the state-action pair \((x_t, u_t)\). The learning rate \(\alpha _t \in (0, 1]\) can vary over time and usually decreases.
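A tabular sketch of the Q-learning update in Eq. (16) is given below. The state names, reward, and the values `alpha=0.1` and `gamma=0.9` are hypothetical numbers chosen for illustration; the proposed model replaces the table with a neural network.

```python
from collections import defaultdict


def q_update(Q, x, u, r, x_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step to the table Q and return the temporal difference."""
    best_next = max(Q[(x_next, a)] for a in actions)   # max over u' of Q_t(x_{t+1}, u')
    td_error = r + gamma * best_next - Q[(x, u)]       # updated estimate minus current estimate
    Q[(x, u)] += alpha * td_error
    return td_error


Q = defaultdict(float)          # unseen state-action pairs start at 0
actions = (0, 1)                # 0 = genuine, 1 = fraud
Q[("s1", 1)] = 2.0              # pretend the next state already has a learned value
td = q_update(Q, "s0", 1, r=1.0, x_next="s1", actions=actions)
```

Here the temporal difference is \(1.0 + 0.9 \cdot 2.0 - 0 = 2.8\), so the entry for ("s0", 1) moves from 0 to 0.28.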

Q-learning requires that the agent keeps trying all actions in all states with nonzero probability; this requirement is called exploration. Typically, the exploration probability decreases over time. For instance, at each step the agent selects a random action with probability \(\epsilon \in (0,1)\) and the greedy action with probability \((1 - \epsilon )\). This scheme is known as \(\epsilon \)-greedy exploration.
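The \(\epsilon \)-greedy rule can be sketched in a few lines. The exponential decay schedule and its constants (`start`, `end`, `decay`) are assumptions added for illustration; the text only states that the exploration probability typically decreases over time.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, start=1.0, end=0.05, decay=0.99):
    """Hypothetical schedule: epsilon shrinks geometrically but never below `end`."""
    return max(end, start * decay ** step)


greedy_action = epsilon_greedy([0.2, 0.7], epsilon=0.0)   # always the argmax
```

With `epsilon=0.0` the agent is purely greedy, and with `epsilon=1.0` it acts purely at random; intermediate values trade exploration against exploitation.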

Proposed algorithm

Algorithm for Calculating Reducts

1. Let \(RD_i\) be the set of all reducts

2. Let C be the set of all attributes

3. Let D be the decision attribute

4. Initialize \((b_1,b_2,\ldots ,b_N) = \tau (a_1,a_2,\ldots ,a_N)\),

5. where \(\tau \) is a permutation

6. \(RD_i = \{ \}\)

7. for i = 1 to N

8. if \(D_i(C_i) \ne D_i(b_i)\) then

9. \(RD_i \leftarrow b_i\)

10. else

11. \(RD_i \leftarrow RD_i - b_i\)

12. end for

13. End
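One common way to realise a permutation-based reduct search is greedy attribute elimination, sketched below. This is an assumed reading of the listing above, not a transcription of it: starting from all attributes, each attribute is visited in the permuted order and dropped if the remaining attributes still determine the decision for every pair of indiscernible objects. The function names and the toy decision table are hypothetical.

```python
def is_consistent(rows, attrs, decision):
    """True if objects that are indiscernible on attrs always share the same decision."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, row[decision]) != row[decision]:
            return False
    return True


def greedy_reduct(rows, attrs, decision, order=None):
    """Drop attributes one at a time, in the given order, while consistency is preserved."""
    order = list(order or attrs)
    reduct = list(attrs)
    for b in order:
        trial = [a for a in reduct if a != b]
        if is_consistent(rows, trial, decision):
            reduct = trial          # b is dispensable given the remaining attributes
    return reduct


# Toy decision table: 'c' duplicates 'a', and 'b' carries no information about 'd'
rows = [
    {"a": 0, "b": 0, "c": 0, "d": 0},
    {"a": 0, "b": 1, "c": 0, "d": 0},
    {"a": 1, "b": 0, "c": 1, "d": 1},
    {"a": 1, "b": 1, "c": 1, "d": 1},
]
reduct_1 = greedy_reduct(rows, ["a", "b", "c"], "d")
reduct_2 = greedy_reduct(rows, ["a", "b", "c"], "d", order=["c", "b", "a"])
```

Different permutations can yield different reducts (here {c} and {a}), which is consistent with the idea of generating several reducts and training one agent per reduct.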

Renovated DQN Algorithm

1. Initialize replay memory D to capacity M

2. for each reduct \(RD_i\), i = 1 to N

3. Initialize the action-value function Q with the \(RD_i\) features and random weights

4. for episode = 1 to M do

5. Initialize sequence \(S_i=\{ \}\)

6. for t = 1 to T do

7. With probability \(\epsilon \) select a random action \(a_t\), otherwise select the greedy action:

8. \(\pi (a|s) = \begin{cases} \epsilon , & \text {random action} \\ (1-\epsilon ), & a^*=\text {argmax}_{a \in A}\, Q(s,a) \end{cases}\)

9. Execute action \(a_t\) in the emulator

10. Observe reward \(r_t\) and next observation \(x_{t+1}\)

11. Set \(s_{t+1} = (s_t , a_t, x_{t+1})\) and preprocess \(\phi _{t+1}= \phi (s_{t+1})\)

12. Store transition \((\phi _t,a_t,r_t,\phi _{t+1})\) in D

13. Sample a random minibatch of transitions \((\phi _j,a_j,r_j,\phi _{j+1})\) from D

14. Activation functions: \(\text {ReLU:} \quad y(x) = \text {max}(0,x)\); \(\text {Mish:} \quad f(x) = x \cdot \tanh (\text {softplus}(x)) = x \cdot \tanh (\ln (1 + e^{x}))\)

15. Set \(y_j = \begin{cases} 1, & \text {if the transaction is fraud} \\ 0, & \text {non-fraud} \end{cases}\)

16. Perform a gradient descent step on the mean squared error \((y_j - Q(\phi _j, a_j; \theta ))^2\)

17. End for \(RD_i\)

18. Ensemble of \(DQN_1,DQN_2\ldots DQN_N\)

19. The final prediction for each input is based on the weighted average over the \(DQN_i\) agents.
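The final ensembling step can be sketched as a weighted vote over the per-agent labels. The function name `weighted_vote`, the 0.5 threshold, and the example predictions and weights are made-up values for illustration; how the weights are actually derived is not specified here.

```python
def weighted_vote(predictions, weights, threshold=0.5):
    """Combine per-agent 0/1 predictions into one label per sample.

    predictions: one list of 0/1 labels per agent, all of equal length
    weights:     one non-negative weight per agent
    """
    total = sum(weights)
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        score = sum(w * p[i] for w, p in zip(weights, predictions)) / total
        combined.append(1 if score >= threshold else 0)
    return combined


agent_preds = [[1, 0, 0, 1],    # agent trained on one reduct
               [1, 1, 0, 0],    # agent trained on another reduct
               [0, 0, 0, 1]]    # and so on
agent_weights = [0.5, 0.3, 0.2]                  # illustrative weights only
labels = weighted_vote(agent_preds, agent_weights)
majority = weighted_vote(agent_preds, [1, 1, 1])  # equal weights give majority voting
```

Setting all weights equal reduces the weighted average to plain majority voting, the alternative combination rule mentioned earlier in the paper.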

Experimental setup

The experiments are organised in three frame-ups, and the overall performance of the proposed model is determined by all three. Frame-up 1 compares the performance of different classifiers on the European dataset with all features included. In Frame-up 2, several filter- and wrapper-based feature selection algorithms are applied to the base classifiers; among all feature selection techniques, RST provides the essential features. Finally, Frame-up 3 combines RST with the DQN algorithm (i.e., RDQN): all reducts produced by RST are tested with DQN, the average weighted accuracy is calculated, and this accuracy is compared with that of other existing fraud detection techniques.

Frame-up: 1

Many machine learning algorithms, such as LR, SVM, NB, and ANN, are used for classification. The dataset is divided into training and test data. All models are trained on the training data and then predict the target variable, i.e., whether a transaction is fraudulent or not, on the test data. The base classifiers considered for comparison with the proposed model are described here.

Logistic regression (LR) This is a basic and widely used classification technique that models the relationship between a dependent variable and independent variables. It is well suited to cases where the dependent variable is categorical or binary and the predictors are continuous [38, 39]. LR uses a nonlinear sigmoid function to find the best-fit parameters. The sigmoid function, with output Y and input z, is shown in Eq. (17).

$$\begin{aligned} Y = 1 / (1+e^{-z}). \end{aligned}$$

Naïve Bayes This is a Bayesian statistical strategy that makes the decision with the highest probability. Bayesian probability estimates unknown probabilities from known values, and it also allows prior knowledge and logic to be applied to uncertain statements. The technique assumes that the features in the data are conditionally independent [39]. The NB classifier works with the conditional probabilities, as shown in Eq. (18), of the target classes, i.e., fraud and non-fraud.

$$\begin{aligned} P(A|B) = \frac{(P(B|A) *P(A))}{P(B)}. \end{aligned}$$

KNN KNN is one of the most popular supervised classification algorithms. The KNN classifier searches the pattern space for the nearest neighbours of an unknown sample and assigns the sample to the class of the instances closest to it [39]. To measure the closeness between neighbouring instances, the algorithm uses the Euclidean distance function in Eq. (19).

$$\begin{aligned} \sqrt{\sum _{i=1}^k (x_i - y_i)^2}. \end{aligned}$$
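A minimal sketch of KNN classification with the Euclidean distance of Eq. (19) follows. The training points, labels, and the choice `k=3` are toy values for illustration and are unrelated to the experimental dataset.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Label the query with the majority class among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


train_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]]
train_y = [0, 0, 0, 1, 1]       # 0 = genuine, 1 = fraud
```

For example, `knn_predict(train_X, train_y, [3.0, 3.1])` finds two fraud points and one genuine point among the three nearest neighbours and therefore returns 1.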

SVM This is one of the strongest methods for classification tasks. Generally, SVM works in a high-dimensional feature space. Two factors make SVMs strong: the margin and the kernel function; the separating hyperplane is given in Eq. (20) [38]. With a kernel function, SVMs map the data into a high-dimensional feature space and learn the classification model there without added computational difficulty.

$$\begin{aligned} W^Tx + b = 0. \end{aligned}$$

DT Another extensively used classification method is the DT. A tree has internal nodes that represent tests on an attribute; each branch represents an outcome of that test, and each terminal node holds a target label [39]. The dataset is usually partitioned using either a breadth-first greedy (BFG) or a depth-first greedy (DFG) strategy, and the process stops once all instances have been assigned to a class label. The best split is the one that keeps the resulting subgroups from overlapping, i.e., one that keeps them as distinct as possible. The information gain used to choose the split is given in Eq. (21).

$$\begin{aligned} \textrm{Gain}(\textrm{IS},A)= & {} \mathrm{Entropy(IS)}-\sum _{v \in \textrm{values}(A)}\frac{|\textrm{IS}_v|}{|\textrm{IS}|} \nonumber \\{} & {} \textrm{Entropy}(\textrm{IS}_v). \end{aligned}$$
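The information gain of Eq. (21) can be computed directly from label counts. The sketch below uses base-2 entropy and toy attribute values; the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on `values`."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        gain -= len(subset) / n * entropy(subset)
    return gain


perfect = information_gain(["a", "a", "b", "b"], [0, 0, 1, 1])   # split separates the classes
useless = information_gain(["a", "b", "a", "b"], [0, 0, 1, 1])   # split tells nothing
```

An attribute that separates the classes perfectly recovers the full label entropy (1 bit here), while an attribute unrelated to the class yields a gain of zero.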

Frame-up: 2

Feature selection is one of the necessary steps in data pre-processing. It improves the model’s performance and also reduces its computational cost. Hence, the proposed model takes feature selection into consideration and applies several filter- and wrapper-based feature selection techniques.

Chi-square feature selection (CSFS) This is one of the simplest methods for selecting the features that influence the target variable in supervised learning. It focuses on the categorical features in a dataset. The chi-square score is calculated between each feature and the target and assesses whether the association between two categorical variables in the sample reflects a true association in the population; the features with the best scores are selected. The chi-squared score is given by Eq. (22).

$$\begin{aligned} \chi ^2= \sum \frac{(O_\textrm{frq}-E_\textrm{frq})^2}{E_\textrm{frq}}, \end{aligned}$$

where \(O_\textrm{frq}\) is the number of observations of the class and \(E_\textrm{frq}\) is the number of expected observations of the class.
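As a worked illustration, the standard chi-square statistic sums the squared difference between observed and expected counts, divided by the expected count, over every cell of the feature-by-class contingency table. The sketch below derives the expected counts from the row and column totals; the toy feature values are hypothetical.

```python
from collections import Counter


def chi_square(feature, target):
    """Chi-square score between a categorical feature and a categorical target."""
    n = len(target)
    observed = Counter(zip(feature, target))          # O_frq for each (feature value, class) cell
    f_counts, t_counts = Counter(feature), Counter(target)
    score = 0.0
    for f, fc in f_counts.items():
        for t, tc in t_counts.items():
            expected = fc * tc / n                    # E_frq under independence
            score += (observed[(f, t)] - expected) ** 2 / expected
    return score


dependent = chi_square(["hi", "hi", "lo", "lo"], [1, 1, 0, 0])
independent = chi_square(["hi", "lo", "hi", "lo"], [1, 1, 0, 0])
```

A feature that tracks the class exactly gets a high score (4.0 in this four-sample table), while a feature that is independent of the class scores 0, so ranking features by this score keeps the informative ones.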

Table 1 Important features from RST

Pearson correlation FS (PCFS) This is a univariate method that measures the linear correlation between two random variables, such as a particular feature and the target variable, and keeps the features with the higher correlation. The correlation is computed using Eq. (23).

$$\begin{aligned} r= \frac{\sum _{i=1}^n(C_i-{\bar{C}})(G_i-{\bar{G}})}{\sqrt{\sum _{i=1}^n(C_i-{\bar{C}})^2\sum _{i=1}^n(G_i-{\bar{G}})^2}}. \end{aligned}$$

\(C_i\) and \(G_i\) are the values of the two random variables for the i-th training example, and \({\bar{C}}\) and \({\bar{G}}\) are their respective mean values. The correlation value lies between + 1 and − 1. A value of + 1 means perfect positive correlation and − 1 means perfect negative correlation, so values close to + 1 or − 1 indicate highly correlated variables, whereas a value close to zero indicates no linear correlation. Hence, features that have a high positive or negative correlation with the target are selected.
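Equation (23) translates directly into code. The following sketch uses toy vectors to show the two extreme cases; the variable names mirror the symbols C and G in the equation.

```python
import math


def pearson(c, g):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(c)
    c_bar, g_bar = sum(c) / n, sum(g) / n
    cov = sum((ci - c_bar) * (gi - g_bar) for ci, gi in zip(c, g))
    var_c = sum((ci - c_bar) ** 2 for ci in c)
    var_g = sum((gi - g_bar) ** 2 for gi in g)
    return cov / math.sqrt(var_c * var_g)


r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing relation
r_neg = pearson([1, 2, 3, 4], [8, 6, 4, 2])    # perfectly decreasing relation
```

Feature selection would then keep the features whose absolute correlation with the target is largest.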

Selection-based feature selection (SBFS) For this model, forward selection is used: features are added sequentially to an initially empty set until adding a further feature no longer improves the criterion. At every step, the selector chooses the feature to add or remove based on the cross-validation score of an estimator. In the case of unsupervised learning, this sequential feature selector considers only the features (X), not the target outputs (Y).

Recursive feature elimination (RFE) RFE is a greedy algorithm based on feature ranking. It starts with the entire feature set and removes the least suitable features one by one, keeping the top-ranked features. That is, the estimator is first trained on the complete feature set, and the importance of each individual feature is obtained through a specific attribute or callable of the estimator.

Rough set theory-based feature selection The major task performed by rough set theory is feature extraction, and it handles the uncertainty of the data. The RST algorithm is used to reduce the set of attributes to a minimal set of ‘N’ attributes. Here we calculate the indiscernibility relation specified in Eq. (6) based on the information table. RST generates many reducts (RDs) with different minimal attribute combinations; of these, the sets listed in Table 1 are taken into consideration. For example, five reducts are chosen at random, each containing between 9 and 13 attributes: RD1 has 13 attributes, RD2 and RD4 have ten each, RD5 has twelve, and RD3 has nine.

Frame-up: 3

In this frame-up, we introduce the renovated DQN algorithm, in which an agent operates in a bank environment. The RST-derived reducts (RD1, RD2,..., RD5) are fed to DQN to form RDQN1, RDQN2,..., RDQN5. To assess the performance of all RDQNs, we calculated a weighted average and compared the renovated DQN algorithm with other state-of-the-art methods. The existing hybrid fraud detection techniques used for comparison are IFDTC4.5, SAE-GAN, CNN-SVM-KNN, and DEAL, which are described below.

IFDTC4.5 Askari et al. [40] built a decision tree using intuitionistic fuzzy logic and DT (C4.5) to detect digital fraud. The fuzzy logic takes into account the cognitive characteristics of the variables, so that normal transactions are not flagged as fraudulent and fraudulent transactions are not treated as normal. The tree generation mechanism constructs decision trees from the transactional dataset using the information gain ratio. The tree is built from the training data and then tuned several times until it classifies accurately.

SAE-GAN A sparse auto-encoder (SAE) combined with generative adversarial networks (GAN) is reported to outperform other methods for detecting fraud. The SAE helps to increase the number of features and to obtain the key features [41]. The GAN is trained by optimising the generator G, the discriminator D, and the loss function parameters. After training, the discriminator is able to separate fraudulent or abnormal transactions from genuine ones. During the testing phase, the trained SAE, which acts as the discriminator in the GAN, is used to predict the class of a new digital transaction.

CNN-SVM-KNN Raghavan et al. [42] investigated various ML and DL models on several datasets to identify abnormal transactions. The primary goal of their paper is to build intuition about which models work best for which types of datasets. The study finds that SVM is the best model for fraud detection on larger datasets, possibly combined with CNNs for more reliable performance, while SVM, RF, and KNN perform better on smaller datasets. CNN usually outperforms other DL methods such as AE, restricted Boltzmann machines (RBM), and deep belief networks (DBN). The results on the different datasets are summarised with MCC and AUC values.

Table 2 Performance based on base classifiers

DEAL Arya et al. [43] proposed DEAL, a deep ensemble learning framework for predicting fraudulent transactions in real-time credit card data streams. Each transaction is represented as a tensor to reveal the latent and inherent relations between spending patterns and fraudulent transactions. By combining the extra-tree ensemble method with deep learning (DNN) in feature space, the effect of highly imbalanced classes is reduced, yielding better prediction accuracy for the fraud class. The framework is evaluated with the following metrics for CCFD frameworks: categorical accuracy, training and prediction accuracy, log loss, false positives (FP), and fraud catching rate (FCR).

Experimental results

The proposed RDQN model improves the accuracy of digital transaction fraud detection. The experiments were carried out in three frames. First, feature selection techniques were applied to the dataset, and the best technique, namely RST, was chosen on the basis of its demonstrated performance. RST generated the minimal feature subsets RD1, RD2,..., RD5. Second, the DQN algorithm was applied to each reduced dataset, and the weighted average of the accuracies was calculated. Lastly, we compared DQN’s performance in classifying fraud and non-fraud transactions with other existing methodologies. The implementation used a 10th generation Intel Core i5 processor and a hybrid drive system with an SSD (256 GB) and an HDD (1 TB). The proposed model was implemented in Python 3.8 using Scikit-Learn and Keras, together with MATLAB (R2020b). In the results, it outperformed all the compared state-of-the-art methods in detecting fraudulent transactions.

Dataset description

The dataset is available in the Kaggle repository [44]. Although the risk of digital transactional fraud is growing, data of this kind is difficult to obtain. The major challenge is the highly imbalanced distribution between the fraud and non-fraud target classes across a very large number of records. Banks are often exposed to fraudulent transactions and constantly improve their systems to track them. The European fraud dataset contains a total of 284,807 transaction samples. Of these, 492 were identified as fraudulent transactions, and the remaining are treated as genuine transactions. The dataset has 31 attributes in total: 30 input attributes and one attribute for the target class. Only three attributes have meaningful names: Time, Amount, and Class. The remaining attributes (V1–V28) are anonymised and carry no descriptive names. It is a purely numerical dataset, collected from European cardholders in 2013, and protected in the sense that the attribute names are anonymous. The dataset is 100% complete and does not contain any missing values.

Evaluation metrics

The confusion matrix is an important tool for evaluating a classification algorithm: it summarises what the classification model predicts correctly and what types of errors it makes. \(T_\textrm{POS}\) is the number of samples that are positive (fraud) in both the actual and the predicted labels, and \(T_\textrm{NEG}\) is the number of samples that are negative in both. \(F_\textrm{NEG}\) and \(F_\textrm{POS}\) are the numbers of misclassified samples: actual positives predicted as negative and actual negatives predicted as positive, respectively. The performance metrics accuracy, true fraud rate, false fraud rate, specificity, and precision are expressed in Eqs. (24)–(28).

$$\begin{aligned} \textrm{Accuracy}= & {} (T_\textrm{POS}+T_\textrm{NEG})/(T_\textrm{POS}\nonumber \\{} & {} +F_\textrm{NEG}+F_\textrm{POS}+T_\textrm{NEG}) \end{aligned}$$
$$\begin{aligned} \mathrm{True \; fraud}= & {} T_\textrm{POS}/ (T_\textrm{POS}+ F_\textrm{NEG}) \end{aligned}$$
$$\begin{aligned} \mathrm{False \;fraud}= & {} F_\textrm{POS}/ (F_\textrm{POS}+ T_\textrm{NEG}) \end{aligned}$$
$$\begin{aligned} \textrm{Specificity}= & {} T_\textrm{NEG}/ (F_\textrm{POS}+ T_\textrm{NEG}) \end{aligned}$$
$$\begin{aligned} \textrm{Precision}= & {} T_\textrm{POS}/ (T_\textrm{POS} + F_\textrm{POS}) \end{aligned}$$
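Equations (24)–(28) can be evaluated from the four confusion-matrix counts as follows. The counts passed in the example are made-up numbers for illustration and are not results from the experiments.

```python
def fraud_metrics(t_pos, t_neg, f_pos, f_neg):
    """Compute the metrics of Eqs. (24)-(28) from confusion-matrix counts."""
    return {
        "accuracy": (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg),
        "true_fraud": t_pos / (t_pos + f_neg),      # also called recall or true-positive rate
        "false_fraud": f_pos / (f_pos + t_neg),     # false-positive rate
        "specificity": t_neg / (f_pos + t_neg),
        "precision": t_pos / (t_pos + f_pos),
    }


# Hypothetical counts: 1000 transactions, 90 of them actually fraudulent
m = fraud_metrics(t_pos=80, t_neg=900, f_pos=10, f_neg=10)
```

Note that the false fraud rate and the specificity always sum to 1, because both are computed over the same set of actual negatives.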

Results for frame-up: 1

Table 2 reports the performance metrics derived from the confusion matrix shown in Fig. 3. These metrics are used to compare the different base classifiers. The dataset used for validation is described in “Dataset description”. It is divided into 70% for training and 30% for testing. The training portion has 199,365 \(\times \) 31 entries; of these transactions, 199,125 are non-fraud and 240 are fraud. After training and testing, the ANN model outperforms all other specified models in terms of accuracy, followed by SVM.

Fig. 3 Confusion matrix for DTFD

Table 2 also compares all the metrics of the specified models. The ANN model performed well, yielding the best results in terms of accuracy (91.5%), F score (96.2%), and true fraud rate. Figure 4 shows a bar chart of the performance comparison of the classifiers listed in Table 2.

Fig. 4 Performance metrics of base classifiers

Results for frame-up: 2

In this frame-up, different feature selection algorithms were examined with the specified traditional classifiers; the results are given in Tables 3, 4, 5, 6 and 7. The features selected by these algorithms improve a classifier more than using all the features. Several filter- and wrapper-based feature selection methods were used, as discussed in “Frame-up: 2”.

Table 3 Roughset-based performance on base classifiers
Table 4 Chi square feature based performance on base classifiers
Table 5 Pearson correlation feature based performance on base classifiers
Table 6 Selection-based performance on base classifiers
Table 7 Recursive feature elimination performance on base classifiers

The main objective of RST is to extract important features, which play a vital role in classification and prediction. For this purpose, the Rough Set Exploration System (RSES) toolbox 2.2.2 was used. RST uses an information table to calculate the indiscernibility relation specified in Eq. (4), and from it obtains all the reducts, or reduced attribute sets, which are useful for better classification.

The performances were compared using accuracy, true-positive rate, false-positive rate, and F1 score. In general, RFE+NB, RFE+ANN, RSBFS+SVM, and RSBFS+ANN gave the best accuracy. Of RFE and RSBFS, RSBFS provided the better accuracy, which means it selected the best features. Overall, the RSBFS+SVM model worked well. Tables 3, 4, 5, 6 and 7 report the feature selection-based performance of the base classifiers, and Fig. 5 shows that RSBFS worked better than all other feature selection techniques. RSBFS was therefore chosen as the feature selection technique and applied to the proposed model to improve its performance and to make it easier for the agent to learn within the banking environment.

Results for frame-up: 3

The reduced feature sets generated by the RSES toolbox are shown in Fig. 6 and are considered equivalent in value to the original, much larger dataset. The tool generated a number of reducts that uniquely identify a transaction as fraudulent or not. Each randomly chosen reduct has fewer attributes than the original dataset. Based on a few runs of the RSES tool, reducts with 9–13 attributes were chosen from among all the reducts. The reduct table displays the attributes, number of rows, attribute set size, Pos.Reg, SC, and reducts. The reducts were used as input for the proposed model. RD1 consists of 13 attributes, i.e., RD1 \(=\) {Time, V4, V13, V14, V15, V17, V18, V21, V22, V24, V25, V26, Amount}, RD2 \(=\) {V4, V9, V11, V12, V14, V15, V22, V24, V26, Amount}, RD3 \(=\) {V4, V14, V15, V17, V19, V21, V23, V26, Amount}, RD4 \(=\) {Time, V1, V3, V28, V24, V11, V25, V12, V26, V4}, RD5 \(=\) {Time, V1, V3, V28, V24, V11, V12, V25, V26, V4, V15, V22}.

The proposed renovated RDQN model was compared with other fraud detection techniques; the results are shown in Table 9.

The proposed model has achieved state-of-the-art performance on a highly skewed credit card fraud data set with 96.09 percent accuracy. It was able to correctly classify fraudulent and non-fraudulent transactions. This high-performance model opens the door to many opportunities for exploring the scope of reinforcement learning in the field of classification problems and decision-making.

In Fig. 7, the vertical axis represents training loss and the horizontal axis represents epochs. Figure 7a–f show how the training loss varies over successive epochs. The main objective of the DQN agent is to reduce the training loss, which results in an improvement in accuracy. Figure 7a shows that the training loss is higher when all the attributes in the dataset are considered during training. The loss is noisier in the early epochs, but the noise decreases as the agent’s training proceeds, as shown in Fig. 7.

According to Fig. 7e, RD4 is very efficient, judged by the accuracy and error rate of the deep learning model (DQN), although its features are expensive. It is therefore concluded that RDQN4 is the most useful in classification. After RD4, RD5 performed best, with low training loss. Subsequently, we ensembled all the models (RDQN1 + RDQN2 + RDQN3 + RDQN4 + RDQN5) and calculated a weighted average, which was used as the performance metric for comparison with other fraud detection models. RDQN gave the better performance in terms of average weighted accuracy, as shown in Table 8 and Fig. 8. This maximum accuracy value demonstrates the effectiveness of the DQN classification procedure, which was compared with other fraud detection techniques.

Fig. 5 Comparison of accuracy on specified feature selection techniques upon classifiers

We compared our proposed renovated RDQN model with other fraud detection techniques, namely IFDTC4.5 [40], SAE with GAN [41], a combination of CNN, SVM, and KNN [42], and DEAL [43], as shown in Table 9 and Fig. 9. Among these state-of-the-art fraud detection methods, SAE+GAN and DEAL work best, with accuracies of 94.98% and 94.65%, respectively.

Table 10 compares the proposed RDQN with the best selected traditional algorithms (SVM, ANN), feature selection techniques (RSBFS+ANN, RSBFS+SVM), and other fraud detection techniques (SAE+GAN, DEAL). Our proposed classifier RDQN achieves the highest accuracy, 96.09%, as shown in the graph in Fig. 10. In conclusion, the proposed model outperforms the others in digital fraud detection.

Complexity of the model

In the experiments, the model successfully distinguished between fraudulent and non-fraudulent data. To forecast a model’s potential, a complexity analysis is crucial during the design and implementation phases. The number of neurons, the passing of weights, the number of states, the number of epochs, and the actions to be taken based on the Q-values were among the factors taken into account while evaluating the performance of the DQN classifier. The number of states is determined by the number of samples in the dataset connected to weights in the network. The time complexity of training a DQN with an input layer, two hidden layers, and one output layer (optimal policies, denoted by “P”) was investigated and determined to be \({\mathcal {O}}(NZ\sum _{v=0}^{V-1} P_v \cdot P_{v+1} )\), where the sum runs over the V pairs of consecutive layers. For the DQN classifier applied to ‘M’ reducts, the complexity becomes \({\mathcal {O}}(MNZ\sum _{v=0}^{V-1} P_v \cdot P_{v+1} )\).

Fig. 6 Reducts randomly chosen as input to DQN

Table 8 Comparison of accuracy for different DQN on reducts
Fig. 7 Training loss in epochs of RDQN

Fig. 8 Comparison graph between RDQN\(_{1-N}\) and weighted average of RDQN\(_{1-N}\)

Fig. 9 Accuracy comparison between different feature detection techniques

Table 9 Comparison of accuracy between existing classification models
Table 10 Accuracy of RDQN with other fraud techniques
Fig. 10 Comparison graph of proposed RDQN with base, feature-based, other FDT

Conclusion and future scope

In this research work, the proposed renovated method RDQN is used to detect fraud in digital transactions using reinforcement learning. The second module selects the best feasible attribute set using filter and wrapper techniques with traditional ML algorithms such as LR, NB, DT, SVM, and ANN; RST provided the critical attributes as reducts for further processing. The third module takes the reducts as inputs for DQN to classify fraud and non-fraud transactions. The performance of the proposed DQN has been compared with base classifiers, feature-based classifiers, and other existing fraud detection techniques. The proposed model achieves 96.09% accuracy. The approach may also be suitable for other binary classification problems. In future work, there is room for establishing other learning environments and modifying the reward functions. It is also important to compare the proposed model with new research as it appears.