1 Introduction

The Internet has grown rapidly worldwide in recent years and has become an important part of many people's daily lives. Wireless network information systems now underpin several fields, including online social networking, business transactions, traffic planning, military intelligence, and the IoT [1]. Computer network security is therefore of central importance in today's ever-changing digital landscape. Traditional security measures are frequently ineffective against the growing complexity of cyber-attacks [2]. Cyber threats now range from simple malware and phishing to advanced persistent threats (APTs) [3]. Traditional security solutions, which depend on predetermined rules and signatures, struggle to keep pace with these evolving threats, so the need for adaptable, intelligent systems capable of learning from their environment is apparent [4]. Network security personnel require high situational awareness to understand the overall security condition of the network, to detect faults and abnormal activity, and to propose corrections or improvements [5]. Network risk perception refers to how individuals, organizations, or automated systems reason about potential risks and vulnerable points within a computer network; it involves identifying, analyzing, and understanding risks that could harm the privacy, security, or availability of network resources and data [6]. Adaptive network protection, in turn, aims to detect, mitigate, and prevent cyber threats before they cause damage. Unlike traditional rule-based approaches, it uses intelligent, learning-based methods to continuously adjust and improve security measures as threats evolve [7]. By embedding risk-based security measures into an organization's digital infrastructure, the protection component of the adaptive security model keeps assets secure; identifying vulnerabilities and tightening controls requires close examination of the systems involved [8].

As a subfield of machine learning (ML), reinforcement learning (RL) is the closest to human learning, since it acquires knowledge about its environment through exploration and exploitation [9]. In this setting, an agent interacts with the environment, gaining knowledge and making decisions based on the observed information, while the environment issues rewards or penalties to the agent according to its actions [10]. The idea of RL was heavily influenced by how most people acquire new skills, namely by observing the results of repeated attempts at the task at hand [11]. RL excels in real-time and adversarial settings because of its flexibility and its suitability for modelling an independent agent that performs sequential actions, ideally with little or no prior knowledge of the environment [12]. Incorporating deep learning into RL approaches has increased their ability to tackle complicated problems by improving their function approximation and representation learning capabilities [13]. This strongly suggests that combining deep learning and RL is well suited to cyber security applications in an era of rapidly evolving, pervasive network risks [14]. RL algorithms can be used to map and understand the structure of computer networks in real time [15]. Agents may discover devices, their functions, and their relationships inside a network through environmental exploration; thanks to this dynamic mapping, the network's topology can be observed in real time [16]. Parameter uncertainty is modelled using probability distributions, and decisions are made by maximizing the expected reward [17]. Cyberattacks may be predicted in advance with the help of a virtual agent built using reinforcement learning [18]. Additionally, deep reinforcement learning (DRL) can be used to choose participants with sufficient computing resources and high-quality datasets, improving the quality of data sharing [19]. Feedback from the environment (a trial-and-error interaction that reveals what works well in a given environment) allows an RL approach to learn to protect that environment more effectively by rewarding or penalizing its actions [20].

The major contributions of this article are:

  • Designing a Deep Reinforcement Learning-assisted Network Awareness Risk Perception and Prevention Model (DRL-NARPP) for detecting IoT network attacks.

  • Introducing Q-learning for the IoT network environment, which can identify different network intrusions using an automated trial-and-error technique and continuously improve its attack identification abilities.

  • Presenting experimental results showing that the suggested DRL-NARPP model increases the risk assessment, accuracy, and prediction ratios and reduces the false positive rate compared with existing models.

The remainder of the study is organized as follows: Sect. 2 reviews the literature, Sect. 3 presents the DRL-NARPP model, Sect. 4 discusses the experimental outcomes, and Sect. 5 concludes the article.

2 Related Works

Several theoretical approaches are available for understanding how the risk-related network context may shape network awareness and perceptions. Most studies on information system security have focused on system security, i.e., ensuring that the system is protected through appropriate software and hardware. There is, however, a lack of research on users' awareness of information security and on how this awareness influences their behaviour. The authors note that the role of humans in network security is becoming more widely acknowledged in academia and industry, and they present a review of studies on user awareness of information security [21,22,23,24].

2.1 Cyber-Security Risk Prediction Models

Ihsan H. Abdulqadder et al. [25] proposed directed acyclic graph (DAG)-based blockchain technology for context-aware authentication handover and secure network slicing. The authors provided context-based authentication and secure handover strategies through Markov decision-making (MDM) and the weighted product model to increase security. Abdul Razaque et al. [26] presented a web-based blockchain-enabled cybersecurity awareness program (WBCA) to reduce the risk of cybercrime. The program helps users understand typical cybercriminal behaviour and strengthens end-user familiarity with cyber hygiene, best practices, vulnerabilities, and current attack patterns; WBCA leverages blockchain to protect the software itself from attacks.

2.2 Network Intrusion Detection Models

Nuno Oliveira et al. [27] proposed Intelligent Cyber Attack Detection and Classification (ICADC) for a network-based intrusion detection system. Their experimental results indicate that a sequential approach performs better for anomaly detection: the LSTM proved a reliable model for discovering sequential patterns in network traffic data, reaching an accuracy of 99.9% and an F1-score of 91.6%. Halima Ibrahim Kure et al. [28] recommended integrated cyber security risk management (i-CSRM) for risk prediction in critical infrastructure security. The i-CSRM framework uses a decision-support system based on fuzzy set theory to systematically identify critical assets, together with machine learning techniques to predict emerging risks and evaluate the effectiveness of existing controls. Their findings also show that machine learning classifiers are very effective in predicting various forms of risk, such as DoS attacks, cyber espionage, and malicious programs.

2.3 Network Risk Prediction Models

Bdah Mohammed Mubarak AlShahrani and Mohammad Tabrez Quasim [29] discussed an Adaboost Regression Classifier (ABRC) for classifying cyber-attacks and assessing network risk. The proposed ABRC uses a deep learning framework to estimate the impact of attacks on network security, and its performance evaluation shows that it significantly outperforms the existing deep learning method in detecting cyber-attacks. Shareeful Islam et al. [30] presented a comprehensive assessment model (CAM) for asset criticality and risk prediction in cybersecurity risk management (CSRM) of cyber-physical systems (CPS). The experimental findings show that stakeholders can benefit from an efficient risk management technique when fuzzy set theory is applied to identify the criticality of assets, and that cyber espionage, denial of service, and crimeware are risks reliably predicted by machine learning classifiers.

2.4 AI Methods in Network Security Systems

Onder Tutsoy and Martin Brown [31] presented a reinforcement learning analysis for a minimum-time balance problem. The convergence rate and the difficulties related to the value function parameters are first examined using a second-order unstable balance test problem. Assuming the optimal minimum-time control strategy is known, the authors focus on the convergence of the minimum-time value function. The simulations show that the temporal-difference error creates a null space linked to the basis functions at the end of the experiment, and the subsequent analysis of the parameter convergence rate shows that the residual gradient approach converges faster than TD(0) for this test case. Onder Tutsoy [32] introduced an artificial intelligence-based long-term policy-making algorithm to generate time-varying policies for reopening schools in stages. Under worst-case scenarios, the algorithm's primary goal is to generate policies that maximize school enrolment while minimizing pandemic mortality, and the results show that the suggested algorithm can provide effective policies that reduce COVID-19 fatalities while increasing school enrolment.

Based on this survey, existing systems face several issues in attaining high risk assessment, accuracy, and prediction ratios while keeping false positive rates low. Hence, this study proposes the Deep Reinforcement Learning-assisted Network Awareness Risk Perception and Prevention Model (DRL-NARPP) for detecting malicious activity in cybersecurity.

3 Deep Reinforcement Learning-assisted Network Awareness Risk Perception and Prevention Model (DRL-NARPP)

Network awareness refers to an organization or system's capacity to comprehend and monitor the condition of its network infrastructure. Understanding the network requires familiarity with its components, their configurations, the paths data take, and any security holes. Awareness of developments in one's own network is essential for sound cybersecurity, since it enables the discovery of anomalous behaviour, the identification of potential threats, and the prompt resolution of security issues. Risk perception is the assessment and comprehension of the risks and hazards facing a network; it covers the likelihood of a threat occurring and its possible effect on the network and on the organization as a whole. Accurate risk perception requires in-depth familiarity with the assets, the risks they face, and constant monitoring. Adaptive network prevention is a proactive cybersecurity method in which security policies are continuously monitored and updated in response to changing threats. This approach raises security to a new level by including dynamic protections such as AI and machine learning, for example by applying machine learning algorithms to identify abnormalities, adopting behaviour-based analysis, and using automatic response mechanisms to mitigate new risks.

The extent and complexity of cyber-attacks have grown rapidly as the number of connected IoT devices has risen in recent years. The introduction of many inventive threats and varied network applications poses a significant challenge to the design of an efficient network intrusion detection system (IDS). Signature-based methods have become the norm, yet they are readily defeated by small changes to malware or its dropper. Another strategy considers behavioural deviations: the system's actions are tracked over time and those that seem out of the ordinary are flagged. Anomaly detection-based methods, however, commonly mislabel common and legitimate system operations as malicious.

Detecting network attacks using machine learning is another prevalent approach. Recognizing patterns in data is a core application of machine learning, and in cybersecurity it can improve understanding of an attacked system's behaviour. However, traditional machine learning algorithms have their limits in cybersecurity, particularly in operational environments. Unless a dataset is deliberately balanced, it will typically be dominated by benign data, since most of the observed traffic in a real system is not an attack; the base rate of attacks is therefore very low. As a result, most machine learning algorithms tend to overfit to data from benign environments, often fail in practice, and have trouble generalizing to threats they have not seen.

With the development of DRL techniques, complex cyber-attacks can be detected and countered, including the insertion of falsified data into cyber-physical systems. Hence, this study proposes the Deep Reinforcement Learning-assisted Network Awareness Risk Perception and Prevention Model (DRL-NARPP) for detecting IoT network malicious activity in cybersecurity. Incorporating long-term planning abilities into RL algorithms enables them to consider future situations and risks, and RL agents may learn to mitigate risks by practising in simulated environments and making informed predictions about the future.
RL algorithms may be designed with reward functions that discourage behaviours linked to potential risks. By shaping the reward signal, an RL agent can learn to minimize risk and avoid behaviours that lead to unfavourable future states. The RL algorithm can still optimize policies using the outputs of a multi-layer neural network: the network's function approximation capability makes it easier to estimate the values or probabilities associated with different actions in different states, and the RL agent may learn to reduce future risks by optimizing its policy with these approximations.
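As a minimal illustration of such risk-discouraging reward shaping (not part of the original model; the state labels and numeric penalties below are hypothetical assumptions), a reward function can penalize transitions into risky states so that a standard RL update learns to avoid them:

```python
# Minimal sketch of a risk-penalizing reward function (illustrative only;
# the state labels and numeric penalties are hypothetical assumptions).
def risk_aware_reward(new_state: str, attack_blocked: bool) -> float:
    """Return an immediate reward that discourages transitions into risky states."""
    reward = 0.0
    if attack_blocked:
        reward += 1.0          # encourage successful mitigation
    if new_state == "compromised":
        reward -= 10.0         # heavy penalty for reaching an unsafe state
    elif new_state == "degraded":
        reward -= 2.0          # smaller penalty for partially risky states
    return reward
```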

Figure 1 shows the interaction between the environment and the agent in the RL process. In RL, an agent is defined by its ability to generate learning experiences through direct interaction with the environment, in contrast to the other major branch of ML, supervised learning, which learns from labelled examples. RL is characterized by the ideas of action, state, and reward. At every time step, the agent takes an action that results in two outcomes: a new state of the environment and a reward or penalty for the agent. The reward is a function that, given a state, indicates to the agent whether its current behaviour is close to optimal. Based on the rewards it receives, the agent learns to take more beneficial and fewer harmful actions. Q-learning is a popular RL technique that uses the Bellman equation to optimize the discounted cumulative reward, as expressed in Eq. (1).

Fig. 1

Interaction between Environment and Agent in RL process

$$P\left({w}_{t},{b}_{t}\right)=E\left[{r}_{t+1}+\alpha {r}_{t+2}+{\alpha }^{2}{r}_{t+3}+\dots \mid {w}_{t},{b}_{t}\right]$$
(1)

The discount factor \(\alpha \in [0, 1]\) controls the importance of future rewards and also serves as a mathematical device to keep the cumulative reward bounded so that learning converges. In practice, discounting is commonly used because of the stochastic environment's limited observability and inherent uncertainty.
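The sketch below shows a minimal tabular Q-learning update that backs up the discounted return of Eq. (1) using the Bellman equation mentioned above; the state/action space sizes, learning rate, and discount value are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

# Minimal tabular Q-learning update consistent with Eq. (1); sizes and
# hyperparameters are illustrative assumptions.
n_states, n_actions = 16, 4
alpha_discount = 0.9      # discount factor (alpha in Eq. (1))
lr = 0.1                  # learning rate
Q = np.zeros((n_states, n_actions))

def q_update(w, b, r, w_next):
    """One Bellman-backup step: move Q(w, b) toward r + alpha * max_b' Q(w', b')."""
    target = r + alpha_discount * np.max(Q[w_next])
    Q[w, b] += lr * (target - Q[w, b])
```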

Q-learning requires storing the expected reward (Q-value) of each action in each state in a lookup table (Q-table); as the state and action spaces grow, so does the memory needed. Real-world problems generally involve continuous state or action spaces, so tabular Q-learning is ineffective for them. Fortunately, DL has developed into a powerful tool that can be combined with conventional RL methods: thanks to its signature capabilities, representation learning and function approximation, it can learn an efficient low-dimensional representation of raw high-dimensional information. However, using deep neural networks (DNNs) to estimate Q-functions is unstable because of the correlations among the sequence of observations and the coupling between the Q-value \(P(w, b)\) and the target value \(P({w}{\prime}, {b}{\prime})\). Two mechanisms address this. First, an experience memory stores a large list of learning experience tuples \((w, b, r, {w}{\prime})\) produced by the agent's interaction with the environment; during learning, these memories are sampled at random to avoid being influenced by the correlations between consecutive experiences. Second, the target network is an identical copy of the estimating network, except that its parameters are frozen and only updated periodically.
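To make these two stabilizing mechanisms concrete, the following PyTorch sketch shows an experience memory sampled at random and a periodically synchronized target network; it is an illustration under assumed layer sizes, buffer capacity, and hyperparameters, not the paper's exact implementation.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small Q-network mapping a feature vector to one Q-value per action."""
    def __init__(self, n_features: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

n_features, n_actions, gamma = 8, 4, 0.9           # assumed sizes and discount
policy_net = QNet(n_features, n_actions)
target_net = QNet(n_features, n_actions)
target_net.load_state_dict(policy_net.state_dict())  # frozen copy, synced periodically
memory = deque(maxlen=10_000)   # tuples (state, action [0-dim long], reward [0-dim float], next_state)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def train_step(batch_size: int = 32):
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)       # random sampling breaks correlations
    w, b, r, w_next = map(torch.stack, zip(*batch))
    q_pred = policy_net(w).gather(1, b.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                            # target uses the frozen network
        q_target = r + gamma * target_net(w_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```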

Figure 2 shows the proposed DRL-NARPP model. The data are taken from the Edge-IIoTset cybersecurity Kaggle dataset [33]. Data pre-processing converts raw data into a consistent and comprehensible format, and pre-processing practices are applied as needed during the production phase of the system development life cycle. Grouping and aggregating data (such as summation grouping), normalizing and standardizing data, and scaling data are all pre-processing procedures. Once the data have been pre-processed, suitable cyber data encoding enables dataset-feature mapping: the input data are transformed into feature outputs ready for use by the subsequent feature learning subsystem. A denoising autoencoder (DAE) is used in this research to improve the precision of the suggested IDS. A DAE is a deep neural network trained on an adversarial (corrupted) dataset to predict the unperturbed (clean) dataset. It has two parts, an encoder and a decoder: the encoder compresses the input information into a representation called the code, and the decoder reconstructs the output data from the code. Between the visible input and output layers of the network lies a collection of hidden layers (the encoder) that make up a DNN, which can model either a linear or a non-linear relationship between input and output.

To solve sequential decision-making problems, reinforcement learning (RL) employs iterative procedures in which an agent (the decision-maker) interacts with its environment to learn how to behave appropriately in different situations. Formally, the agent aims to find a policy that leads to the best possible outcome given the system's present state. This research aims to improve the system's effectiveness by teaching the agent a strategy that increases the number of IoT attacks identified over time. If attackers want their IoT attacks to persist in the IoT network and generate more profit, they must master stealth and resistance: an attacker may use various techniques to hide the telltale signs of an attack and prevent it from being uncovered, for instance by regularly altering the temporal and spatial characteristics of attack traffic. The created system identifies cyberattacks in network traffic fairly accurately. The probability of a cyberattack occurring depends on the severity of the recognized security threat, the level of exposure, and the frequency of attacks. The reward function gives the agent immediate feedback to direct its behaviour; the value function, by approximating the expected cumulative reward, helps the agent assess the long-term effects of its actions; and the policy decides what the agent should do given those value estimates.
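A minimal PyTorch sketch of the denoising autoencoder stage described above is given below; the layer widths, code dimension, and noise level are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal denoising autoencoder sketch for the pre-processing stage described
# above; layer widths, code dimension, and noise level are assumptions.
class DenoisingAE(nn.Module):
    def __init__(self, n_features: int, code_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def dae_train_step(model, optimizer, clean_batch, noise_std=0.1):
    """Train the DAE to reconstruct clean inputs from corrupted ones."""
    corrupted = clean_batch + noise_std * torch.randn_like(clean_batch)
    recon = model(corrupted)
    loss = nn.functional.mse_loss(recon, clean_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```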

Fig. 2

Proposed DRL-NARPP model

DNNs implement both the master agent and the individual learners, with separate outputs for the critic and the actor. The first output is a scalar value representing the predicted reward of a given state \(U\left(w\right)\), while the second output is a vector representing a probability distribution over all possible actions \(\pi (w, b)\). The critic's loss function is stated as:

$${K}_{1}=\sum {\left(R-U\left(w\right)\right)}^{2}$$
(2)

In Eq. (2), \(R = r + \alpha U \left({w}{\prime}\right)\) denotes the discounted future reward. The actor minimizes the following policy loss function:

$${K}_{2}=-{\text{log}}\left(\pi \left(b\left|w\right.\right)\right)*B\left(w\right)-\vartheta G(\pi )$$
(3)

As shown in Eq. (3), \(B(w) = R -U (w)\) is the estimated advantage function and \(G\left(\pi \right)\) is an entropy term that manages the exploration ability of the agent, with the hyperparameter \(\vartheta\) controlling the strength of the entropy regularization. The advantage function \(B(w)\) indicates how beneficial it is for the agent to be in a particular state. Asynchronous advantage actor-critic (A3C) learning is asynchronous because every learner interacts with its own environment and updates the master network independently. The cycle is repeated until learning is complete, at which point the master network is used.
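The following sketch expresses the critic and actor losses of Eqs. (2)-(3) in PyTorch; variable names follow the text (R is the discounted target, U the state value, pi the action probabilities), and the entropy weight is an illustrative assumption.

```python
import torch

# Sketch of the losses in Eqs. (2)-(3) for a single transition.
def actor_critic_losses(r, alpha, U_w, U_w_next, pi_w, action, entropy_weight=0.01):
    R = r + alpha * U_w_next.detach()               # discounted future reward
    advantage = R - U_w                              # B(w) = R - U(w)
    critic_loss = advantage.pow(2).sum()             # K1 in Eq. (2)
    log_prob = torch.log(pi_w[action])
    entropy = -(pi_w * torch.log(pi_w)).sum()        # G(pi), encourages exploration
    actor_loss = -log_prob * advantage.detach() - entropy_weight * entropy  # K2 in Eq. (3)
    return critic_loss, actor_loss
```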

To study cyberattacks on cyber-physical systems (CPS), the cyber-state dynamics are characterized by the statistical model

$$\dot{y}\left(t\right)=f\left(t,y,v,\omega ;\theta \left(t,b,d\right)\right);\quad y\left({t}_{0}\right)={y}_{0}$$
(4)

In Eq. (4), \(y\), \(v\) and \(\omega\) represent the physical states, control input and disturbances, respectively, while \(\theta \left(t,b,d\right)\) defines the cyber state at time \(t\), with \(b\) and \(d\) referring to the cyber attack and defence, respectively.

Figure 3 shows the DRL-assisted intrusion detection system. In this architecture, DRL agents (edge servers or base stations) connect wireless terminals to the rest of the network. These agents observe their environment and use the DRL model explained in the following sections to make decisions. With the help of DRL agents, a specialized type of RL agent, the model may be trained to predict future rewards and flag prospective intrusions or attacks. The IoT node acts as an agent, monitoring its environment and making decisions based on the information it learns; this agent may be a server or base station with sufficient processing capability. Intrusion detection proceeds in two stages. First, a distributed trust management mechanism is set up to carefully select reliable nodes for the network, with the legitimacy of each device tested by reputation assessment. Nodes with specific roles, such as base stations or servers, collect transmitted information and use DRL models to draw judgements, and the DRL agent only communicates with the most reliable devices to handle their data needs. Second, this paper adopts a DRL technique to identify network intrusions. During training, the agent uses an exploration policy, such as the \(\epsilon\)-greedy approach, to investigate potential actions: the agent takes a random action with probability \(\epsilon\), or follows the greedy method and selects the action with the maximum value function with probability 1-\(\epsilon\). As an extension of conventional RL methods, the Deep Q-network (DQN) algorithm estimates the Q-value of each state-action pair through the function P(w,b). Using the states and actions of the environment and the Kaggle dataset, the DQN agent in our system approximates the Q-value with a DNN, since the large number of features in the dataset and the batch size used during DQN training make it impossible to store the Q-value of every state-action combination in a Q-table. To estimate the Q-value of each state-action combination, this study therefore applies the DQN algorithm to intrusion detection.
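A minimal sketch of the \(\epsilon\)-greedy exploration policy described above follows; the value of \(\epsilon\) is an illustrative assumption.

```python
import random
import numpy as np

# Epsilon-greedy action selection: explore with probability epsilon,
# otherwise pick the action with the highest estimated Q-value.
def epsilon_greedy(q_values: np.ndarray, epsilon: float = 0.1) -> int:
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # random exploratory action
    return int(np.argmax(q_values))              # greedy action
```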

Fig. 3

DRL-assisted intrusion detection systems

Suppose the attackers can launch different attacks represented by the attack vector \(B= ({b}_{1}, {b}_{2}, {b}_{3}, ..., {b}_{m})\). The thresholds used to detect the respective attacks in time slot \(t\) can be written as \({T}^{t}=\left({\theta }_{{b}_{1}}^{t},{\theta }_{{b}_{2}}^{t},{\theta }_{{b}_{3}}^{t},\dots ,{\theta }_{{b}_{m}}^{t}\right)\). The components of the attack vector are observed continuously and simultaneously. Precisely, at time \(t\), if the perceived entropy \({G}_{{b}_{j}}^{t}\) of a feature for attack \({b}_{j}\) surpasses the current threshold \({\theta }_{{b}_{j}}^{t}\), the IoT attack detector raises an alarm to the administrators. Identification outcomes may be either positive (the traffic is flagged as an attack) or negative (the traffic is not flagged as an attack). Each detection outcome may or may not reflect the traffic's true nature, which yields four categories: a True Positive (TP) confirms attack traffic as malicious; a False Positive (FP) erroneously labels benign data as malicious; a False Negative (FN) incorrectly labels malicious traffic as benign; and a True Negative (TN) correctly labels benign traffic as benign.

Now, this study counts the occurrences of each case during a period \(T\) consisting of \(m\) time slots \(({t}_{1}, {t}_{2},..., {t}_{m})\). \({M}_{11}^{T}\) denotes the number of confirmed true attacks, \({M}_{12}^{T}\) the number of false alarms, \({M}_{21}^{T}\) the number of actual attacks missed, and \({M}_{22}^{T}\) the number of correctly identified benign traffic flows. Then, over time \(T\), the total reward obtained from the environment is:

$${R}^{T}=\left({Q}_{0}-{C}_{0}\right)*{M}_{11}^{T}-\left({C}_{0}-{C}_{1}\right)*{M}_{12}^{T}-{C}_{2}*{M}_{21}^{T}+{Q}_{1}*{M}_{22}^{T}$$
(5)

where

$${M}_{11}^{T}+{M}_{12}^{T}+{M}_{21}^{T}+{M}_{22}^{T}=m$$
(6)

Then, the hit rates \({\delta }_{T}\) and false alarm rates \({\beta }_{T}\) could be calculated by:

$${\delta }_{T}=\frac{TP}{TP+FN}=\frac{{M}_{11}^{T}}{{M}_{11}^{T}+{M}_{21}^{T}}$$
(7)
$${\beta }_{T}=\frac{FP}{FP+TN}=\frac{{M}_{12}^{T}}{{M}_{12}^{T}+{M}_{22}^{T}}$$
(8)

The system state in time \(T\) could be signified as:

$${W}_{T}=\left({\delta }_{T},{\beta }_{T}\right)$$
(9)

This study models attack identification in IoT networks as a Markov decision process. The objective is to increase the utility of the IoT attack identification system \({R}^{T}\) by optimizing the threshold \({\theta }_{{b}_{j}}\) used to identify the particular attack type \({b}_{j}\):

$${\theta }_{{b}_{j}}^{*}={\text{arg}}\underset{{\theta }_{{b}_{j}}\ge 0}{{\text{max}}}{R}^{T}$$
(10)

The feature threshold must be suitably selected to maximize the system's utility. The identification system should not miss certain attacks; at the same time, it should not raise too many false alarms, since these require human intervention. Rewarding careful and accurate attack detection encourages the system to do its task well and greatly lowers the need for human involvement. To help the detection agent make better decisions over time, this study proposes a network risk detection method based on reinforcement learning.
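To make Eqs. (5)-(10) concrete, the sketch below computes the utility and the hit/false-alarm rates from the four counts and performs a naive grid search over a single detection threshold; the gain/cost constants and candidate thresholds are illustrative assumptions, and the grid search merely stands in for the RL-based threshold optimization used in the model.

```python
import numpy as np

# Assumed gains/costs for Eq. (5); illustrative values only.
Q0, Q1, C0, C1, C2 = 5.0, 1.0, 1.0, 0.5, 3.0

def episode_reward(M11, M12, M21, M22):
    """R^T = (Q0-C0)*M11 - (C0-C1)*M12 - C2*M21 + Q1*M22 (Eq. (5))."""
    return (Q0 - C0) * M11 - (C0 - C1) * M12 - C2 * M21 + Q1 * M22

def hit_and_false_alarm_rates(M11, M12, M21, M22):
    delta = M11 / (M11 + M21)        # Eq. (7): TP / (TP + FN)
    beta = M12 / (M12 + M22)         # Eq. (8): FP / (FP + TN)
    return delta, beta

def best_threshold(entropy_scores, labels, candidates=np.linspace(0.1, 2.0, 20)):
    """Pick the threshold theta maximizing R^T over a labelled trace (cf. Eq. (10))."""
    best_theta, best_r = None, -np.inf
    for theta in candidates:
        flagged = entropy_scores > theta
        M11 = int(np.sum(flagged & (labels == 1)))      # true alarms
        M12 = int(np.sum(flagged & (labels == 0)))      # false alarms
        M21 = int(np.sum(~flagged & (labels == 1)))     # missed attacks
        M22 = int(np.sum(~flagged & (labels == 0)))     # benign passed through
        r = episode_reward(M11, M12, M21, M22)
        if r > best_r:
            best_theta, best_r = theta, r
    return best_theta, best_r
```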

Figure 4 shows the scenario of an IoT network attack. As the attacker varies attack variables such as attack type and rate, the classification boundary at which traffic is considered malicious may change. Therefore, this research applies an RL model to the attack identification system to continually update the classification boundary, adapting to novel IoT threats. Entropy-based metrics are commonly used for anomaly discovery in intrusion detection. In information theory, entropy quantifies the uncertainty associated with a data variable: more unpredictability in the variable corresponds to a larger entropy value, whereas low entropy means the variable's distribution is concentrated, which may indicate the presence of an abnormality in the present system. Anomaly detection is made harder by the wide variety of protocols, platforms, hardware, and software, each exposing different vulnerabilities to attack. Meanwhile, new low-rate attacks make it more difficult to distinguish safe traffic from harmful traffic, and attackers are developing the capability to modify attack techniques and even design new attacks based on feedback from the environment, which necessitates an immediate response from the defence. The suggested DRL-NARPP model increases the anomaly detection ratio, attack prediction accuracy ratio, and network risk assessment ratio, and reduces the false positive rate compared with other existing methods.
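As a small illustration of the entropy metric described above (the traffic windows below are hypothetical), a concentrated distribution of a feature within a time window yields low entropy, which can signal an anomaly such as a flooding attack:

```python
import numpy as np
from collections import Counter

def window_entropy(values) -> float:
    """Shannon entropy (bits) of the value distribution within one time window."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

normal_window = ["10.0.0.1", "10.0.0.7", "10.0.0.3", "10.0.0.9", "10.0.0.5"]
attack_window = ["10.0.0.1"] * 5          # one source dominates the window
print(window_entropy(normal_window))      # higher entropy: diverse sources
print(window_entropy(attack_window))      # 0.0: concentrated distribution
```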

Fig. 4

A scenario of IoT network attack

4 Experimental Outcomes and Discussion

This study presents the Deep Reinforcement Learning-assisted Network Awareness Risk Perception and Prevention Model (DRL-NARPP) for detecting IoT network malicious activity in cybersecurity. The data are taken from the Edge-IIoTset cybersecurity Kaggle dataset [33]. The data contain both normal network activity and several malicious attacks. More than ten IoT devices (including an ultrasonic sensor for detecting water levels, low-cost digital sensors for temperature and humidity, a flame detector, a heart rate sensor, a pH sensor meter, a soil moisture sensor, and so on) contribute to the IoT data stream. The dataset covers fourteen attacks against IIoT and IoT communication protocols, grouped into five categories: information gathering, denial-of-service/distributed denial-of-service, man-in-the-middle, malware, and injection. The dataset allows the efficacy of machine learning strategies to be analysed under both centralized and decentralized learning conditions, and the main exploratory data analysis results on this realistic cyber security dataset are provided with it. The performance of the suggested DRL-NARPP model is examined using metrics such as the anomaly detection ratio, attack prediction accuracy ratio, network risk assessment ratio, and false positive rate.
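A hedged sketch of how such a dataset could be loaded and pre-processed follows; the CSV filename and label column name are placeholders and may not match the actual Kaggle export.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Placeholder filename and column names; adjust to the actual Kaggle export.
df = pd.read_csv("edge_iiotset.csv")
df = df.dropna().drop_duplicates()

label_col = "Attack_type"                             # assumed label column
y = LabelEncoder().fit_transform(df[label_col])
X = df.drop(columns=[label_col]).select_dtypes("number")
X = MinMaxScaler().fit_transform(X)                   # scale features to [0, 1]
print(X.shape, len(set(y)))
```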

(i) Anomaly Detection Ratio

Detecting attacks and other anomalies in IoT network infrastructure is becoming more important as such systems see widespread use across industries and the threats against them increase. Attacks and abnormalities that may bring down an IoT system include data type probing, denial of service, malicious control, scanning, malicious operation, spying, and incorrect setup. Moreover, attacks against IoT network systems may spread over a wider region, damaging far more devices than a conventional attack on a local network would. Microservices in the IoT network that exhibit sporadic behaviour and disturb the consistency of IoT service operation constitute an anomaly. RL's flexibility in responding to unfamiliar situations is an attractive feature here: an IoT anomaly detection system based on RL may adjust its behaviour over time to account for changing network and device conditions. Looking for anomalies commonly involves making a series of choices depending on the system's changing condition, and RL's strength in learning optimal solutions across sequences of actions makes it well suited to this type of problem. Q-learning aims to train agents to take the best possible actions in highly uncertain environments, and environments with huge state spaces pose no problem for deep Q-learning, since the network approximation of the Q-function (which estimates the cumulative reward for each action in a given state) makes them tractable. Figure 5 shows the anomaly detection ratio.

Fig. 5

Anomaly Detection Ratio

(ii) Attack Prediction Accuracy Ratio

This research provides an RL-based attack identification model that can automatically learn and identify changes in attack patterns, allowing it to adapt to novel features of attacks on IoT networks. The DRL attack prediction module uses a mini-batch encoding technique to bring reinforcement learning into a supervised learning setting, which enhances accuracy. In addition, the policy networks used in this part of the model are deliberately kept simple to maximize efficiency. Feature selection improves IDS model performance by removing superfluous features and maximizing prediction precision and efficiency (an illustrative feature-selection step is sketched below). Compared to support vector machines (SVM), hidden Markov models, and other ML or data mining techniques, experimental findings obtained from system demand trace data demonstrate the superiority of the suggested DRL-based network IDS in terms of higher accuracy and reduced computing cost. Figure 6 shows the attack prediction accuracy ratio.
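The snippet below is an illustrative feature-selection step (the estimator and threshold are assumptions, not the paper's exact configuration), reusing the X and y arrays from the earlier loading sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Keep only features whose importance is above the median importance.
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0),
                           threshold="median")
X_reduced = selector.fit_transform(X, y)
print(X.shape[1], "->", X_reduced.shape[1], "features kept")
```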

Fig. 6

Attack Prediction Accuracy Ratio

(iii) Network Risk Assessment Ratio

Cyber awareness refers to the knowledge and comprehension end users have of cybersecurity best practices and of the everyday cyber dangers their networks or organizations face. Using reinforcement learning, especially Q-learning, for network risk assessment entails teaching an agent to take actions that lessen vulnerability across the board. This requires a reward structure that incentivizes the RL agent to take measures that secure the network: actions that contribute to a more secure network are rewarded, for instance by awarding points for spotting and fixing security flaws while deducting them for ignoring an impending danger. The trained model's effectiveness in reducing network risks is evaluated on a distinct dataset or in a simulated environment, testing the agent's decision-making skills in various settings. Figure 7 shows the network risk assessment ratio.

Fig. 7

Network Risk Assessment Ratio

(iv) False Positive Rate

The false positive rate in a network can be measured through the intrusion detection rate. Assessing the efficacy of intrusion detection systems (IDS) continues to depend heavily on balancing the risks of false negatives and false positives. Confusion between normal and abnormal behaviour can produce both false positives and false negatives when simple detection systems are employed. Several DoS and DDoS attacks can masquerade as regular traffic, and detecting them effectively requires studying numerous elements of network behaviour. Relying on a single data point (such as link usage) may lead to erroneous conclusions in settings where heavy use of a certain resource is to be expected. Because of this, methods must be developed that can tell the difference between attack activity and legitimate programs that consume a lot of resources. Figure 8 shows the false positive rate.

Fig. 8

False Positive Rate

Table 1 shows the confusion matrix for classifying phishing. Cyberattacks such as ransomware, insider threats, phishing, botnets, and malware are a constant reality today, the situation continues to worsen, and the volume of data that might be compromised is enormous and ever-increasing. By including a learning process geared towards email type, concealed malware, or compromised URLs, reinforcement learning can be used for spam and phishing detection. The corresponding classification problem aims to identify phishing attacks within a data collection containing both spoofed and authentic instances. The false positive (FP) rate measures the number of legitimate instances erroneously recognized as phishing attacks relative to all actual legitimate occurrences, as shown in Eq. (11).

Table 1 Confusion matrix

                        Classified as phishing    Classified as legitimate
Actual phishing         \({M}_{P\to P}\)          \({M}_{P\to L}\)
Actual legitimate       \({M}_{L\to P}\)          \({M}_{L\to L}\)

$$FP=\frac{{M}_{L\to P}}{{M}_{L\to L}+{M}_{L\to P}}$$
(11)

As shown in Eq. (11), \({M}_{P\to P}\) denotes the number of phishing instances that are correctly classified as phishing, \({M}_{L\to P}\) the number of legitimate instances that are erroneously classified as phishing, \({M}_{P\to L}\) the number of phishing instances that are incorrectly classified as legitimate, and \({M}_{L\to L}\) the number of legitimate instances that are correctly classified as legitimate.
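A small numeric illustration of Eq. (11) with hypothetical counts:

```python
# False-positive rate from the confusion-matrix counts in Eq. (11);
# the counts below are hypothetical example values.
M_LL = 950   # legitimate classified as legitimate
M_LP = 50    # legitimate misclassified as phishing (false positives)

fp_rate = M_LP / (M_LL + M_LP)
print(f"FP rate: {fp_rate:.3f}")   # 0.050 for these example counts
```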

5 Conclusion

This study presented the Deep Reinforcement Learning-assisted Network Awareness Risk Perception and Prevention Model (DRL-NARPP) for detecting IoT network malicious activity in cybersecurity. The research explored the application of reinforcement learning, specifically Q-learning, in the context of network awareness, risk perception, and adaptive network prevention. The objective was to design a smart system that can continuously evolve in response to the ever-changing nature of cyber threats. Using Q-learning, the network can dynamically modify its preventative measures in response to a constantly changing threat landscape; this flexibility is essential for successfully fighting evolving cyber threats, some of which may exhibit previously unanticipated patterns. The agent was shown to perceive and evaluate network threats accurately thanks to its reinforcement learning capabilities: as it accumulated experience by interacting with the system and analyzing collected data, it became more adept at spotting security flaws and dangers in the network, and over time it learned decision-making techniques that reduced false alarms while responding effectively to genuine security risks. The Q-learning model optimized preventative measures through continual interaction with the network environment; as part of this process, the study adjusted firewall rules, updated security configurations, and implemented preventative measures to lower the risk level. The suggested method uses a decentralized approach; however, it does not account for computational and network costs (such as delays between agents in routers and the central intrusion detection system).