1 Introduction

It is expected that unmanned aerial vehicles (UAVs), also known as drones, will scale up their flights, and that their operations beyond visual line of sight (BVLOS) will account for most air traffic by 2035 [1]. This rapid expansion of drone operations can benefit various fields such as agriculture, emergency response, search and rescue, and firefighting.

This expected growth of drone traffic underlines the need for automated traffic management solutions that ensure the safe conduct of unmanned aviation operations. The B-Drone project [2], funded by the Spanish State Research Agency and concerned with drone fleet supervision and optimal planning for business operations, aims to provide holistic and integrated solutions to U-space [3], the Unmanned Traffic Management (UTM) system in Europe, by focusing on open research questions related to the large-scale, business-oriented operation of drone fleets.

The integration of AI, particularly deep reinforcement learning (DRL), promises to enhance the autonomy and adaptability of drones in complex environments. Researchers have proposed drones equipped with DRL methods, focusing on tasks such as avoiding obstacles, pursuing other drones, drone racing, and navigating safely to a designated destination. For instance, a drone trained with DRL competed against three human champions, including the world champions of two international leagues, in real-world, head-to-head races [4]. This intelligent drone won several races against each human champion and achieved the fastest recorded race time. Moreover, synchronized drones have been trained with DRL for tracking and aerial filming while maintaining formation and preventing collisions [5]; the trained model successfully enabled the drones to follow a target drone. DRL models have also been implemented and tested in real-world drone applications, such as a system for the autonomous landing of a UAV on a fixed pad that relies solely on raw visual inputs [6]. The networks were able to generalize to real environments even though training was performed only in simulation and with a limited subset of textures. In addition, multi-agent deep reinforcement learning has been used in swarms of drones for flood monitoring [7]: the agents are trained over simulated floods, and simulation results show that the aircraft can efficiently monitor floods in a coordinated fashion with both reward schemes.

However, drones have also introduced new challenges, particularly regarding security and safety. The increasing use of commercially available drones and their expanding capabilities pose a threat to the safety of the skies when they are misused. Their potential misuse, whether intentional or accidental, poses significant risks to public and airspace safety. To mitigate these challenges, effective countermeasures have been developed. Counter-drone systems rely on technologies for the detection, localization, and identification of small UAVs, including infrared sensors, laser devices, optical surveillance aids, acoustic devices, LiDAR (Light Detection and Ranging) sensors, image recognition equipment, devices capable of detecting and localizing UAV remote control signals, and human air observers [8]. Additionally, AI-powered drones can themselves be used for counter-drone operations. For instance, DRL methods are a promising approach for countering drones autonomously. DRL, a subset of AI, can be applied to drones to autonomously perform tasks such as navigation, obstacle avoidance, and target tracking. DRL models learn optimal strategies through trial-and-error experience, which allows them to adapt to dynamic environments and make decisions.

A survey [9] explains in detail the limitations and advantages of available counter-drone technologies. The study notes that threats posed by drones at airports are not easy to deal with; however, methods such as geofencing, multiple radars with different detection ranges, and combinations of radio-frequency sensors with visual detection sensors can be implemented to defend airports against unwanted drones. It also highlights that airfield operators must remain within the law when using disruptive technologies, and that the risks to the wider community should be fully assessed and understood. Additionally, a review of the available counter unmanned aerial system (UAS) technologies [10] evaluates the concept of a multi-platform counter-UAS system based on a team of mini drones acting as a cooperative defensive system, although challenges remain in team coordination, team communication networks, and team simulation frameworks. These studies indicate that a drone equipped with deep reinforcement learning can be an effective means of countering malicious drones. Implementing AI in counter-drone technology can make the engagement of unknown and malicious drones accurate and efficient: AI methods significantly reduce the time required to engage the target compared with methods that rely on human intervention, and an AI-equipped drone can identify and classify targets with high precision, thereby minimizing false interdictions. For instance, an artificial intelligence approach based on deep reinforcement learning to counter drones is introduced in a doctoral thesis [11]. The author argues that a key benefit lies in autonomous decision-making, which enables AI agents to make decisions independently based on their surroundings and circumstances, so that drone threats can be countered effectively even in vulnerable environments. Furthermore, AI agents can be trained in simulation, providing a secure platform for experimentation, testing, and validation prior to actual deployment. Similarly, another study [12] proposes a UAV platform that autonomously detects, hunts, and takes down other small UAVs in GPS-denied environments; the approach was developed without a motion-capture system and establishes a baseline for autonomous drone hunting in such environments. Solutions based on multi-agent reinforcement learning have also been employed for countering drones. For example, several behavior-based multi-agent strategies are proposed in a doctoral thesis [13], which investigates multi-agent pursuit with deep reinforcement learning techniques and implements collective hunting behavior. In another study [14], multi-agent deep reinforcement learning strategies are investigated for defender UAVs intervening in target-oriented attacks by intelligent UAV swarms; rather than relying on a confrontational approach of attack and defense, the researchers focus on developing countermeasures that intelligently mislead or delay attacks on a target. Moreover, a multi-agent system is proposed in [15] that formulates the counter-UAS problem as a multi-agent partially observable Markov decision process and applies multi-agent partially observable DRL to optimize pursuer control, training a group of UAS (the pursuers, or agents) to pursue and intercept a faster UAS (the evader) that tries to escape capture while navigating through crowded airspace containing several moving, non-cooperating entities.

Explainable artificial intelligence (XAI) comprises a set of processes and approaches for understanding how an artificial intelligence system makes decisions [16]. To open the black box of DRL methods, several XAI techniques have been used in the literature. These include Gradient-weighted Class Activation Mapping (Grad-CAM) [17] for visualizing areas of attention, saliency maps [18] for highlighting important salient features, Shapley Additive Explanations (SHAP) [19], a game-theoretic approach to explaining the output of any machine learning model, and Local Interpretable Model-agnostic Explanations (LIME) [20]. More generally, these methods can be categorized into groups such as graphical visualization of the agent learning process, summarizing the policy an RL agent has learned, verification of the models and of the physical system being controlled, human collaboration, and query-based explanations [21].

XAI has been investigated by researchers to produce more explainable AI models that users can understand and trust. A review by Hickling et al. [22] analyzes the XAI methods that researchers apply to DRL models and the practicality of these implementations, and discusses the weaknesses of the existing XAI methods. The review also highlights that an XAI method that works in one scenario may not be applicable in a more complex scenario, such as a game with more available actions and more possible outcomes for those actions. Moreover, Sullivan et al. [23] trained a deep convolutional Q-learning agent on 20 Atari games and visualized the experience replay using the SHAP Deep Explainer, which approximates SHAP values for deep learning models. The Deep Explainer builds on the DeepLIFT (Deep Learning Important FeaTures) algorithm [24], a method that deconstructs a neural network's output prediction for a specific input by backpropagating the contributions of all neurons in the network to each input feature. Their analysis validates the agent's decision-making process by checking whether the agent's actions are aligned with the game's mechanics and objectives. Additionally, two different XAI methods, SHAP and LIME, have been implemented to explain the decisions of a reinforcement learning agent playing Atari's Breakout game [25]. Another work [26] explores the explainability of AI agents' policies by taking inspiration from how humans explain, and examines feature attributions to explain how input state features influence the predicted actions. Furthermore, a SHAP method has been implemented to explain a DRL agent's actions [27]; the results demonstrate that SHAP values are a suitable approach for interpreting the decisions of the DRL agent, and that the contributions of the different state features can be logically explained. XAI methods can also be combined to explain DRL model decisions. For example, three XAI methods, SHAP, LIME, and Linear Model Trees (LMTs), regression trees that use a linear function at their leaves to form predictions, are applied together in [28]. That research highlights the utility of XAI methods for understanding the reasoning behind the decision-making process of the DRL black box; however, the authors also caution that these methods might not be suitable for real-time explanations due to their slow processing speed.

In a review by Heuillet et al. [29], the authors provide an overview of recent studies on explainable reinforcement learning, which aims to make these models more understandable. The review categorizes explainable reinforcement learning methods in terms of the RL algorithms employed, the decision process, and the explanation type, such as images, text, or diagrams.

Explainable deep reinforcement learning has also been used to better understand a trained model for UAV autonomous path planning [30]. The authors proposed a technique based on feature attribution, in which textual and visual explanations are generated to allow end-users to understand the reasons behind a particular behaviour of the agent.

Another work [31] provides an overview of explainable reinforcement learning methods by classifying and assessing current XAI methods. It finds that most explainable reinforcement learning methods work by mimicking and simplifying a complex model rather than by designing an intrinsically simple one.

Additionally, reinforcement learning has been visualized interactively for debugging RL policies and interpreting RL algorithms [32]. For example, visualization of optimization landscapes has been used to explain the learning dynamics of actor-critic algorithms.

The interpretability and effectiveness of advanced AI models in specific applications, such as counter-drone operations, are crucial for user acceptance of AI-driven decision-making systems [33]. Countering drones in particular calls for transparent AI systems whose decisions are explainable and justifiable.

Previously, a DRL model for countering drones has been studied [34], demonstrating that deep reinforcement learning can be an effective method for countering a drone in 3D space using another drone. Despite their effectiveness, DRL models raise concerns regarding the transparency and interpretability of their decision-making processes, and this lack of transparency can limit trust in the counter-drone system. This research addresses these challenges by applying Shapley Additive Explanations (SHAP) [19], a game-theoretic method for interpreting complex machine learning models, to analyze DRL model decisions in counter-drone systems. Our aim in applying SHAP is to enhance the transparency and trustworthiness of DRL-based counter-drone systems. We extend our previous work [35], which implemented saliency maps to explain DRL model decisions, by using Shapley Additive Explanations to identify the regions of an input image that are most important for the predictions made by the DRL agent. The main contribution of this research is the use of Shapley Additive Explanations to explain a deep reinforcement learning agent's decisions by visualizing the experience replay in counter-drone operations. Moreover, the deep reinforcement learning model used here is deliberately simple in terms of the number of actions and the reward function, without a very deep or complex neural network; this enables the agent to converge faster and complete episodes successfully. Additionally, absolute SHAP values are calculated instead of the default SHAP values, and the results are investigated by comparing the agent's decisions across different state input images. Absolute SHAP values offer a clear view of the magnitude of each feature's contribution to the agent's decisions. By combining advanced AI techniques such as DRL with transparent decision-making processes, we aim to ensure the continued safety and security of our skies in the face of evolving drone technologies.

The remainder of this manuscript is organized as follows. Section 2 covers the background. Section 3 presents the experimental setup, the tools used to train the DRL algorithm, and the details of the explainable AI method, SHAP. In Section 4, the DRL training results are presented, and the Shapley Additive Explanations results are explained and analyzed in detail, followed by the discussion and the conclusions of this research.

2 Background

Reinforcement learning (RL) is a machine learning approach in which an agent learns through trial-and-error experience in an environment. The agent starts without any prior knowledge of the environment and interacts with it at discrete time steps t. This interaction is presented in Fig. 1. The current state is denoted \(S_t\) and the state space S. The action taken at a given time step and the action space are denoted \(A_t\) and \({A}(S_t)\), respectively. The reward received at each time step is denoted \(R_{t+1}\) and results in a new state \(S_{t+1}\). A full explanation of the fundamental concepts of RL can be found in [36, 37]. In this paper, the agent is a quadcopter that is rewarded by the environment at each time step t; its goal is to maximize the cumulative reward.

Fig. 1
figure 1

The agent-environment interaction in reinforcement learning
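To make the interaction in Fig. 1 concrete, the following Python sketch shows one episode of the loop under the classic OpenAI Gym interface; `env` and `policy` are placeholders rather than the actual code used in this work.

```python
# Minimal sketch of the agent-environment loop in Fig. 1 (classic OpenAI Gym interface).
# `env` and `policy` are placeholders, not the exact objects used in this work.
def run_episode(env, policy):
    state = env.reset()                               # initial state S_0
    cumulative_reward, done = 0.0, False
    while not done:
        action = policy(state)                        # choose A_t from the action space A(S_t)
        state, reward, done, info = env.step(action)  # environment returns R_{t+1} and S_{t+1}
        cumulative_reward += reward                   # the agent's goal is to maximize this return
    return cumulative_reward
```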

Reinforcement learning algorithms differ in their characteristics, such as their approach to solving RL problems, the type of information they use, and their requirements. RL can be broadly categorized into three main branches: value-based, policy-based, and model-based approaches. In value-based RL, the agent seeks the policy that maximizes a value function describing the long-term reward of a particular state or action. In the policy-based approach, the agent focuses on directly learning a policy, which is a mapping from states to actions. In model-based reinforcement learning, a model of the environment is either provided to the agent or learned by the agent in order to perform tasks in the environment.

Reinforcement learning algorithms can also be classified from different perspectives: model-based versus model-free methods, value-based versus policy-based methods (or a combination of the two), Monte Carlo (MC) versus temporal-difference (TD) methods, and on-policy versus off-policy methods. Figure 2 presents a taxonomy of reinforcement learning algorithms, which includes: DP (Dynamic Programming), TD (Temporal Difference), MC (Monte Carlo), MCTS (Monte Carlo Tree Search) [38], Q-learning [39], I2A (Imagination-Augmented Agent) [40], DQN (Deep Q-Network) [41], Quantile QT-Opt [42], C51 (Categorical 51) [43], TRPO (Trust Region Policy Optimization) [44], ACKTR (Actor-Critic using Kronecker-Factored Trust Region) [45], AC (Actor-Critic) [46], A2C (Advantage Actor-Critic) [47], A3C (Asynchronous Advantage Actor-Critic) [47], DDPG (Deep Deterministic Policy Gradient) [48], TD3 (Twin Delayed DDPG) [49], SAC (Soft Actor-Critic) [50], REINFORCE [51], and world models [52].

Fig. 2
figure 2

Taxonomy of reinforcement learning algorithms [53]

Deep reinforcement learning (DRL) combines reinforcement learning with deep neural networks used as function approximators, enabling agents to learn how to achieve their goals. DRL uses a deep neural network as the agent's decision algorithm. Together with experience replay memory, deep RL has achieved super-human performance in video and board games. For instance, the deep Q-network (DQN) method was published by DeepMind [41]; its main characteristic is the use of a deep convolutional neural network to approximate the optimal action-value function, defined as the expected return when starting from a particular state and taking a particular action. DQN iteratively updates action values towards target values. Target values are the desired values that the agent aims to approximate during the learning process; they are computed using the Bellman equation, which accounts for the immediate reward obtained from an action and the discounted maximum future Q-value of the next state, and they serve as the reference for updating the parameters of the Q-network during training. The deep RL solution used here is also based on the double deep Q-network (DDQN) [54], an extension of DQN, in which the online network selects the action that maximizes the Q-value and the target network evaluates that action; Q-values are estimates of the future reward of an action executed in a given state.
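As an illustration of how these targets are formed, the following hedged sketch computes DDQN training targets with NumPy; the network handles and array shapes are assumptions, not the exact implementation used in this work.

```python
import numpy as np

def ddqn_targets(online_q, target_q, states, actions, rewards, next_states, dones, gamma=0.99):
    """Hedged sketch of DDQN target computation (network objects and shapes are assumed).

    online_q / target_q: callables mapping a batch of states to Q-values of shape (batch, n_actions).
    """
    batch = np.arange(len(actions))
    # Double DQN: the online network SELECTS the next action ...
    best_next = np.argmax(online_q(next_states), axis=1)
    # ... and the target network EVALUATES it.
    next_q = target_q(next_states)[batch, best_next]
    # Bellman target: immediate reward plus discounted future value (zero at terminal states).
    targets = rewards + gamma * next_q * (1.0 - dones)

    q_values = online_q(states).copy()   # only the taken actions are moved towards the targets
    q_values[batch, actions] = targets
    return q_values
```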

In this paper, the double deep Q-network (DDQN) algorithm is combined with a dueling network architecture and prioritized experience replay (PER). In the dueling network architecture presented in [55], there is no need to estimate the value of each action choice as is done in DQN and Double-DQN. Instead of following the convolutional layers with a single sequence of fully connected layers, the dueling network has two streams: one estimates the state value and the other estimates the advantage of each action. Moreover, PER was introduced in [56] to make the agent learn faster. With standard experience replay, experiences are sampled uniformly from the replay memory and transitions are replayed without considering their significance; PER instead prioritizes the experiences so that important transitions are replayed more frequently and the agent learns more efficiently. More details on DDQN, the dueling network architecture, and PER can be found in [11].
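A toy sketch of proportional prioritized replay is given below; the class and its parameters (alpha, beta) follow the general idea of [56] and are not taken from this paper's implementation.

```python
import numpy as np

class ProportionalReplay:
    """Toy prioritized replay buffer (proportional variant), for illustration only."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current maximum priority so they are replayed at least once.
        max_p = max(self.priorities, default=1.0)
        self.buffer.append(transition)
        self.priorities.append(max_p)
        if len(self.buffer) > self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size, beta=0.4):
        p = np.array(self.priorities) ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct the bias introduced by non-uniform sampling.
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```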

The details of the agent's neural network are presented in Section 3.2.

3 Experimental setup

In this section, the explainable AI system and the tools and methods required for training and testing the deep reinforcement learning algorithm are presented. The DRL model is trained on a desktop PC with an NVIDIA GeForce RTX 3060 Ti graphics card with 8 GB of VRAM.

3.1 Tools

3.1.1 AirSim simulator

AirSim [57] is an AI research platform for experimenting with deep learning, computer vision, and reinforcement learning algorithms for autonomous vehicles, built on Unreal Engine [58]. AirSim offers a wide range of environments such as mountains, blocks, a neighbourhood, and a city environment. In this paper, the neighbourhood environment is selected for training and testing the deep reinforcement learning algorithm.

Fig. 3
figure 3

Environment Setup x-y-z Directions

Fig. 4
figure 4

The dueling Q-network [55]

In the Unreal Engine coordinate system [58], the orientation is defined with the x-axis pointing forward, the y-axis pointing to the right, and the z-axis pointing upward. In AirSim, the coordinate system follows the North-East-Down (NED) convention, in which the north direction is aligned with the x-axis of the Unreal Engine. The orientation of the x-y-z directions is presented in Fig. 3.

3.1.2 OpenAI-Gym

OpenAI-Gym [59] is an open-source framework for reinforcement learning tasks. It provides a set of tools for developing and comparing reinforcement learning algorithms and is compatible with TensorFlow [60], an open-source platform for designing artificial neural networks. The Gym library offers a collection of environments for testing reinforcement learning algorithms; these environments share a common interface that allows general algorithms to be written, and users can also create custom environments for training and testing RL algorithms.
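As an illustration of the custom-environment route mentioned above, the skeleton below shows what a Gym wrapper around the AirSim drone task might look like; the class name, shapes, and placeholder method bodies are assumptions consistent with Sections 3.2.1 and 3.2.2, not the code used in this work.

```python
import gym
import numpy as np
from gym import spaces


class AirSimDroneEnv(gym.Env):
    """Illustrative skeleton of a custom Gym environment wrapping AirSim."""

    def __init__(self):
        super().__init__()
        # Three discrete actions (move forward, yaw left, yaw right), see Section 3.2.2.
        self.action_space = spaces.Discrete(3)
        # One-channel depth image of 144 x 256 pixels, see Section 3.2.1.
        self.observation_space = spaces.Box(low=0, high=255, shape=(144, 256, 1), dtype=np.uint8)

    def reset(self):
        # A real implementation would reset the simulator and return the first depth image.
        return np.zeros(self.observation_space.shape, dtype=np.uint8)

    def step(self, action):
        # A real implementation would send the action to AirSim, grab the new depth image,
        # and compute the reward of Table 4.
        obs = np.zeros(self.observation_space.shape, dtype=np.uint8)
        reward, done, info = 0.0, False, {}
        return obs, reward, done, info
```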

3.1.3 Shapley Additive Explanations (SHAP)

SHAP [61] is a collaborative open-source initiative implemented as a Python package to explain progressively more complex models (Fig. 4).

3.2 DRL model

The deep reinforcement learning model is built using the dueling network architecture and trained with the DDQN algorithm (presented in Appendix A) and prioritized experience replay. The image, described in Section 3.2.1, is the input of a convolutional neural network (CNN); the resulting tensor then feeds three consecutive ReLU-activated dense layers of 256 units each. The output layer is a dense layer whose outputs are the action values. In other words, the model architecture includes two convolutional layers with ReLU activation functions, the CNN output is flattened, and it is followed by fully connected dense layers. The agent's neural network model is shown in Fig. 5, and the hyperparameters used to train the DRL model are presented in Table 1. In addition, Fig. 6 shows the neural network representation with the dueling architecture: the network consists of convolutional layers followed by fully connected dense layers, and the dueling architecture outputs are the state value V and the advantage of each action A [55]. The state value V represents the value of being in state s, while A indicates how much better or worse each action is compared with the average action in state s. The two streams are combined to produce a single Q function, shown in (1). The advantage of the dueling architecture is that the agent can learn which states are valuable without having to learn the effect of each action in each state. The dueling network architecture can be seen in Fig. 4.

$$\begin{aligned} Q(s,a;\theta ,\alpha ,\beta ) = V(s;\theta , \beta ) + A(s,a;\theta ,\alpha ) \end{aligned}$$
(1)

where \(\theta \) denotes the parameters of the convolutional layers, while \(\alpha \) and \(\beta \) are the parameters of the two streams of fully connected layers.
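A hedged Keras sketch of how the two streams can be combined is shown below; the 256-unit dense layers mirror the description in this section but are illustrative, and the mean-advantage subtraction follows [55] (dropping it recovers the plain sum in (1)).

```python
import tensorflow as tf

def dueling_q(features, n_actions, units=256):
    """Sketch of the dueling head: combine value and advantage streams into Q-values.

    `features` is the flattened convolutional output, shape (batch, feature_dim);
    layer sizes are illustrative, not the exact architecture of Fig. 6.
    """
    # Value stream V(s; theta, beta): one scalar per state.
    v = tf.keras.layers.Dense(units, activation="relu")(features)
    v = tf.keras.layers.Dense(1)(v)
    # Advantage stream A(s, a; theta, alpha): one value per action.
    a = tf.keras.layers.Dense(units, activation="relu")(features)
    a = tf.keras.layers.Dense(n_actions)(a)
    # Subtracting the mean advantage, as in [55], keeps V and A identifiable.
    return v + (a - tf.reduce_mean(a, axis=1, keepdims=True))
```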

Fig. 5
figure 5

Agent deep NN model

                                          

Table 1 Hyperparameters of the training
Fig. 6
figure 6

Agent dueling architecture

                                    

Fig. 7
figure 7

Depth image taken from the agent’s front depth camera

3.2.1 Drone agent states

The agent state is composed of the depth image shown in Fig. 7, with 256 \(\times \) 144 pixels, captured continuously. The depth image assigns a depth value to each pixel, indicating the distance between the camera and the corresponding point in the scene. When viewing the first-person depth image in AirSim, objects closer to the camera appear darker, while objects farther away appear lighter. In Fig. 7, the target drone therefore appears dark, while objects at a distance from the agent drone appear lighter.
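The depth input can be obtained from AirSim roughly as in the sketch below; the camera name and the clipping distance are assumptions that depend on the simulator settings, not values taken from this paper.

```python
import numpy as np
import airsim

# Hedged sketch of grabbing the front depth image from AirSim; the camera name
# ("front_center" in recent releases, "0" in older ones) and the 100 m clip are assumptions.
client = airsim.MultirotorClient()
client.confirmConnection()

responses = client.simGetImages([
    airsim.ImageRequest("front_center", airsim.ImageType.DepthPerspective,
                        pixels_as_float=True, compress=False)
])
depth = airsim.list_to_2d_float_array(
    responses[0].image_data_float, responses[0].width, responses[0].height)

# Map depth in meters to 0-255 grayscale so that near objects (small depth) appear dark.
depth = np.clip(depth, 0.0, 100.0)
depth_image = (depth / 100.0 * 255.0).astype(np.uint8)   # 144 x 256 in our setup
```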

In addition, a grid is drawn on the image when the drone approaches the geo-fence limits in the \(x\)-\(y\) directions. The geo-fence limits are shown in Table 2. The geo-fence ensures safety by defining the limits within which the drone must operate in the environment. The grid starts being drawn on the image when the distance between the drone and the geo-fence limits is lower than or equal to 1 meter, and its thickness increases as the drone moves towards the geo-fence limits. An example of the grid image is shown in Fig. 8; the grid is drawn over the full image as the drone moves closer to the geo-fence limits in the \(x\)-\(y\) directions.
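The grid overlay can be reproduced with a simple routine like the hypothetical sketch below; the grid spacing, pixel intensity, and the 1 px to 4 px thickness mapping are assumptions for illustration only.

```python
import numpy as np

def overlay_geofence_grid(image, dist_to_fence, warn_dist=1.0, spacing=16):
    """Hypothetical sketch: draw a grid when the drone is within `warn_dist` meters of a
    geo-fence; the grid gets thicker as the fence gets closer (spacing/intensity assumed)."""
    if dist_to_fence > warn_dist:
        return image
    # Thickness grows from 1 px at the warning distance up to 4 px at the fence itself.
    thickness = int(round(np.interp(warn_dist - dist_to_fence, [0.0, warn_dist], [1, 4])))
    out = image.copy()
    for r in range(0, out.shape[0], spacing):
        out[r:r + thickness, :] = 255      # horizontal grid lines
    for c in range(0, out.shape[1], spacing):
        out[:, c:c + thickness] = 255      # vertical grid lines
    return out
```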

Table 2 Geo-fence Specifications

3.2.2 Drone control actions and the target drone

The agent can perform three different actions: moving forward, yawing left, and yawing right. The actions are detailed in Table 3. The target drone is stationary and keeps its altitude at 4 meters. The initial distance between the agent and the target is 8 meters.

3.2.3 Reward function

The reward function includes incremental rewards that penalize the agent during the episode and a reward of 100 for a successful episode. At each step before the end of an episode, a penalty is applied to promote rapid counter-drone action; this penalty starts at 1 and increases by 1 every 50 steps until it reaches 4, as indicated by the negative values in the reward function presented in Table 4. In addition, an intermediate step reward \(\Delta Distance\) is added, representing the change in distance to the target between the current step and the previous step; this value is positive when the agent has approached the target in the last step and negative when it has moved away. A time limit is also set to restart the episode after 180 steps, at which point the agent's reward becomes the sum of the rewards accumulated up to the 180th step. In this paper, no penalty is applied for collisions with obstacles or geo-fences in the environment.
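The reward shaping described above can be summarized by the hedged sketch below; variable names are ours, and the logic mirrors Table 4 rather than reproducing the paper's code.

```python
def step_reward(step, prev_distance, distance, caught):
    """Sketch of the reward in Table 4 (variable names are ours, not from the original code).

    step          : step index within the episode
    prev_distance : distance to the target at the previous step (m)
    distance      : distance to the target at the current step (m)
    caught        : True if the target drone was intercepted at this step
    """
    if caught:
        return 100.0                       # terminal reward for a successful episode
    # Per-step penalty: starts at 1 and grows by 1 every 50 steps, capped at 4,
    # encouraging the agent to counter the target quickly.
    penalty = min(1 + step // 50, 4)
    # Delta-distance term: positive when the agent moved closer to the target.
    delta_distance = prev_distance - distance
    return -penalty + delta_distance
```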

Fig. 8
figure 8

Fences: grid on the full image

Table 3 Actions

3.3 SHAP experiment

Figure 9 shows the process before and after training the DRL agent, as well as the explainability part of the experiment. First, the DRL agent is trained using the dueling network and prioritized experience replay, and the experiences, i.e., state (s), next state (\(s'\)), reward (r), and action (a), are recorded during training. After training, the model is saved and passed to the SHAP Deep Explainer, along with data sampled from the recorded experiences. The data is divided into two parts: one for building the SHAP Deep Explainer and the other for calculating SHAP values. The SHAP library was used to calculate feature importance, following these steps (a minimal code sketch is given after the list):

  • Initialize the SHAP DeepExplainer with the trained model and the data.

  • Calculate SHAP values for the test data.

  • Aggregate SHAP values to determine feature importance.
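A minimal sketch of these steps is given below; `model` and `states` are placeholders for the trained Q-network and the recorded image states, and the sample sizes follow the figures reported in this section.

```python
import numpy as np
import shap

# Minimal sketch of the SHAP step (`model` and `states` are placeholders; `states`
# is an array of recorded image states of shape (N, 144, 256, 1)).
background = states[np.random.choice(len(states), 100, replace=False)]  # background samples
test_states = states[:25]                                               # states to explain

explainer = shap.DeepExplainer(model, background)   # 1. initialize with the trained model
shap_values = explainer.shap_values(test_states)     # 2. SHAP values, one array per action output

# 3. aggregate: absolute SHAP values measure the magnitude of each pixel's contribution
abs_shap = [np.abs(sv) for sv in shap_values]
shap.image_plot(shap_values, test_states)             # standard SHAP image visualization
```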

Explaining models with large amounts of data can be challenging due to the heavy computation involved in SHAP analysis. The TensorFlow, Keras, and SHAP libraries were used. Parameters such as the number of background samples and test samples were chosen based on the dataset size. The amount of data can be selected depending on the memory of the GPU used, and it ranges from 100 to 1000 image states in this paper. The execution time for calculating SHAP values was also measured to assess the computational efficiency of the proposed method: calculating SHAP values for 25 test samples took approximately 11 seconds. The dataset is described as follows:

  • Data Source: The dataset includes state-action pairs collected from simulated drone interactions with the environment in Airsim.

  • Preprocessing: States were preprocessed to normalize pixel values and resize images to a fixed size of 144 \(\times \) 256 pixels (a minimal sketch follows this list).

  • Dataset Size and Features: The dataset contains 10,000+ samples, with each state represented as an image.
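The preprocessing step can be sketched as follows; the use of TensorFlow's resizing routine and the [0, 1] scaling are assumptions consistent with the description above, not the exact code used here.

```python
import tensorflow as tf

def preprocess_state(depth_image):
    """Hedged sketch of the state preprocessing (exact routine is an assumption)."""
    img = tf.convert_to_tensor(depth_image, dtype=tf.float32)
    if img.shape.rank == 2:
        img = img[..., tf.newaxis]             # add a channel axis
    img = tf.image.resize(img, [144, 256])      # resize to the fixed 144 x 256 input size
    return (img / 255.0).numpy()                 # normalize pixel values to [0, 1]
```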

Table 4 Reward function
Fig. 9
figure 9

Explainability process

4 Results

Fig. 10
figure 10

Cumulative rewards vs time steps

4.1 DRL method training results

In this section, the training results are presented. Figure 10 shows the training results over 26K steps. The training is set at 4 m altitude, where both the learner drone and the target drone fly without any change in altitude. Figure 11 shows the cumulative reward of each episode during training. Both Figs. 10 and 11 show that negative cumulative rewards are very high at the beginning of training, indicating that the agent is exploring the environment and crashing into obstacles. In Fig. 11, the episodes are color-coded according to the agent's outcome: success in blue, crash (failure) in red, and time limit in cyan. The agent also reaches the time limit many times in the first few hundred episodes, receiving cumulative rewards below \(-400\). However, the agent learns how to avoid obstacles and catch the target before 500 episodes, and the cumulative reward stabilizes around 100. Even after the agent learns how to avoid obstacles, the exploration rate (epsilon) gradually decreases over time through annealing, which is set to 15,000 steps. The epsilon-greedy policy is applied to manage the level of exploration throughout the learning process: the exploration rate starts at a maximum value of 1 and gradually decreases to a minimum value of 0.01 by the time 15,000 steps are completed. In other words, the agent is still allowed to explore with a small amount of randomness, which might lead to crashes into obstacles in the later stages of training.
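Assuming a linear annealing schedule, the exploration rate described above behaves as in this small sketch; the linear form is our assumption, while the 1.0, 0.01, and 15,000-step values are those reported here.

```python
def epsilon(step, eps_max=1.0, eps_min=0.01, anneal_steps=15_000):
    """Exploration rate for the epsilon-greedy policy, assuming linear annealing."""
    fraction = min(step / anneal_steps, 1.0)
    return eps_max + fraction * (eps_min - eps_max)

# epsilon(0) == 1.0, epsilon(15_000) == 0.01, and the value stays at 0.01 afterwards,
# so a small amount of random exploration remains in the later stages of training.
```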

Fig. 11
figure 11

Episode rewards vs episode number

Fig. 12
figure 12

Episode steps vs episode number

Fig. 13
figure 13

Initial image state, action and next state

In addition, the number of steps performed in each episode during training is presented in Fig. 12. The agent spends a lot of time exploring at the beginning and often reaches the maximum episode length of 180 steps. However, after 500 episodes, the agent spends less time in each episode, and the number of time steps becomes stable, consistently below 25.

4.2 Shapley additive explanations results

The results of the Shapley Additive Explanations reveal several findings relevant to explainable deep reinforcement learning research. The decisions made by the deep reinforcement learning (DRL) agent become explainable after following the SHAP process presented in Section 3, i.e., training the DRL model and applying the SHAP Deep Explainer. In this section, several plots are investigated. First, the current image state and the next image state are presented to show what happens after the agent's decision; for example, if the agent turns left, it can be seen whether the obstacle is avoided or not. In addition, SHAP plots are presented to demonstrate the importance of SHAP values for the agent's action. An absolute-value colorbar is used to show the importance of SHAP values clearly, ranging from white (lower importance) to red (higher importance). This visualization helps in understanding how the DRL agent chooses its actions and why the remaining actions are not selected. Additionally, the SHAP values are filtered and visualized with a threshold based on standard deviations from the mean: values less than (\(\text {mean} - \text {threshold}\)) are colored blue, and values greater than (\(\text {mean} + \text {threshold}\)) are colored red. This approach allows us to plot the SHAP values that affect the agent's decision positively.
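The absolute-value and threshold-based filtering can be expressed as in the sketch below; `shap_values_action` is a placeholder for the SHAP values of the selected action, and the threshold k corresponds to the 3 or 5 standard deviations used in the plots that follow.

```python
import numpy as np

def filter_shap(shap_values_action, k=5.0):
    """Sketch of the absolute-value and threshold-based filtering of SHAP values.

    `shap_values_action` holds one SHAP value per pixel for the chosen action;
    k is the threshold in standard deviations (3 or 5 in the plots below).
    """
    abs_shap = np.abs(shap_values_action)              # magnitude of each pixel's contribution
    mean, std = shap_values_action.mean(), shap_values_action.std()
    positive = shap_values_action > mean + k * std     # plotted in red (supports the action)
    negative = shap_values_action < mean - k * std     # plotted in blue (works against the action)
    return abs_shap, positive, negative
```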

Several tests have been carried out by selecting different image states to explain the DRL agent's decisions. Figure 13 shows the current state, the next state, and the action taken by the DRL model. In this image, the target drone is seen from the agent's front camera, and the agent is expected to move forward to catch it. Figure 14 shows the absolute SHAP values obtained with the SHAP Deep Explainer. As seen in the SHAP plot, a high intensity of red indicates the importance of the SHAP values on the image state. The DRL agent focuses on the target drone and its surroundings, which are highlighted in red, indicating high importance. The DRL agent therefore chose the first action, going forward. This can also be seen in Fig. 15, which shows the SHAP values filtered within \(\pm 5\) standard deviations from the mean. It can also be seen that the agent identifies certain objects in the SHAP plot as obstacles, such as plants and trees, or as obstacle-free regions, such as the sky.

Fig. 14
figure 14

SHAP deep explainer plot

Fig. 15
figure 15

Highlighted pixels by SHAP analysis

Fig. 16
figure 16

Initial image state, action and next state

Another image state, containing grids drawn on the image to represent the virtual geofences, is selected to investigate the SHAP explanations. Figure 16 illustrates the image state observed by the DRL agent, with grids drawn on the image. The agent decides to yaw left to avoid crossing the geofences shown in the image, proceeding to a next state without grids. The reason can be seen in the SHAP plot shown in Fig. 17: the geofences are identified by the DRL agent, and the importance of the SHAP values is drawn on the image. As seen in this figure, going forward is given high importance, shown by the intense red color, which means that the DRL agent takes the geofences into account; the agent is therefore expected to turn left or right. As seen in Fig. 16, an obstacle such as the tree at the top right of the state image is given high importance, shown by intense red color corresponding to higher absolute SHAP values. The agent decides to turn left, which avoids the geofences. Figure 18 also presents the SHAP values filtered within a range of \(\pm 3\) standard deviations from the mean.

Moreover, Fig. 19 shows an image containing obstacles such as a tree and a road sign seen through the agent's front camera, and the next state in which the target drone becomes visible. This demonstrates how the DRL agent's decision to yaw left leads it to face the target. The explanation can be found in the SHAP Deep Explainer plot presented in Fig. 20: the agent's decision to yaw left is influenced by the tree obstacle in front of it and on its right side. The obstacles also receive high absolute SHAP values, indicating high importance, as in the previous SHAP plots. Figure 21 presents the SHAP values filtered within a range of \(\pm 5\) standard deviations from the mean: values above the threshold that affect the decision positively are shown in red, while those that affect it negatively are shown in blue.

Fig. 17
figure 17

SHAP deep explainer plot

5 Discussion

The results presented in Section 4 demonstrate the effectiveness of a deep reinforcement learning agent for countering drones and of making its decisions explainable through Shapley Additive Explanations. By visualizing the importance of features in the image states, SHAP helps in understanding why the agent selects certain actions over others. The agent's identification of obstacles and geofences, represented as grids drawn on the image, is reflected in the SHAP values, indicating their influence on the agent's decisions. Moreover, high importance is assigned to the target drone in the SHAP plots, indicating that the agent recognizes the target drone.

Fig. 18
figure 18

Highlighted pixels by SHAP analysis

Fig. 19
figure 19

Initial image state, action and next state

From the various plots presented for different states, it is clear that the agent's decisions are affected by nearby obstacles. However, the agent's random behavior during exploration and exploitation, explained in detail in Section 4, can also affect the accuracy of the SHAP plot explanations. As previously discussed, a small amount of randomness remains even after annealing, and this can lead to action selections that diverge from the model's output predictions. Understanding the impact of random behavior on decision-making accuracy can lead to more robust exploration-exploitation strategies that reduce the unpredictability of the agent's actions.

The SHAP analysis uncovered the key pixels influencing the model's decision-making process. Specific regions of the input images, such as edges and drone shapes, were critical for the agent's actions. By visualizing SHAP values, we identified which parts of the image affected the decisions to turn left or right or to move towards and neutralize the target. Pixels corresponding to drone-like shapes in specific image regions were highly influential; these pixels align with the task of identifying and responding to drones in the environment. As the agent encounters and learns from varied and challenging situations during training, it becomes more capable of making stable and reliable decisions, which increases its robustness and performance in real-world applications. Similarly, the agent's changing prioritization of pixels over time reflects an evolving understanding and weighting of different pixels, indicating the agent's learning dynamics.

Additionally, saliency maps are calculated based on the gradients of the model's output with respect to the image input. These gradients identify the regions of an input image that are most important for the predictions made by the DRL agent. In Fig. 22a, b, and c, it can be seen that different areas of the image have different color intensity levels; these regions show which sections of an input image are most influential for the agent's predictions.
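A hedged sketch of this gradient computation with TensorFlow is shown below; `model` is a placeholder for the trained Q-network, not the code used to produce Fig. 22.

```python
import numpy as np
import tensorflow as tf

def saliency_map(model, state):
    """Gradient-based saliency sketch: |dQ(chosen action)/d input pixel| (`model` is a placeholder)."""
    x = tf.convert_to_tensor(state[np.newaxis], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        q_values = model(x)
        chosen = tf.reduce_max(q_values, axis=1)   # Q-value of the greedy action
    grads = tape.gradient(chosen, x)
    return tf.abs(grads)[0].numpy()                 # pixel-wise importance map
```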

Fig. 20
figure 20

SHAP deep explainer plot

Saliency maps provide a local explanation focused on individual input samples. However, although they enable quick visualization of feature importance, they do not offer the comprehensive feature-importance analysis and depth of explanation provided by SHAP.

Fig. 21
figure 21

Highlighted pixels by SHAP analysis

Fig. 22
figure 22

Saliency Maps for the Corresponding Image States

The findings from the SHAP and saliency map analyses suggest future enhancements to the DRL agent and its training process. By identifying the pixels with the greatest impact on the agent's decisions, we can focus on improving the representation and processing of these pixels. For example, if the SHAP values indicate that edge detection is crucial for drone identification, future iterations of the model could include advanced edge detection techniques or improved image preprocessing steps.

6 Conclusions and future work

In this paper, a DRL agent is trained in the AirSim simulator using a dueling neural network and prioritized experience replay. After training the agent, the experiences are stored and data is sampled for explainability. The DRL agent's decisions are investigated using Shapley Additive Explanations. In addition, instead of using the default SHAP values to explain the agent's decisions, absolute SHAP values are calculated to measure directly the magnitude of each feature's effect on the model's output. Adding SHAP explanations makes the counter-drone system more trustworthy and transparent, helping end-users understand why the DRL agent makes certain decisions through clear visuals of the factors that influence those decisions.

The ability to explain the DRL agent’s decisions through SHAP and saliency maps also has practical implications for trust and accountability. In safety-critical applications such as counter drone operations, understanding why an agent acts in a certain way is essential for gaining the trust of users and stakeholders. It ensures that the agent’s actions can be justified and are based on logical interpretations of the input data. This transparency can facilitate regulatory approval and user acceptance, as well as provide a basis for refining and improving the model.

The combination of DRL with explainable AI techniques such as SHAP shows great promise for developing reliable and effective counter-drone systems. Explainability methods can make the DRL agent's decision-making process more transparent and can also provide important feedback on the DRL model's performance. By analyzing and interpreting the model's behavior, we can develop more effective, reliable, and trustworthy DRL agents for complex, real-world tasks such as counter-drone operations. By utilizing AI capabilities while prioritizing transparency and explainability in AI decision-making, airspace security can be enhanced through the deployment of explainable AI-powered counter-drone solutions. However, it is necessary to further explore the effectiveness of alternative explainability methods. In future work, we will explore state-of-the-art explainability methods to identify areas for improvement and validate the effectiveness of our method in offering clear and reliable explanations. The proposed method will be tested with different explainability approaches, and the current work will be expanded to compare approaches in terms of computation time, accuracy, consistency, transparency, and interpretability. We will also investigate the computational time using a fast spatial-temporal information compression algorithm, which is capable of accurately modeling nonlinear small-sample data at low computational cost. Additionally, we will employ evaluation metrics such as the mean absolute error, which measures the forecasting accuracy of the model, and the root mean square error, which measures the deviation between predicted and true values; the fast spatial-temporal information compression algorithm is expected to significantly reduce both errors and the computational cost. This evaluation will assess the accuracy of the DRL model's predictions and of the SHAP values.