1 Introduction

The vulnerability of reinforcement learning (RL) to adversarial attacks is well known [1]. In a typical attack scenario, an adversary attempts to mislead a deployed RL system at test time by injecting crafted samples. Adversarial attacks often involve slight perturbations of the observations received from the environment, producing behavioral changes that can lead to catastrophic consequences [2]. Such vulnerabilities pose a significant threat in real-world settings [3, 4]; for example, they can cause autonomous vehicles to swerve into oncoming traffic [5]. These attacks have an immediate impact on the perceived environmental dynamics. For instance, when the agent is not under attack, an action a in state s leaves the agent in \(s'\), whereas under attack the same action a in s leads the agent to a different state \(s'_{\delta }\). Consequently, the agent may perceive that it is in a different state than its real one. This opens the door to detection methods capable of identifying the environmental alterations produced by attacks. In fact, these alterations prompt us to argue that adversarial attack detection and context detection in RL are closely linked [6, 7]: any sudden change in the environment dynamics is a context change, and in an adversarial RL setting such a change may mean that the agent goes from a free-of-attack context to an under-attack context. Detecting such changes as early as possible is therefore mandatory in safety-critical domains to prevent dangerous situations or performance degradation.

Motivated by the current lack of research, this paper provides a novel perspective on the adversarial detection problem in RL. Specifically, we approach the problem from a sequential perspective, in which attacks appear as abrupt changes in the input. To address this issue, we build on clustering-based techniques from sequential analysis [8, 9] to develop a model-free countermeasure for RL that effectively identifies disturbances in the environment’s dynamics. Compared to other methods [6, 10, 11, 12], the proposed detection technique captures the sequential nature of an MDP, requires no prior knowledge of the transition and reward dynamics, detects in real time the changes that adversarial attacks induce in the agent's trajectory, and performs a tractable univariate analysis of the environmental dynamics in the search for abrupt changes. Our detection system consists of two phases. First, it builds a set of clusters from transitions collected in a free-of-attack scenario. Then, it discriminates among the perceived environmental transitions, labeling each as anomalous or normal based on a distance metric and a predefined threshold. This threshold can be adjusted to optimize results for the specific application domain. To illustrate this, we employ ROC curves, which visualize the trade-off between the true positive and false positive rates.

This paper is structured as follows: Section 2 provides a concise explanation of RL and adversarial attacks for better comprehension of subsequent content. Section 3 outlines existing research on adversarial RL and various defense methods. Section 4 introduces a novel approach to detecting attacks by framing the problem as a context detection issue. Section 5 describes our clustering-based detection approach, and Section 6 reports on the evaluation performed. Finally, Section 7 provides a summary of the main findings and future research directions.

2 Background

In this section, we will explain the concepts of reinforcement learning and adversarial attacks relevant to classification and RL systems.

2.1 Reinforcement learning

We examine RL tasks formalized by a Markov Decision Process (MDP) [13]. An MDP is a 4-tuple \(\mathcal M = (S, A, T, R)\) where S is the set of states, A is the set of actions available in each state, R is the reward function \(R: S \times A \rightarrow \Re \) that assigns a reward r to each state-action pair \(\langle s,a \rangle \), and T is the transition function \(T: S \times A \times S \rightarrow [0,1]\), where \(T(s,a,s')\) denotes the probability of transitioning from state \(s \in S\) to state \(s' \in S\) after taking action \(a \in A\). The objective is to learn a control policy, \(\pi (s) = a\), that specifies the action \(a \in A\) to be executed in a given input state \(s \in S\) so as to maximize the return \(J(\pi )\) defined in (1):

$$\begin{aligned} J(\pi ) = \sum _{k=0}^{K} \gamma ^{k} r_{k} \end{aligned}$$
(1)

The discount factor, \(\gamma \), determines how much the agent prioritizes future rewards, with values ranging from 0 to 1 (\(0 \le \gamma \le 1\)). The variable \(r_k\) represents the immediate reward received at step k. The agent interacts with the environment and generates transitions of the form \(\tau = \langle s, a, s', r \rangle \), where \(s \in S\) is the current state, \(a \in A\) is the action taken, \(s'\) is the state to which the agent transitions, and r is the reward received. Assuming an n-dimensional state space S (\(S \subset \Re ^n\)) where every state is a vector \(s = (s_0, s_1, ..., s_n)\), and an m-dimensional action space A (\(A \subset \Re ^m\)) where each action is a vector \(a = (a_0, a_1, ..., a_m)\), a transition \(\tau \) can be restated as \(\tau = \langle s_0, ..., s_n, a_0, ..., a_m, s'_0, ..., s'_n, r \rangle \), with \(\tau \in \Re ^{2n + m + 1}\). We calculate the distance between two transitions, \(\tau _{i}\) and \(\tau _{j}\), using the Euclidean distance in (2):

$$\begin{aligned} d(\tau _{i}, \tau _{j}) = \sqrt{\sum _{k} (\tau _{i,k} - \tau _{j,k})^{2}} \end{aligned}$$
(2)

In the transitions i and j, the k-th component is represented by \(\tau _{i,k}\) and \(\tau _{j,k}\), respectively.
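For concreteness, the flattening of a transition and the distance in (2) can be written as follows. This is a minimal sketch; the helper names are illustrative and not taken from the original implementation.

```python
import numpy as np

def flatten_transition(s, a, s_next, r):
    """Concatenate <s, a, s', r> into a single vector in R^(2n+m+1)."""
    return np.concatenate([np.atleast_1d(s), np.atleast_1d(a),
                           np.atleast_1d(s_next), np.atleast_1d([r])])

def transition_distance(tau_i, tau_j):
    """Euclidean distance between two flattened transitions, as in (2)."""
    return np.linalg.norm(tau_i - tau_j)

# Example with a 2-dimensional state and a scalar action.
tau_1 = flatten_transition(s=[0.1, -0.3], a=[1.0], s_next=[0.2, -0.1], r=1.0)
tau_2 = flatten_transition(s=[0.1, -0.3], a=[1.0], s_next=[0.9,  0.8], r=0.0)
print(transition_distance(tau_1, tau_2))
```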

Fig. 1 RL attack targets. The adversary can compromise three different points of the learning process: the perceived state, the selected action, or the reward function

2.2 Adversarial attacks

Adversarial attacks stem from adversarial examples in supervised learning: inputs with small but deliberate feature perturbations that lead supervised classifiers to make false predictions. Adversarial examples are a concern in machine learning, and particularly in deep learning, because they show that even state-of-the-art models can be vulnerable to small changes in their input data, raising doubts about the robustness and reliability of machine learning models in practical applications. Equation (3) formalizes what an adversarial attack consists of. Let x denote an observation and f a classification system. One can construct an adversarial example for classifier f by solving the following optimization problem.

$$\begin{aligned} \min _{\delta } \ d(x, x+\delta )\qquad \text {subject to} \ f(x) \ne f(x+\delta ) \end{aligned}$$
(3)

The aim is to determine the smallest perturbation, measured by a similarity metric d, that alters the decision made by classifier f with respect to its decision on the original input. The optimization problem in (3) is solved by finding the perturbation \(\delta \) of an observation x that causes the classifier f to output an incorrect class, \(f(x) \ne f(x+\delta )\). The same principles apply to RL. An attacker can target a victim that follows policy \(\pi \) by manipulating the victim’s observations of the environment, with the objective of inducing the victim to select non-preferred actions, i.e., \(\pi (s) \ne \pi (s+\delta )\). This can decrease the cumulative reward or cause severe consequences not only for the victim but also for any neighboring systems or people.
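The paper does not fix a particular crafting method at this point; purely as an illustration, the following minimal PyTorch-style sketch approximates a solution to (3) with the well-known fast gradient sign step. The names model, x, y_true, and epsilon are placeholders, not part of the original implementation.

```python
import torch

def fgsm_perturbation(model, x, y_true, epsilon,
                      loss_fn=torch.nn.CrossEntropyLoss()):
    """One-step perturbation delta = epsilon * sign(grad_x loss), an
    approximate solution to (3) under an L-infinity budget."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y_true)
    loss.backward()
    return epsilon * x.grad.sign()

# In an RL setting, `model` could be the victim's Q-network and `y_true`
# the greedy action pi(s); the adversary then presents s + delta instead of s.
```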

3 Related work

In this section, we present a thorough literature review of previous research related to our approach, divided into two parts. The first part covers documented attacks found in existing literature, followed by a detailed analysis of currently employed detection mechanisms.

3.1 Attacks

In the literature, three distinct attack targets have been identified for injecting adversarial examples into an RL algorithm. Figure 1 illustrates these targets, where each adversary corresponds to a distinct attack. Specifically, the targets are:

  1. State perception. A significant portion of academic works focuses on the manipulation of state perception during both training [14, 15] and testing [16, 17]. The primary aim of these studies is to delay, perturb, or falsify the learning agent’s perceived state s through malicious attacks. For instance, the Uniform strategy [16] implements an iterative attack that misleads the victim in each iteration. Crafting an adversarial example at every iteration makes this process computationally expensive, so some research attempts to minimize the number of attacks performed: for example, an attack takes place only when the Q value surpasses a predetermined threshold [14], or when a vulnerability of the learned policy is found based on the difference between the most and least effective action [18]. More recent methodologies execute attacks using a multi-objective function, which aims to maximize the impact on the victim’s policy while minimizing the number of attacks [19].

  2. Selected action. Attackers can also aim to drive the victim to undesired states by disrupting the action a executed by the learning agent [20]. To execute this type of attack, the attacker must know the path to the malicious states.

  3. Training reward. Other attacks perturb the reward function, i.e., they modify the reward produced by the environment in response to the actions taken by an RL agent [21]. The objective is to modify the victim’s policy, driving it toward undesired states.

While our method can address all three types of attacks by analyzing the complete trajectory of the agent, we focus on the first type, attacks on state perception, since the specific attacks under evaluation perturb the perceived state. In addition, state perception attacks are the most extensively studied in the adversarial field. To allow a thorough and comprehensive analysis of multiple attack strategies, we therefore concentrate on the behavior of the first adversary illustrated in Fig. 1.

3.2 Detection mechanisms

Figure 1 can also be used to categorize existing defense methods, since each one focuses on detecting one of these categories of attacks. However, many of the defense mechanisms employed in RL focus on identifying adversarial observations, neglecting the actions and the reward signals, which are also susceptible to adversarial manipulation. These defenses often leverage techniques inherited directly from supervised learning, where two main classes exist. The first category proposes training procedures that increase the robustness of a trained model against adversarial examples; defensive distillation [22] and adversarial training [23] belong to this category. Distillation is a process for transferring knowledge between different architectures, with the goal of reducing computational complexity. Defensive distillation allows the authors to reduce the effectiveness of adversarial examples: when it is used, an eightfold increase in the number of features that must be modified is required to disrupt the learning model [24].

In contrast, adversarial training incorporates adversarial examples into the collected data in order to train an architecture that detects novel perturbations [4]. With this method, the learned model loses a small amount of accuracy on clean examples, but gains robustness to adversarial examples. Although distillation and adversarial training achieve some success, they have two major drawbacks. First, training models that must be exposed to attacks in order to learn to discriminate them is not feasible in safety-critical systems, where a single attack can result in catastrophic consequences. Second, the trained model can easily be fooled by adversarial examples generated with attack methods not encountered during training. Therefore, it is crucial to establish efficient detection methods, applicable to diverse domains, that counteract adversarial samples even when they are novel to the system.

The second category focuses on detecting the statistical differences between adversarial examples and legitimate data [25], for example by using a binary classifier to detect adversarial examples [26]. In that case, the authors show that a binary classifier trained with three different adversarial crafting methods is able to detect almost all attacks; however, they do not guarantee the detection of attacks not present in the training data. Other statistically based works have tried to extend their defenses to more adversarial crafting methods. Because these defenses rely on statistics [11], they are not able to detect specific attacks: the collection of adversarial inputs must be large enough to generate a statistical bias capable of revealing the presence of unexpected behavior. Another work in this category develops a detector based on the Mahalanobis distance between the input variables [27]; unlike the Euclidean distance, the Mahalanobis distance takes into account the correlation between variables. However, to detect attacks in an RL task it is not only relevant whether an individual observation belongs to the training or test distribution. In an RL setting, it is also essential to check the coherence of the transition and the reward the agent receives from the underlying MDP. All these methods are effective in detecting single adversarial observations, but they ignore the sequential nature of an RL task, in which a sequence of states, actions, and rewards takes place. In short, they are blind to adversarial transitions. Such adversarial transitions may be composed of states that individually belong to the training or test distribution, which makes them even harder to detect for approaches that rely solely on observations [11, 25]. In contrast, we focus on the problem of detecting adversarial examples in sequential decision tasks, and for this purpose we exploit the coherence of the dynamics of the environment.

In this paper, we argue that adversarial attacks perturb the distribution of transitions and rewards in RL environments, just as such attacks perturb the distribution of the training data in a supervised classification task [11]. Therefore, the abrupt perturbation of the environmental dynamics allows us to identify adversarial attacks. To the best of our knowledge, adversarial attack detection in RL has never been tackled from this perspective. This perspective also opens the door to applying existing context detection approaches in an adversarial setting. Some of these approaches are based on change-point detection, i.e., on detecting a variation in the statistical features of the data. For instance, RL with context detection (RL-CD) [6] formalizes an RL algorithm that deals with non-stationary environments. Aiming to improve these results, other works classify transitions into two categories, known and unknown, depending on a quality measure [7]; if this metric exceeds a certain threshold, a context change is declared. However, RL-CD requires a set of parameters to be tuned for each problem, so this quality measure mainly depends on an ad hoc configuration. To avoid this complex tuning task, another work proposes an incremental CUSUM that builds a library of models and policies for each type of context [28]; its limitation is the complexity of computing the different policies and models. Furthermore, all of these approaches require a priori knowledge of the transition and reward dynamics, an assumption that rarely holds in the real world. Other works detect statistical differences between episodes, but they are unable to detect statistical changes between time steps [12, 29, 30]; such episodic detection techniques are of little use in an adversarial context, where attacks must be detected as soon as possible.

In contrast to all previous approaches, this paper proposes a model-free countermeasure for the rapid detection of adversarial attacks based on the abrupt changes they produce in the sequence of transitions. Furthermore, the analysis is not performed from a multivariate point of view but from a more compact and tractable univariate perspective: a multivariate analysis of transitions is not feasible due to the curse of dimensionality, since the high dimensionality of RL tasks prevents the use of multivariate methods [10]. Consequently, we propose a clustering-based approach that transforms the analysis of the sequence of transitions from a multivariate into a univariate problem.

4 Problem formulation

Let \(\mathcal M = (S,A,T,R)\) be an MDP in which an agent, which we will call the victim, has correctly learned a near-optimal policy \(\pi \). Assume that the victim interacts with \(\mathcal M\) using \(\pi \) in a free-of-attack context. At each time step, \(\pi \) produces experience tuples of the form \(\tau = \langle s,a,s',r \rangle \) derived from T and R. Then there exists a time step \(k > 0\) at which an adversary begins to perturb the states perceived by the victim, \(s \rightarrow s_{\delta }\), with \(s, s_{\delta } \in S\) and \(s \ne s_{\delta }\). In this new context, using the same policy \(\pi \), the victim will perceive experience tuples of the form \(\tau = \langle s,a,s'_{\delta },r \rangle \), where for each state-action pair \(\langle s,a \rangle \) the next state \(s'_{\delta }\) is not chosen according to T, but from a different distribution \(T_{\delta }\). Similar reasoning applies if the adversary attacks the actions or even the reward signal that the victim perceives from the environment. Thus, the adversary can perturb the functions T and R perceived by the victim, possibly at the same time. Hence, there exist time steps in which the trajectories sampled from the environment are associated with a new adversarial task \(\mathcal M_{\delta }=(S,A,T_{\delta },R_{\delta })\) that the adversary induces in the victim whenever it attacks with \(T_{\delta } \ne T\) and/or \(R_{\delta } \ne R\). In this scenario with two MDPs, \(\mathcal M\) and \(\mathcal M_{\delta }\), the problem of attack detection reduces to the problem of change-point detection, i.e., detecting the time steps at which the environment model changes. For example, at time k the environment model changes from \(\mathcal M\) to \(\mathcal M_{\delta }\), in which case the victim moves from a free-of-attack context \(\mathcal M\) to an under-attack context \(\mathcal M_{\delta }\). The context can also change from \(\mathcal M_{\delta }\) back to \(\mathcal M\), in which case the victim no longer receives attacks.

It is important to note that from a sequential analysis point of view, the problem of context detection reduces to detecting abrupt changes in streaming data. In RL, a data stream is a sequence of transitions as in (4):

$$\begin{aligned} \Gamma = \{\tau _{1},\tau _{2},\tau _{3},\dots ,\tau _{k}, \tau _{k+1}, \tau _{k+2},\tau _{k+3}, \dots \} \end{aligned}$$
(4)

where \(\tau _{t}\) is the t-th transition perceived by the victim at time step t. The change-point detection problem in the sequence \(\Gamma \) can be formulated as testing the null hypothesis \(\mathcal H_{0}\) against the alternative hypothesis \(\mathcal H_{1}\) [31]. \(\mathcal H_{0}\) asserts that at the current step t the model parameters remain the same, so the transition \(\tau _{t} \in \Gamma \) was derived using \(\mathcal M\). The alternative \(\mathcal H_{1}\) claims that at the current step t the model parameters change and \(\tau _{t} \in \Gamma \) is derived from \(\mathcal M_{\delta }\). In an adversarial setting, we obtain the expression in (5):

$$\begin{aligned} \left\{ \begin{array}{ll} \mathcal H_{0}: & \lnot attack \\ \mathcal H_{1}: & attack \\ \end{array} \right. \end{aligned}$$
(5)

where attack is a logical expression indicating whether an attack occurs. A powerful strategy for implementing the attack-point detection mechanism in (5) is to use a clustering-based approach, as described in Section 5. This approach aims to detect attacks on \(\mathcal M\) in a timely manner without introducing false positives or false negatives.

5 Clustering-based attack detection

A clustering-based approach to detecting attacks offers two distinct benefits compared to other change-point detection methods [6, 7, 10, 28]. First, it enables the development of a model-free attack detector that does not require approximating the transition functions T or \(T_{\delta }\) or the reward functions R or \(R_{\delta }\) [6, 7, 28]. Second, it converts a complex multivariate change-point detection problem into a simpler and more manageable univariate one, as explained in Section 5.1. The proposed strategy consists of two stages: first, we learn a partition of the transition space (Section 5.1); second, we employ the learned partition to identify adversarial attacks (Section 5.2).

5.1 Clustering of the transition space

In the initial stage of the proposed approach, the aim is to build a partition \(\mathcal C\) of the transition space from the transitions obtained with a policy \(\pi \) in a free-of-attack context. This first step is presented in Algorithm 1, which takes as inputs the number of episodes H, the number of steps per episode K, the policy \(\pi \), and the number of clusters k.

Algorithm 1 First step: clustering construction.

Algorithm 1 stores the transitions \(\tau = \langle s,a,s',r \rangle \) that the agent experiences with \(\pi \) in \(\mathcal T\) (line 7). After H episodes, it creates a k-means model \(\mathcal C=\{c_{0},c_{1},\dots ,c_{k}\}\) utilizing the transitions from \(\mathcal T\) as the training set (line 9). Here, \(c_{i}\) represents the i-th centroid with \(c_{i}=\langle s_{i},a_{i},s'_{i},r_{i} \rangle \). It generates a list, denoted as \(\beta =\{\beta _{0},\beta _{1},\dots ,\beta _{k}\}\), where each \(\beta _{i}\) represents the radius of the i-th cluster, as computed in (6).

$$\begin{aligned} \beta _{i} = \frac{1}{n_{i}} \sum _{j=1}^{n_{i}} d(\tau _{j},c_{i}) \end{aligned}$$
(6)

In (6), \(\tau _{j}\) refers to the j-th transition assigned to the i-th cluster and \(n_{i}\) is the number of instances in that cluster (line 9). The resulting partition of the transition space \(\mathcal C\) and the list of thresholds \(\beta \) are then used to detect adversarial attacks.
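Since Algorithm 1 is only shown as a figure, the sketch below reconstructs the first step from the description above. It is a minimal approximation, assuming a Gymnasium-style environment API and scikit-learn's KMeans as stand-ins for whatever implementation the authors used.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_partition(env, policy, H, K, k):
    """First step (Algorithm 1): collect free-of-attack transitions,
    fit k-means, and compute one radius beta_i per cluster as in (6)."""
    transitions = []
    for _ in range(H):                        # H episodes
        s, _ = env.reset()
        for _ in range(K):                    # at most K steps per episode
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            transitions.append(np.concatenate(
                [np.atleast_1d(s), np.atleast_1d(a),
                 np.atleast_1d(s_next), [r]]))
            s = s_next
            if terminated or truncated:
                break
    T = np.vstack(transitions)
    kmeans = KMeans(n_clusters=k, n_init=10).fit(T)     # partition C
    labels = kmeans.labels_
    dists = np.linalg.norm(T - kmeans.cluster_centers_[labels], axis=1)
    beta = np.array([dists[labels == i].mean() for i in range(k)])  # radii (6)
    return kmeans, beta
```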

It is worth noting that the sequence \(\Gamma \) depicted in (4) consists of multivariate transitions \(\tau \in \Re ^{2n + m + 1}\), so detecting change points in it would require a multivariate analysis. However, the partition \(\mathcal C\) produced by Algorithm 1 enables us to convert the multivariate analysis of \(\Gamma \) into a univariate analysis by simply considering the sequence \(\Gamma ^\mathcal {C}\) in (7).

$$\begin{aligned} \Gamma ^\mathcal {C} = \{d_{\tau _{1}},d_{\tau _{2}},d_{\tau _{3}},\dots ,d_{\tau _{k}}, d_{\tau _{k+1}}, d_{\tau _{k+2}},d_{\tau _{k+3}}, \dots \} \end{aligned}$$
(7)

Here, \(d_{\tau _{t}} \in \Re \) represents the Euclidean distance between the transition \(\tau _{t}\) in the t-th time step and its nearest centroid \(c_{i} \in \mathcal C\). This reduction enables detection of abrupt changes in the univariate sequence \(\Gamma ^\mathcal {C}\) for the attack detection problem, as elaborated in Section 5.2.
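A minimal sketch of this reduction, reusing a fitted scikit-learn KMeans model as the partition \(\mathcal C\) (an assumption, not the authors' code):

```python
import numpy as np

def to_univariate(kmeans, transitions):
    """Map the multivariate sequence Gamma in (4) to the univariate
    sequence Gamma^C in (7): the distance of each transition to its
    nearest centroid of the fitted k-means model."""
    X = np.asarray(transitions)
    nearest = kmeans.predict(X)
    return np.linalg.norm(X - kmeans.cluster_centers_[nearest], axis=1)
```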

5.2 Detection of adversarial attacks

Roughly speaking, the partition \(\mathcal C\) serves as a coarse representation of the trajectory followed by the agent in a free-of-attack scenario. Consequently, the series \(\Gamma ^\mathcal {C}\) makes it possible to recognize unusual deviations in the agent’s trajectory caused by adversarial attacks in an under-attack scenario. The proposed detection mechanism is presented in Algorithm 2.

Algorithm 2 Second step: context change detection.

Algorithm 2 takes as input the number of episodes H, the number of steps per episode K, the policy \(\pi \), the transition space partition \(\mathcal C\), and the list of thresholds \(\beta \). Given the partition of the transition space \(\mathcal C\) obtained in the previous step, we calculate the distance \(d_{\tau _{t}}\) between the perceived transition \(\tau _{t}\) at time step t and its nearest centroid \(c_{j}\in \mathcal C\) (line 8 in Algorithm 2). Then, Algorithm 2 compares this distance with the threshold \(\beta _{j}\). Consequently, the logical expression attack in (5) reduces to \(attack = [d_{\tau _{t}} > \beta _{j}]\) (line 9). If \(d_{\tau _{t}}\) is greater than \(\beta _{j}\), the transition is deemed adversarial, meaning that a context change caused by an adversarial attack has been inferred. This is a model-fitting methodology in which a change is acknowledged when a new transition does not fit into any of the current clusters. If the system detects a context shift from free-of-attack to under-attack, the agent ceases its execution (line 10).
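As with Algorithm 1, the detection loop is only shown as a figure; the sketch below reconstructs it from the description above, again assuming a Gymnasium-style environment and a scikit-learn KMeans model.

```python
import numpy as np

def detect_attacks(env, policy, kmeans, beta, H, K):
    """Second step (Algorithm 2): flag a context change when the distance
    of the perceived transition to its nearest centroid exceeds the radius
    of that cluster, i.e. attack = [d_tau > beta_j]."""
    for _ in range(H):
        s, _ = env.reset()
        for t in range(K):
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            tau = np.concatenate([np.atleast_1d(s), np.atleast_1d(a),
                                  np.atleast_1d(s_next), [r]]).reshape(1, -1)
            j = kmeans.predict(tau)[0]
            d_tau = np.linalg.norm(tau[0] - kmeans.cluster_centers_[j])
            if d_tau > beta[j]:               # adversarial transition detected
                return t                      # stop execution (line 10)
            s = s_next
            if terminated or truncated:
                break
    return None                               # no context change detected
```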

Figure 2 shows an example of how our detection system works. The figure displays transitions as blue and red markers. Upon creating the cluster model \(\mathcal C\), we obtain a set of centroids \(c_{j} \in \mathcal C\) (green stars) and establish the threshold \(\beta _{j}\) as the radius of each cluster. The radius \(\beta _{j}\) varies according to the training transitions associated with each cluster. Thus, instances within the threshold distance to their nearest centroid are identified as normal transitions (blue markers), while instances that exceed it (red dots) are classified as adversarial transitions.

Fig. 2 Representation of the partition of the transition space

It is crucial to recognize that the thresholds significantly affect the effectiveness of the proposed detection mechanism. In safety-related tasks, these values cannot be adjusted at deployment time by waiting for the system to receive enough attacks to distinguish what is and is not an attack, because even a single attack can have disastrous consequences. Therefore, we compute their values heuristically in the first step of the algorithm, as outlined in (6). These computed values of \(\beta \) enable Algorithm 2 to identify attacks efficiently at an early stage.

6 Evaluation

In this section, we present the results obtained with our proposed detection mechanism in a grid world and three well-known OpenAI environments: cartpole, mountain car, and acrobot. We design the experiments (i) to demonstrate that the transition-based detector proposed in this paper is able to capture the dynamics of RL tasks, in contrast to state-based detectors that focus only on single observations (Section 6.2); (ii) to validate the ability of the proposed approach to detect context changes from a free-of-attack to an under-attack context (Section 6.3); and (iii) to validate the proposed approach from a binary classification perspective, which allows us to classify transitions into two groups, adversarial and non-adversarial (Section 6.4). Before doing so, we first present the experimental setting (Section 6.1).

6.1 Experimental setting

We evaluate our clustering-based attack detection approach in four domains. Initially, we perform a proof of concept in a simple domain, a \(10 \times 10\) grid. Subsequently, we evaluate our methodology in three more sophisticated OpenAI domains, namely cartpole, mountain car, and acrobot. In all scenarios, we assume that the victim has previously learned a policy \(\pi \) in a free-of-attack setting. For this study, we implement a tabular Dyna-Q in the grid domain and use the DQN algorithm for the OpenAI environments. Table 1 displays the parameter values required for learning the policy \(\pi \). Each domain is described in terms of its state space \(\mathcal {S}\) and action space \(\mathcal {A}\), the algorithm employed to learn the policy \(\pi \), the learning rate \(\alpha \), and the maximum number of steps per episode K. We set the number of episodes to 1000 and the discount factor to 0.99 in all domains. It is important to emphasize that the domains employed in this study come from the OpenAI Gym framework, a widely recognized platform for the development of reinforcement learning algorithms. Consequently, the specific configuration of the state space, the action space, and the parameter K was not chosen by the authors; rather, it follows the predefined settings of these Gym implementations. With respect to the “Algorithm” column in Table 1, both the selection of the RL algorithm and network architectures and the parameter \(\alpha \) were informed by prior research that has demonstrated their effectiveness [19].

Table 1 Parameter setting of the learning process

After \(\pi \) converges, the policy is used to generate the transition space, which is later used to build the partition \(\mathcal C\) and compute the radius \(\beta \) of each cluster. The number of transitions \(|\mathcal {T} |\) used by k-means in Algorithm 1 is comparable across the four domains: we run 200 test episodes per domain to generate a similar number of instances, 25,000 transitions for each environment. We evaluate the number of clusters using the values k = 64, 256, 1024. We observe that using more than 1024 clusters does not yield a significant improvement and only complicates the creation of the partition in terms of memory and time.

Finally, we evaluate our detection approach against four cutting-edge strategies: the Uniform attack [16], the Strategically-Timed attack (ST attack) [18], the Q attack [14], and the Multi-Objective RL attack (MO attack) [19]. Unlike the Uniform attack, both the ST and Q attacks try to inflict as much damage on the victim as possible while reducing the number of attacks. The ST attack is triggered only when the discrepancy between the most and least favored action surpasses a predetermined threshold, and the Q attack computes the highest Q value for each state and launches an attack if this value surpasses a threshold. Hence, both the ST and Q attacks require a threshold, set to 0.3 and 1.4, respectively. Finally, we compare these attacks to the more advanced MO attack, which aims to undermine the victim’s policy in the long term. For this purpose, the MO attack pursues two goals: maximizing the damage to the victim’s policy and minimizing the cost of the attacks in order to avoid detection. To configure the type of attack, the MO attack also requires a weight, denoted by w, for its two optimization metrics. If \(w = 0\), the attack prioritizes cost optimization, whereas if \(w = 1\), the adversary prioritizes the damage inflicted on the victim. We use two versions of the MO attack: one that prioritizes causing maximum damage to the victim, with a weight of 0.7, and another that aims to minimize the cost of the attack, with a weight of 0.2. We would like to clarify that the chosen attack strategies are the state of the art in terms of minimizing the number of attacks to be executed and preventing the attacker from launching continuous attacks, which would make them easier to detect. Furthermore, in the context of RL, few attack strategies have been proposed to date, and the chosen ones represent a broad spectrum of strategies that allows for a comprehensive analysis of the detection capabilities of the proposed method.

In these attack strategies, an attack consists of adding noise, denoted \(\delta \), to the original state. The range of noise values used in this paper is described in Table 2, and within this range six different attacks are defined. Some are classified as minor perturbations, while others are more damaging. This analysis serves to determine whether our approach can detect both small and large perturbations.

Table 2 Range of \(\delta \) values, which represents the amount of noise added to the original state

The rationale for the specific parameter values in Table 2 is that they ensure a proper balance between the disruption inflicted on the victim's behavior policy and the associated attack cost: it is easy for our approach to identify high-level noise attacks that cause significant disturbance to the victim, while we acknowledge its limitations in detecting low-level noise attacks that, in fact, may not disturb the victim at all. The values in Table 2 were chosen to balance these two scenarios. Naturally, the more aggressive the perturbation, the greater the cost. However, only the multi-objective strategy determines the most effective attack for disrupting the victim while minimizing attack costs; the other methods randomly select from the predefined attacks, ignoring cost considerations.
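To make the trigger conditions of the threshold-based strategies concrete, the sketch below mirrors the descriptions above. The function names, the noise sampling, and the use of action preferences and Q values as plain arrays are illustrative assumptions, not the attack authors' code.

```python
import numpy as np

def should_attack_st(action_prefs, threshold=0.3):
    """ST attack: strike when the gap between the most and least
    preferred action exceeds a threshold."""
    return (np.max(action_prefs) - np.min(action_prefs)) > threshold

def should_attack_q(q_values, threshold=1.4):
    """Q attack: strike when the highest Q value exceeds a threshold."""
    return np.max(q_values) > threshold

def perturb(state, delta_low, delta_high, rng=np.random.default_rng()):
    """Add bounded noise delta to the perceived state (range as in Table 2)."""
    return state + rng.uniform(delta_low, delta_high, size=np.shape(state))
```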

6.2 Single observations vs. transitions

In this section, we demonstrate that detection mechanisms relying solely on single observations are incapable of detecting the impact of adversarial attacks on the environment’s dynamics. We note that single-observation-based detection methods are the most popular for detecting adversarial attacks [11, 25, 26, 27]. However, we show that such methods are ineffective in an RL context.

Figure 3 displays an example \(3\times 3\) maze in which an agent must navigate to the goal cell, marked in green and denoted by the letter G. The agent has learned an optimal policy \(\pi \), so in a scenario without interference it would reach its goal in four steps. In this example, we assume that an adversary begins injecting attacks at time step 160. This basic attack alters a single state: whenever the victim enters state \(s_{11}\), the adversary deceives it into perceiving state \(s_{01}\) instead of \(s_{11}\). As a result, the attack generates the adversarial transition \(\langle s_{21},up,s_{01},r \rangle \) rather than the legitimate transition \(\langle s_{21},up,s_{11},r \rangle \).

Fig. 3 (a) Deterministic \(3\times 3\) grid world, and (b) graphical representation of the sequences \(\Gamma ^\mathcal {C}\) of the state-based and transition-based detection mechanisms

In this scenario, we assume two detectors based on the approach outlined in Section 5: one operates on transitions, while the other operates solely on single states. For simplicity, both detectors have as many centroids in \(\mathcal C\) as there are legitimate transitions or states in the task. Figure 3(b) displays the \(\Gamma ^\mathcal {C}\) values for each detection mechanism, i.e., the distance of the transition or state at time step t from the closest centroid in \(\mathcal C\). We observe that the state-based detector does not exhibit any bias in \(\Gamma ^\mathcal {C}\) due to this attack, making it undetectable (red line in Fig. 3(b)). Indeed, if the adversarial transition \(\langle s_{21}, up,s_{01},r \rangle \) is analyzed statically, considering the states individually, both \(s_{21}\) and \(s_{01}\) belong to the task’s state distribution, so nothing appears anomalous. However, the transition-based detection mechanism (blue line in Fig. 3(b)) highlights that the transition \(\langle s_{21},up,s_{01},r \rangle \) violates the environmental dynamics. Hence, in a sequential decision-making task, it is crucial to examine not only whether the states belong to the original distribution of states [11, 23], but also the consistency of the environment’s dynamics.

Thus, defenses relying solely on recognizing known states would not identify this type of attack. Our approach, which analyzes the entire transition, is able to detect that this transition never occurred in a system free of attacks, and therefore that the agent is the victim of a falsification of the transitions it receives. In addition, analyzing the entire transition enables us to identify all the attacks on RL systems depicted in Fig. 1.

6.3 Context change detection: from free-of-attack to under-attack context

The objective of this section is to verify the capability of the proposed method in detecting a change in context from a free-of-attack scenario to an under-attack scenario. We assume that a policy \(\pi \) has already been learned, and Algorithm 1 employs this policy to generate the transition space partition \(\mathcal C\) and a list of thresholds \(\beta \) for each of the suggested domains.

Fig. 4 Evolution of the distances in the sequence \(\Gamma ^\mathcal {C}\) for each domain

Figure 4 illustrates how the \(\Gamma ^\mathcal {C}\) sequence for each domain evolves over time. To enhance the clarity of the illustration, we compute a simple moving average over 1000 transitions to smooth the trend in the four domains. The vertical dashed line denotes the point where the adversary begins injecting attacks, i.e., the transition from a free-of-attack context to an under-attack context; from that point on, a line is plotted for each attack type. A clear change in the trend of the \(\Gamma ^\mathcal {C}\) sequence due to the adversary’s attacks is evident. The MO attack causes the largest distortions in Fig. 4, as it produces transitions that are more dissimilar from the training set than the other attacks. The Uniform, ST, and Q attacks produce comparable results: they employ the same perturbation injection strategy, but the ST and Q attacks perform fewer attacks, only when their metric exceeds a threshold. Our method produces distinct signals whenever the agent is being attacked, in every test environment, because we analyze the sequential nature of the transitions rather than focusing strictly on the perceived states.
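For reference, the smoothing applied to Fig. 4 can be reproduced as follows; the window size comes from the text, while the implementation itself is an assumption.

```python
import numpy as np

def moving_average(distances, window=1000):
    """Simple moving average used to smooth the Gamma^C sequence in Fig. 4."""
    kernel = np.ones(window) / window
    return np.convolve(distances, kernel, mode="valid")
```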

The objective is to identify any alteration in the trend of the sequence \(\Gamma ^\mathcal {C}\) as soon as possible. Several approaches may be used to accomplish change-point detection, but we opted for the model-fitting method presented in Section 5.2. When the logical expression \(attack=[d_{\tau _{t}}>\beta _{j}]\) evaluates to true, transition \(\tau _{t}\) is flagged as an adversarial transition and execution is stopped. Table 3 displays the accuracy of our detector with this logical expression and \(k=1024\) clusters. This value of k adequately covers the training transitions in the stochastic grid domain, resulting in the detection of almost all attacks.

Table 3 Accuracy of attacks detected using the radius of each cluster as a threshold with 1024 clusters

The results in Table 3 reveal successful detection of the most harmful attacks generated by all the analyzed strategies. Notably, the mountain car domain poses the most difficult detection challenge, given that it has only two state variables and the perturbed transitions deviate only slightly from the original distribution. In contrast, the MO attack is detected in all domains with notable success. This strategy follows optimality criteria, leading to an attack policy that diverts the victim from its initial trajectory. These attacks move the victim through less-visited regions of the state space, providing more distinctive transitions for the analysis. Additionally, this approach launches unnecessary attacks that have no impact on the original transition, causing spurious anomalies that our detector is unable to identify.

There are two ways to improve the success rates: modifying the threshold parameter \(\beta \) to differentiate adversarial transitions more effectively, or increasing the number of clusters for these domains. In the next subsection, we aim to identify these attacks by modifying \(\beta \) separately.

6.4 Detection of adversarial transitions from a classification perspective

In this section, we evaluate the quality of our detection mechanism; specifically, we measure the classifier’s ability to distinguish between adversarial and non-adversarial transitions. We analyze accuracy as a function of the threshold \(\beta \) and the number of clusters. In this evaluation, we assume that the parameter \(\beta _{j}\) is the same for all clusters to simplify the process. To conduct a sensitivity analysis, we compute the ROC curve, which plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis. This visualizes the detection system's performance as the threshold \(\beta \) increases: each point on the ROC curve represents a different threshold at which our detector yields a different trade-off between true positive and false positive rates. Initially, when \(\beta \) is 0, all transitions are identified as adversarial. As \(\beta \) increases, more transitions are classified as non-adversarial, reducing false positives by correctly recognizing these transitions as non-adversarial; nevertheless, some attacks may then be missed. We therefore use ROC curves to determine the optimal value of \(\beta \). Eventually, \(\beta \) reaches its maximum value and the detection system classifies all transitions as non-adversarial. We perform an exhaustive search over all thresholds; an area under the curve (AUC) equal to 1 indicates that the detection system classifies all transitions correctly, whereas an AUC close to 0.5 indicates that the classifier behaves randomly.
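This sweep over \(\beta \) can be reproduced with standard tooling; the sketch below assumes a vector of per-transition distances \(d_{\tau }\) used as anomaly scores and binary ground-truth labels (1 = adversarial), both hypothetical inputs.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_for_detector(distances, is_adversarial):
    """Sweep a global threshold beta over the distance scores and
    report (FPR, TPR) pairs plus the area under the curve."""
    fpr, tpr, thresholds = roc_curve(is_adversarial, distances)
    return fpr, tpr, thresholds, auc(fpr, tpr)

# Example: a larger distance to the nearest centroid means "more suspicious".
fpr, tpr, thr, score = roc_for_detector(
    distances=np.array([0.1, 0.4, 2.3, 0.2, 3.1]),
    is_adversarial=np.array([0, 0, 1, 0, 1]))
print(f"AUC = {score:.2f}")
```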

Figure 5 contains all the results of our defense. Each row in Fig. 5 contains the results of one domain: grid, cartpole, mountain car, and acrobot, respectively. Each column groups the results by clustering configuration, with \(k = 64, 256, 1024\), respectively. In each plot we draw a line for each attack evaluated: MO attack with \(w=0.7\) (blue), MO attack with \(w=0.2\) (orange), Uniform (purple), Q attack (green), and ST attack (red). In addition to the ROC curve, we report the AUC score of each line in the legend, which allows us to identify the configuration with the best performance for each attack and domain.

As the number of clusters k increases, we obtain better results. Nonetheless, a larger k increases the complexity of building the transition partition, so we suggest that a higher value of k is not necessary. Focusing on the last column in Fig. 5 (\(k = 1024\)), our defense obtains promising results in all the domains, detecting all kinds of attacks, with AUC scores over 0.8 in the grid and cartpole domains. Nevertheless, the smooth variations in the transitions of mountain car are enough to mislead the victim, and from the point of view of our detector it is more difficult to distinguish instances perturbed by such smooth attacks. We also analyzed the signal generated in all domains, both free of attacks and under attack. In the acrobot domain, the signal generated in an adversary-free environment is higher than in the other domains; however, it increases noticeably once attacks are introduced. This confirms that, as in the other domains, the perturbed states differ from the original ones, and thus our detection system is capable of detecting these perturbations. Even so, we obtain worse AUC scores in this domain, especially under the Uniform and ST attacks, as observed in Fig. 5(i) and (l).

Fig. 5 ROC-curve experimentation results. Each plot contains the five attacks evaluated: MO attack with \(w=0.7\) (blue), MO attack with \(w=0.2\) (orange), Uniform (purple), Q attack (green), and ST attack (red). The Area Under the Curve (AUC) of each ROC curve is in the legend. Each column shows the results using 64, 256, and 1024 clusters, respectively

Fig. 6 Elbow method for our domains: (a) Grid, (b) Cartpole, (c) Mountain car, and (d) Acrobot

Obviously, as the number of clusters increases, the original transitions are better distinguished from the adversarial ones. This is visible in the grid domain in Fig. 5(c), where the number of clusters is close to the number of undisturbed transitions. In other words, with a low number of centroids some regular instances lie farther from their nearest centroid, as observed in Fig. 5(a); the detector may then classify these instances as attacks, i.e., as adversarial transitions, when they are in fact non-adversarial, increasing the number of false positives and lowering the AUC score. In general, MO attacks are easier to detect, especially the version that maximizes the harm to the victim. This attack learns the best perturbation to drive the victim to undesired states; once the victim is in an unknown region, the adversary stops attacking. However, the victim then generates transitions that differ from the training dataset, because these transitions were not among the instances used for learning. Our distance metric therefore produces larger values that exceed the threshold, and the detector classifies these transitions as attacks, increasing the ratio of false positives. Such false positives can be reduced by increasing the exploration rate during training to capture more trustworthy examples. After the MO attack, the Q attack is the next most detectable, with AUC scores over 0.8 in all the domains. This attack attempts to mislead the victim in states close to its goal; in contrast to MO, it does not drive the victim to undesired locations. Since it deviates the victim at the end of the episode, our detector can identify these anomalies correctly. The ST attack chooses an attack randomly when the difference between the best and the least preferred action exceeds a threshold. As with the MO attack, the adversary attacks at critical points, leading the victim to uncharted states, and as a result we obtain lower AUC scores for this attack. The same occurs with the Uniform attack: its AUC score is over 0.8 in the grid and cartpole domains, but the performance of our detector decreases in mountain car because some of the perturbations do not alter the victim's trajectory enough. This type of perturbation makes the Uniform strategy more difficult to detect than the rest of the evaluated attacks.

Table 4 Distortion generated for different k values in the proposed domains

In the acrobot domain, we achieve strong results, with a success rate exceeding 80%, in identifying the attacks generated by the strategies tested. Although the perturbations introduced in this domain are small, they are applied to a larger number of variables than in the other domains; for this reason, the anomaly detection results obtained in this domain are very high. Consequently, as the threshold \(\beta \) increases, the true positive rate and the false positive rate grow in parallel. Furthermore, our approach is even successful against the multi-objective attack strategy. Compared to the Q-value and ST strategies, the other attacks create more disruptions during the first steps of the episode. There are numerous additional transitions in the initial steps of episodes because of the distinct domain initializations, so our detection system creates a large number of clusters in the initial part of the domain and fewer toward the end. For this reason, attack strategies that launch a greater number of attacks at the start are more challenging to detect, since the original transitions bear a closer resemblance to the anomalous ones.

6.5 Ablation study of k

Selecting the appropriate value for k in the k-means clustering algorithm is a crucial step that significantly influences the outcome of the clustering process. This choice essentially determines how the data is grouped and organized into distinct clusters. If k is incorrectly chosen, it can lead to inaccurate representation of the data’s underlying structure. For instance, a very high k might result in an excessive number of clusters, making it difficult to extract meaningful insights. On the other hand, choosing a very low k could oversimplify the representation, overlooking valuable patterns and groupings in the data. Striking the right balance with k is essential to derive meaningful and actionable insights from the clustering process, which is vital for our purpose.

In our approach, dealing with the underfitting problem is especially important. An inaccurate representation of the data structure could lead us to misclassify typical transitions as anomalies, significantly inflating the count of false positives. Consequently, if a notable surge in false positives is observed, adjusting the value of k to generate additional prototypes is a viable approach. The objective is to represent the initial data more faithfully and achieve a more precise differentiation between normal transitions and anomalies.

To determine the optimal k for our detector, we apply the elbow method to select the number of centroids to use in the k-means clustering algorithm. This method finds the equilibrium between the number of clusters needed to reduce the distortion of the clustered points and the time required to execute the algorithm. The optimal k is typically selected at the elbow point, balancing the trade-off between model complexity and clustering quality; the method thus provides a visual aid and a quantitative basis for selecting a suitable number of clusters, enhancing the effectiveness and interpretability of the clustering results. The plots in Fig. 6 illustrate that, for our four domains, the optimal number of clusters according to the elbow method is always less than 1024. We opted for 1024 clusters because the distortion obtained at that value is sufficient to identify adversarial transitions effectively in all the domains covered in this paper, and computing the cluster set does not entail an excessively high time investment.
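A minimal sketch of this elbow analysis, where distortion is taken to be the k-means inertia; the candidate k values are those explored in the paper, while the rest is an assumption.

```python
from sklearn.cluster import KMeans

def elbow_curve(transitions, k_values=(64, 256, 1024)):
    """Fit k-means for several candidate k and record the distortion
    (sum of squared distances to the nearest centroid)."""
    return [KMeans(n_clusters=k, n_init=10).fit(transitions).inertia_
            for k in k_values]

# Plotting distortion against k and choosing the 'elbow' reproduces the
# analysis in Fig. 6; Table 4 averages this over 10 random initializations.
```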

In addition to these plots, Table 4 reports the average distortion and its standard deviation for these cluster counts. The table presents the results of 10 executions with randomly initialized centroids. Table 4 demonstrates that the distortion is significantly reduced when using 1024 clusters, regardless of the random initialization. In all cases, the standard deviation approaches zero, thereby guaranteeing robust detection of adversarial transitions, as demonstrated in Sections 6.3 and 6.4.

The graphs and tables presented in this section provide clear evidence that, within the chosen domains, the distortion noticeably diminishes once the number of clusters reaches 1024. Hence, we have opted to employ this specific value of k for our approach. Given that our proposed approach is designed to be applicable across various domains, a comparable analysis should be undertaken to ascertain the suitable value of k that prevents underfitting the data structure of each distinct domain.

7 Conclusions

This paper describes a novel clustering-based approach for detecting outlier transitions in RL. We evaluate our detector against four state-of-the-art attacks in a grid world and three well-known OpenAI domains. Next, we summarize the main conclusions of this paper:

(i) A novel framework for attack detection. The main contribution of this paper is a novel framework for attack detection based on the perturbations that attacks produce in the transition and reward dynamics the victim perceives from the environment. The experiments in Section 6.3 demonstrate that the victim goes from a free-of-attack context to an under-attack context whenever an adversary begins to inject attacks, and it is precisely this context change that must be detected to prevent catastrophic consequences. This novel perspective therefore opens the door to the application of change-point detection approaches that can identify changes in the transition and reward dynamics perceived by the victim.

(ii) A novel model-free cluster-based detector. Clustering the transition space makes it unnecessary to know the transition and reward dynamics of the environment and, at the same time, transforms a multivariate detection problem into a more compact and tractable univariate one, as described in Section 5. This is a significant advantage over other change-point detection approaches [6, 7, 28].

(iii) Exploitation of the coherence of the environmental dynamics. In contrast to the majority of previous works, which focus on detecting single adversarial observations, we analyze adversarial transitions. The evaluation in Section 6.2 demonstrates that detectors focused on transitions, instead of single observations, can capture the dynamics of a sequential decision task. Transition-based detectors therefore exploit the coherence of the transition and reward functions, which makes them better suited for adversarial attack detection in the context of RL.

(iv) Sensitivity of the proposed approach to the parameter \(\beta \). Obviously, the success of the proposed approach depends on the radius \(\beta \) defined for each of the clusters. Since it is mandatory to detect attacks as soon as possible in safety-critical domains, this paper suggests that this parameter should be heuristically predefined before system deployment. In this case, \(\beta \) is tuned as described in (6), but other initializations could be investigated. Notwithstanding the above, we have also analyzed the sensitivity of the detection approach to \(\beta \) using ROC curves (Section 6.4). ROC curves sweep over the \(\beta \) values and return the true and false positive rates for each one, so we can easily choose the best threshold after plotting the ROC curve. Such an a posteriori analysis could be interesting for non-safety-critical domains.

(v) A complex attack is more difficult to detect. As shown in the evaluation section, if the attack drives the victim to undesired or unexplored states, the victim generates a larger number of non-adversarial transitions that our detector classifies as adversarial. Therefore, the number of false positives increases and the performance of our detector decreases.

As future work, we plan to extend our approach to analyze the performance of other change-point detection techniques. We also intend to implement methods to reconstruct the attacked transitions so that the victim can continue its trajectory toward its goal.