1 Introduction

Multi-agent reinforcement learning (MARL) is a powerful paradigm for modeling and solving complex problems that involve multiple interacting agents [1,2,3]. It is widely applicable in fields such as autonomous vehicles [4], traffic light control [5], games [6,7,8], and intelligent energy grids [9], highlighting its immense potential.

Selective reincarnation is a recent advancement in MARL [10] that enables past computations to be reused [11]. This approach reduces overall computational cost and allows the system to adapt to changing environments by leveraging prior knowledge such as model weights, offline datasets, and other computational assets. Reincarnation is performed selectively based on criteria such as maximum and average returns, ensuring that the most effective agents are reincarnated. Recent studies utilizing reincarnation have demonstrated improvements in policy selection, integrated learning paradigms, and the reuse of prior computations, effectively enhancing the efficiency and effectiveness of policy gradient algorithms in practical settings [12,13,14].

Selective reincarnation has brought improvements, but it also introduces new vulnerabilities in the face of adversarial attacks. Observation poisoning is one such attack, capable of degrading the performance of well-trained neural network policies by perturbing the observation space [15]. This issue extends to crowdsensing systems, where false data can be injected to interfere with analysis results [16]. With the increasing prevalence of these attacks, it is urgent to develop robust models that can withstand such threats [17]. The safety of MARL systems is critical for their successful deployment in real-world scenarios such as autonomous driving and robotics [18]. Ignoring safety in RL can lead to catastrophic outcomes [19]. Recent studies [20,21,22,23] highlight the potential of safe RL to enhance the reliability of AI systems. This is particularly relevant in the context of MARL, where action [24], policy [25], and reward [26] poisoning attacks pose significant threats to system performance. Therefore, it is necessary [27,28,29] to test and evaluate the susceptibility of selective reincarnation to adversarial attacks, an essential step toward developing robust defenses and resilient algorithms [30,31,32,33,34,35,36].

Our research explores the impact of observation poisoning on decisions about which agents to selectively reincarnate within a MARL framework. We conducted extensive experiments using the HalfCheetah environment [37] and the Independent Deep Deterministic Policy Gradient (IDDPG) [38] algorithm, introducing triggers such as Gaussian noise addition, observation reversal, random shuffling, and scaling into the teacher dataset of the MARL system. To focus on selective reincarnation and its susceptibility to poisoning attacks, we use the ‘Good-Medium’ dataset [10], which comprises approximately the final 40% of the teachers’ experiences and is stored in the Off-the-Grid MARL framework [10, 39]. We observed significant influences on reincarnation decisions and quantified this influence using Kendall’s tau metric [40]. Our study provides valuable insights into the robustness of selective reincarnation in MARL against poisoning attacks, paving the way for more secure and reliable systems for real-world applications.

Along with the susceptibility of selective reincarnation, our findings comprehensively assess agent combinations under the various attack scenarios, offering insights into their vulnerability. For instance, the combination of back ankle (BA), front ankle (FA), and front knee (FK) is the most vulnerable, with an overall vulnerability score of 46%, calculated as the average vulnerability across multiple attack scenarios. In contrast, BA was the only agent to demonstrate resilience, with a vulnerability score of \(-10\%\), which suggests that it performs better under attack conditions than in the baseline scenario.

In the next section, we discuss prior work that aligns with our research, establishing the significance of studying the vulnerability and resilience of MARL’s selective reincarnation against adversarial attacks.

2 Related Work

Our research intersects three primary areas of prior work: “Selective Reincarnation in MARL” [10], “Adversarial Attacks in Deep RL, specifically Observation Poisoning”, and “Robustness Evaluation in MARL”.

2.1 Selective Reincarnation in MARL

MARL has garnered attention due to its ability to model complex interactions between multiple agents. Recent literature has explored the concept of reincarnation in MARL, which involves reusing prior computations based on past performance. This approach has shown significant benefits, such as improved computational efficiency and adaptability, as discussed in [11]. Transfer learning has been another area of interest in MARL: the study in [41] introduces an ontology-based approach to facilitate knowledge transfer across agents, which aligns with the broader theme of reusing knowledge. Moreover, [42] proposes methods to transfer knowledge from trained agents to newer ones, resulting in improved training efficiency and performance. Selective reincarnation, a form of reincarnation in which specific agents are chosen for reincarnation, has been found to improve learning efficiency in MARL by reusing previous computations across the selected agents [10]. In a cooperative, heterogeneous HalfCheetah MARL setup, it shows faster convergence and better returns than starting anew or full reincarnation. However, careful selection of agents to reincarnate is crucial, as an incorrect selection can yield inferior results. Our research focuses on an unexplored aspect: the robustness of selective reincarnation in MARL against poisoning attacks, specifically observation poisoning.

2.2 Adversarial Attacks in Deep RL: Observation Poisoning

The vulnerability of MARL systems, especially in the face of adversarial attacks, has been a pressing concern. The paper [43] discusses the challenges posed by dynamic environments and the need for continuous coordination among agents. This work underscores the importance of our research, which focuses on the vulnerabilities introduced by observation poisoning.

Observation poisoning, an adversarial tactic, can severely derail an agent’s learning by manipulating its observation space, thereby threatening RL system robustness. Research shows that even slight disruptions can significantly affect Deep RL agents, inducing them to adopt sub-optimal policies [44]. A two-stage optimization-based attack can efficiently introduce adversarial noise into RL, heavily impacting performance [45]. Backdoor attacks that embed triggers in deep RL agents hamper their performance [46]. Notably, a small amount of poisoned training data can lead to successful backdoor attacks, highlighting system vulnerabilities [47].

2.3 Robustness Evaluation in MARL

Ensuring that MARL systems are robust, especially as they are deployed in diverse and challenging environments, is crucial. Due to the varied landscape of adversarial attacks on MARL, particularly on input observations, it is essential to understand these threats and develop appropriate evaluation metrics and defense strategies.

Like other domains in machine learning [48,49,50], one standard attack on MARL systems is the Gaussian noise addition (GNA), which introduces subtle yet effective adversarial strategies by adding noise to agents’ observations. Attackers can mislead agents and adversely affect their learning trajectories through this manipulation. The significance of defending against Gaussian noise addition is emphasized in research such as [51], showcasing the profound impact of such a seemingly simple attack on MARL systems.

Shuffling and reversal attacks are also potent adversarial tactics that can drastically alter an agent’s perception of the environment without changing the actual state of the environment. These manipulations lead to sub-optimal learning outcomes. Multiple works, such as [16, 45, 52,53,54,55,56], highlight the importance of understanding and mitigating the risks associated with these shuffling-based attacks in MARL.

Although scaling attacks have been extensively studied in broader machine learning contexts [45, 48, 57,58,59], their impact on MARL systems remains less explored. These attacks manipulate the magnitude of agents’ observations, leading to skewed perceptions and decisions. Our work assesses the robustness of selective reincarnation in the face of diverse poisoning attacks in MARL, including scaling attacks. We focus on understanding how observation poisoning affects agent performance within a selectively reincarnated HalfCheetah framework and its implications for selecting agents for reincarnation. Our exploration sheds light on potential vulnerabilities, offering valuable insights into the challenges and susceptibility of selective reincarnation in MARL to poisoning attacks.

In summary, it is crucial to defend against various poisoning attacks on MARL systems, including Gaussian noise addition, shuffling and reversal attacks, and scaling attacks. Our in-depth analysis provides valuable findings into the robustness of selective reincarnation against these diverse poisoning attacks in MARL. These insights not only highlight potential vulnerabilities but also serve as a foundational resource for researchers exploring defense mechanisms.

3 Methodological Foundations and Evaluation Metrics

In order to provide a comprehensive understanding of the impact of observation poisoning on MARL systems, this section introduces the key methodologies and statistical framework we employ. From the foundational principles of agent behavior to specialized metrics such as Kendall’s Tau for assessing ranking changes and “Overall Vulnerability” for assessing the vulnerability of agent combinations, we lay down the theoretical groundwork for our experimental setup and analysis.

3.1 Independent Deep Deterministic Policy Gradient (IDDPG)

In the field of MARL, especially in the context of fully cooperative MARL with shared rewards, agents frequently operate within a framework known as the Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [60]. This framework is described by a tuple \( M = (N, S, \{A_i\}, \{O_i\}, P, E, \rho _0, r, \gamma ) \). Here, \( N \) signifies the set of \( n \) agents, and \( s \in S \) represents the true state of the environment. Each agent \( i \) is privy to partial observations from the environment, determined by an emission function \( E(o_t|s_t, i) \). At every timestep \( t \), agents receive local observations \( o_i^t \) and decide on actions \( a_i^t \), culminating in a joint action \( a_t \). Agents maintain an observation history \( o_i^{0:t} \), which influences their policy \( \mu _i(a_i^t|o_i^{0:t}) \). The environment then transitions based on \( P(s_{t+1}|s_t, a_t) \) and allocates a shared reward via \( r(s, a) \).

Given the Dec-POMDP framework, the inherent challenge lies in the decentralized nature of the environment, where each agent only has access to partial observations. Thus, the primary objective is to find a joint policy \( \pi = \{\pi _1, \pi _2,..., \pi _n\} \) for all agents \( i \in N \), such that the expected return \( G_i = \sum _{t=0}^{T} \gamma ^t r_i^t \) for each agent \( i \), following its policy \( \pi _i \), is maximized, taking into account the policies \( \pi _j \) of all other agents \( j \ne i \).

To address this challenge, the IDDPG [38, 61] methodology is introduced. For each agent \( i \), a Q-function \( Q_i^\theta (o_i^{0:t}, a_i^t) \) is established, conditioned on its observation history and action. Concurrently, a policy network \( \mu _i^\phi (o_i^t) \) is set up for each agent, mapping observations to actions. The Q-function is trained to minimize the temporal difference (TD) loss:

$$\begin{aligned} L_Q(D_i, \theta _i) = E_{(o_i^t, a_i^t, r_t, o_i^{t+1}) \sim D_i} \left[ \left( Q_i^{\theta _i}(o_i^t, a_i^t) - r_t - \gamma \hat{Q}_i^{\theta _i}(o_i^{t+1}, \hat{\mu }_i^{\phi _i}(o_i^{t+1})) \right) ^2 \right] \end{aligned}$$

where \( \hat{Q}_i^{\theta _i} \) and \( \hat{\mu }_i^{\phi _i} \) are target versions of the Q-network and policy network, respectively. The variable \( D_i \) denotes the experience replay buffer of agent \( i \), which stores past experiences as tuples \( (o_i^t, a_i^t, r_t, o_i^{t+1}) \); these experiences are sampled to train the Q-function by minimizing the temporal difference (TD) loss. Simultaneously, the policy network is optimized to predict the action that maximizes the Q-function, which amounts to minimizing the policy loss:

$$\begin{aligned} L_\mu (D_i, \phi _i) = E_{o_i^t \sim D} \left[ -Q_i^{\theta _i}(o_i^t, \mu _i^{\phi _i}(o_i^t)) \right] \end{aligned}$$

IDDPG is fundamental to our analysis, setting the stage for evaluating observation poisoning within MARL systems; practical implementation details follow in Sect. 4.2.
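To make the two objectives concrete, the following is a minimal per-agent sketch of these losses in PyTorch. It is illustrative only: the network objects, batch layout, and function names are hypothetical placeholders, not the authors' implementation.

```python
import torch

def td_loss(q_net, target_q_net, target_policy, batch, gamma=0.99):
    """TD loss L_Q(D_i, theta_i) for a single agent."""
    obs, act, rew, next_obs = batch  # tuples (o_i^t, a_i^t, r_t, o_i^{t+1}) sampled from D_i
    with torch.no_grad():
        next_act = target_policy(next_obs)                       # target policy action
        target = rew + gamma * target_q_net(next_obs, next_act)  # bootstrapped TD target
    return ((q_net(obs, act) - target) ** 2).mean()

def policy_loss(q_net, policy, batch):
    """Policy loss L_mu(D_i, phi_i): push the policy toward Q-maximizing actions."""
    obs = batch[0]
    return -q_net(obs, policy(obs)).mean()
```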

3.2 Observation Poisoning Techniques

In our HalfCheetah MARL framework [37], agents acquire observations from the environment, represented as a \(d \times d\) matrix \( s \) with \( d = 10 \), which defines the dimensionality of the observations. The following perturbation techniques are applied to these observations in the teacher dataset; although some of them introduce only minor perturbations, their impact on the learning process proved significant:

  • Gaussian Noise Addition: Gaussian noise is added to the observations to induce randomness and foster diverse learning experiences. Each element of the observation matrix s is independently perturbed:

    $$\begin{aligned} s'_{ij} = s_{ij} + \epsilon _{ij}; \quad \epsilon _{ij} \sim \mathcal {N}(0, \sigma ^2), ~\sigma = 0.01 \end{aligned}$$

    The perturbations, despite being minimal, have significant effects on the learning process.

  • Observation Reversal: The order of rows in the matrix s is inverted, disrupting the potential learning from crucial environmental state information:

    $$\begin{aligned} s'_{ij} = s_{d-i+1,j}, \quad \forall i, j \in \{ 1, 2, \ldots , d \} \end{aligned}$$
  • Random Shuffling: Rows of s are randomly shuffled, causing disarray similar to the observation reversal technique:

    $$\begin{aligned} s'_{ij} = s_{\pi (i),j}, \quad \pi \in \{ \pi : \{ 1, 2, \ldots , d \} \rightarrow \{ 1, 2, \ldots , d \} \mid \pi \text { is bijective} \} \end{aligned}$$
  • Scaling: Elements of s are scaled by a constant factor \(\alpha \) to subtly modify the observation values, potentially affecting agent learning:

    $$\begin{aligned} s'_{ij} = \alpha \times s_{ij}; \quad \alpha = 1.1 \end{aligned}$$

Once these poisoning techniques are applied to the teacher dataset used for training the reincarnating agents, those agents receive the perturbed observations, which affects their learning process as implemented with IDDPG.
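As a concrete illustration, the following is a minimal NumPy sketch of the four perturbations applied to a single observation matrix; the array is random placeholder data, not an actual teacher-dataset observation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
s = rng.normal(size=(d, d))  # hypothetical observation matrix from the teacher dataset

# Gaussian noise addition: s'_ij = s_ij + eps_ij, eps_ij ~ N(0, sigma^2) with sigma = 0.01
noised = s + rng.normal(0.0, 0.01, size=s.shape)

# Observation reversal: invert the row order, s'_ij = s_{d-i+1, j}
reversed_obs = s[::-1, :]

# Random shuffling: apply a random bijective row permutation pi
shuffled = s[rng.permutation(d), :]

# Scaling: multiply every element by alpha = 1.1
scaled = 1.1 * s
```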

3.3 Rank Correlation Analysis

We evaluated the impact of observation poisoning on the performance of reincarnating agents using Kendall’s tau correlation coefficient [40], which assesses the association between pre- and post-poisoning performance rankings. The performance measure can be either average returns or maximum returns.

We begin by defining the sign function \( \text {sgn}(z) \) as follows:

$$\begin{aligned} \text {sgn}(z) = {\left\{ \begin{array}{ll} +1 &{} \text {if } z > 0 \\ -1 &{} \text {if } z < 0 \\ 0 &{} \text {if } z = 0 \end{array}\right. } \end{aligned}$$
(1)

where \( z \) represents the difference in ranks between two data points. Using Eq. 1, we define Kendall’s \(\tau \) as shown in Eq. 2:

$$\begin{aligned} \tau = \frac{2}{n(n-1)} \sum _{i < j} \text {sgn}(x_i - x_j) \cdot \text {sgn}(y_i - y_j) \end{aligned}$$
(2)

In Eq. 2, n is the total number of ranked data points, while \(x_i\) and \(y_i\) represent the ranks of individual data points in the pre-and post-poisoning rankings, respectively.

A \(\tau \) value close to 1 signifies minimal ranking changes due to poisoning, while a value near \(-1\) implies significant ranking disruption. This measure offers a quantitative assessment of the influence of poisoning techniques, validating the statistical significance of performance changes caused by poisoning.
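For illustration, a minimal sketch of this calculation in Python is given below; it follows Eq. 2 directly, and SciPy’s kendalltau returns the same value when there are no ties. The rank lists shown are hypothetical.

```python
from itertools import combinations
from scipy.stats import kendalltau

def kendall_tau(x, y):
    """Kendall's tau exactly as in Eq. 2 (no tie correction)."""
    n = len(x)
    sgn = lambda z: (z > 0) - (z < 0)
    concordance = sum(sgn(x[i] - x[j]) * sgn(y[i] - y[j])
                      for i, j in combinations(range(n), 2))
    return 2 * concordance / (n * (n - 1))

pre_poisoning = [1, 2, 3, 4, 5, 6]    # hypothetical pre-poisoning ranks
post_poisoning = [3, 1, 2, 6, 4, 5]   # hypothetical post-poisoning ranks
print(kendall_tau(pre_poisoning, post_poisoning))
tau, _ = kendalltau(pre_poisoning, post_poisoning)
print(tau)  # matches the value above when there are no ties
```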

3.4 Quantifying Vulnerability of Agent Combinations

In addressing the challenge of adversarial threats in selective reincarnation in MARL systems, this section introduces a mathematical quantification method to assess the ‘Overall Vulnerability’ of agent combinations to observation poisoning attacks, with a focus on evaluating the vulnerability of reincarnating agent combinations.

To quantify the concept of “Overall Vulnerability”, we employed the following mathematical formula:

$$\begin{aligned} V_c = \frac{100}{8} \sum _{i \in A} \left( \frac{{B_{\text {max}, c} - P_{\text {max}, i, c}}}{B_{\text {max}, c}} + \frac{{B_{\text {avg}, c} - P_{\text {avg}, i, c}}}{B_{\text {avg}, c}} \right) \end{aligned}$$

where \( V_c \) represents the Overall Vulnerability for a given agent combination \( c \). \( A \) denotes the set of considered attacks, specifically: {noise, reversal, scaling, shuffling}. The terms \( B_{\text {max}, c} \) and \( B_{\text {avg}, c} \) denote the maximum and average returns for the base case with agent combination \( c \), respectively. Similarly, \( P_{\text {max}, i, c} \) and \( P_{\text {avg}, i, c} \) represent the maximum and average returns for the \(i\)-th poisoned case with agent combination \( c \), respectively. For a detailed explanation of the ‘maximum return’ and ‘average return’ metrics, refer to Sect. 4.3.
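To make the computation explicit, the following is a minimal sketch of \( V_c \) for one agent combination; the return values are hypothetical placeholders, not results from our experiments.

```python
ATTACKS = ("noise", "reversal", "scaling", "shuffling")

def overall_vulnerability(base_max, base_avg, poisoned):
    """Overall Vulnerability V_c; `poisoned` maps attack -> (max_return, avg_return)."""
    total = 0.0
    for attack in ATTACKS:
        p_max, p_avg = poisoned[attack]
        total += (base_max - p_max) / base_max + (base_avg - p_avg) / base_avg
    return 100.0 / 8.0 * total  # average relative degradation, expressed in percent

v_c = overall_vulnerability(
    base_max=5000.0, base_avg=3000.0,
    poisoned={"noise": (4500.0, 2700.0), "reversal": (4000.0, 2400.0),
              "scaling": (4800.0, 2900.0), "shuffling": (4200.0, 2500.0)},
)
print(f"Overall Vulnerability: {v_c:.1f}%")
```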

4 Experimental Setup

Building on the approaches and statistical framework discussed earlier, this section provides an exhaustive exposition of our experimental design, including the specifics of IDDPG training. Our focus is to scrutinize the influence of observation poisoning on MARL systems, particularly in the HalfCheetah environment. Here, we detail the HalfCheetah setup, elaborate on the performance metrics, and outline the step-by-step procedure followed, thereby setting the stage for the forthcoming results and analysis.

4.1 The HalfCheetah Setup

The HalfCheetah environment [62, 63], part of the Mujoco simulation software suite [64], simulates a two-legged robot designed for rapid forward motion. While Mujoco’s default configuration perceives the HalfCheetah as a singular entity, our adaptation [37], based on Multi-Agent Mujoco (MaMujoco), views it as a system composed of six cooperative agents: the back ankle (BA), back knee (BK), back hip (BH), front ankle (FA), front knee (FK), and front hip (FH). For a more intuitive understanding, Fig. 1 visually depicts these agents and their interconnections. These agents work collectively toward the objective of forward progression. We selected this setup for its relevance to practical robotics, complex agent dynamics, continuous action domains, and compatibility with the MARL framework. This configuration is also well-suited for the selective reincarnation process, which utilizes a dataset of experiences derived from tabula rasa (from-scratch) training.

Fig. 1: The illustration of HalfCheetah as a collection of six different agents

4.2 IDDPG Training Details

Having outlined the theoretical foundations of IDDPG in Sect. 3.1, we now detail its practical implementation [10] in our multi-agent setup. Each agent employs an individual policy network for action decision-making given an environment observation. Concurrently, each agent is equipped with a critic network tasked with estimating the value of taking a specific action in a given state.

To ensure stability during the training process, target networks are utilized for both the policy and the critic. Separate optimizers are employed for the policy and critic networks to facilitate an effective and efficient training process. During the exploration phase, each agent is programmed to interact with the environment for a default of 10,000 timesteps. Agents designated for selective reincarnation undergo additional training on a specialized teacher dataset for 200,000 timesteps.

The hyperparameters for the experiments include a default batch size of 32 for training, a discount factor of 0.99, a lambda value of 0.6 for temporal-difference bootstrapping, and a noise standard deviation of 0.1 for exploration.
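For reference, these settings can be gathered into a single configuration; the key names below are illustrative placeholders rather than the framework’s actual API.

```python
# Hyperparameters of the IDDPG experiments (values from the text above;
# key names are hypothetical, not tied to any specific framework).
iddpg_config = {
    "batch_size": 32,                       # default training batch size
    "discount": 0.99,                       # discount factor gamma
    "td_lambda": 0.6,                       # lambda for temporal-difference bootstrapping
    "exploration_noise_std": 0.1,           # std of exploration noise
    "exploration_timesteps": 10_000,        # default environment interaction per agent
    "teacher_training_timesteps": 200_000,  # additional training on the teacher dataset
}
```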

4.3 Performance Metrics

In our evaluation, two primary metrics were employed to assess the performance of the MARL system in the HalfCheetah environment: the ‘maximum return’ and the ‘average return’. These metrics offer a systematic evaluation of both the system’s peak performance capability and its overall consistency.

  • Maximum Return: The ‘maximum return’, denoted as \( R_{\text {max}} \), is the highest seed-averaged return attained during the training process. If \( R(t, s) \) represents the return at timestep \( t \) for seed \( s \), the maximum return is formulated as:

    $$\begin{aligned} R_{\text {max}} = \max _{t \in [1, T]} \left( \frac{1}{S} \sum _{s \in \mathcal {S}} R(t, s) \right) \end{aligned}$$

    where \( T \) is the total number of timesteps, \( \mathcal {S} \) is the set of seeds, and \( S = |\mathcal {S}| \) is the number of seeds.

  • Average Return: The ‘average return’, denoted as \( R_{\text {avg}} \), offers a measure of the consistent performance of the system, with the return averaged over all timesteps and seeds. It is defined as:

    $$\begin{aligned} R_{\text {avg}} = \frac{1}{TS} \sum _{t \in [1, T]} \sum _{s \in \mathcal {S}} R(t, s) \end{aligned}$$

We utilized these metrics to evaluate the HalfCheetah MARL system within the Dec-POMDP framework [60] described in Sect. 3.1. The aim was to determine a joint policy that maximizes each agent’s return in relation to other agents’ policies. The effects of observation poisoning techniques on these metrics were then examined.
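As a small illustration, both metrics can be computed from a timestep-by-seed array of returns; the array below is random placeholder data, assuming returns are logged as \( R(t, s) \) over \( T \) timesteps and \( S \) seeds.

```python
import numpy as np

# Hypothetical (T, S) array of returns R(t, s): T timesteps x S seeds
returns = np.random.default_rng(0).normal(loc=3000.0, scale=300.0, size=(200, 5))

r_max = returns.mean(axis=1).max()  # R_max: seed-averaged return, maximized over timesteps
r_avg = returns.mean()              # R_avg: average over all timesteps and seeds
print(r_max, r_avg)
```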

4.4 Experimental Procedure

Inspired by [10], we developed our own variant of the procedure (Sect. 4.4.1) from scratch and used the ‘Good-Medium teacher dataset’ of stored experiences based on their guidelines. We utilize the multi-agent Mujoco environment with the “HalfCheetah” configuration, in which multiple agents interact with each other in a simulated physical space. The primary algorithm used for training the agents is IDDPG, a variant of the Deep Deterministic Policy Gradient algorithm tailored for multi-agent settings.

4.4.1 Procedure for Training, Dataset Poisoning, Agent Reincarnation, and Performance Evaluation

In this subsection, we present the procedure that encompasses the training, dataset poisoning, reincarnation, and performance evaluation of the agents. The procedure is methodically laid out in seven distinct steps. Subsequent to the enumeration, a more in-depth description of each step provides clarity and insight into the methodology and processes involved. The steps are:

  1. Initial Training of Agents and Creation of Teacher Datasets.
  2. Observation Poisoning.
  3. Enumeration of Agent Combinations for Reincarnation.
  4. Retraining of the System for Each Combination.
  5. Performance Evaluation.
  6. Grouping and Sorting of Reincarnating Agents Based on Metric Value.
  7. Kendall’s Tau Rank Calculation.

Description of Each Step in Procedure 4.4.1:

  1. Initial Training of Agents and Creation of Teacher Datasets: The six agents are initially trained on the HalfCheetah environment using IDDPG over 1 million training steps. The training experiences are saved as the teacher dataset [39].

  2. Observation Poisoning: The teacher dataset is manipulated through triggers such as Gaussian noise addition, observation reversal, shuffling, and scaling. This “observation poisoned” dataset is later given to the reincarnating agents.

  3. Enumeration of Agent Combinations for Reincarnation: All \(2^6 = 64\) subsets of agent combinations for reincarnation are enumerated (see the sketch after this list). Each subset accesses its corresponding offline teacher dataset during retraining on the HalfCheetah environment.

  4. Retraining of the System for Each Combination: Each agent combination is trained for 200k timesteps with access to the teacher data; after the teacher data is removed, training continues for an additional 50k timesteps on student data alone. This process is repeated over five seeds (0–4).

  5. Performance Evaluation: Using the ‘maximum return’ and ‘average return’ metrics explained in the performance metrics subsection, we assessed the system’s performance and its speed of convergence, respectively.

  6. Grouping and Sorting of Reincarnating Agents Based on Metric Value: Reincarnating agent combinations are grouped by the number of agents involved; within each group, they are sorted in descending order of either maximum or average return.

  7. Kendall’s Tau Rank Calculation: Kendall’s tau is calculated to assess the degree of correspondence between the orderings of agent combinations under the performance metrics; it measures the impact of poisoning on the ordering of reincarnating agents.
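Referenced in Step 3 above, the following is a minimal sketch of how the 64 reincarnation subsets can be enumerated; the agent abbreviations follow Sect. 4.1, and the function name is illustrative.

```python
from itertools import chain, combinations

AGENTS = ["BA", "BK", "BH", "FA", "FK", "FH"]

def all_subsets(agents):
    """Every subset of agents, from no reincarnation to full reincarnation."""
    return list(chain.from_iterable(
        combinations(agents, r) for r in range(len(agents) + 1)))

subsets = all_subsets(AGENTS)
print(len(subsets))  # 2^6 = 64 combinations, each retrained with its teacher dataset
```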

5 Results, Analysis and Discussion

We carried out exhaustive experiments to study the impact of observation poisoning on the HalfCheetah MARL system, specifically focusing on reincarnation decisions. Kendall’s tau correlation coefficient, a non-parametric statistic, was used to quantify how each poisoning technique alters the ranking of reincarnation decisions based on maximum and average returns.

Tables 1 and 2 summarize the experiment outcomes, each showcasing the performance metrics for different reincarnating agent combinations under diverse observation poisoning attacks. Each configuration represents a subset of the six agents. Table 1 focuses on maximum returns, while Table 2 displays average returns. The reported values are averages over five seeds, with standard deviations shown to indicate result variability. To avoid excessive listing, we present only the best and worst-performing configurations for each number of reincarnating agents, illustrating the performance range for each attack scenario.

Fig. 2: Bar chart depicting the overall vulnerability percentages of different reincarnating agent combinations under the four observation poisoning attacks

The bar chart in Fig. 2 visualizes the overall vulnerability of different reincarnating agent combinations under the four poisoning attacks. Notably, the agent combination ‘BA’ exhibits the lowest vulnerability, while ‘BA, FA, FK’ shows the highest. This analysis lends empirical support to the notion that some agent combinations are inherently more robust against observation poisoning attacks than others.

The findings from Table 1 spotlight the considerable variation in the performance of different agent combinations when subjected to diverse types of poisoning attacks. For instance, the BA, BK, BH, FK, FH configuration, despite its superior performance in the base case, i.e., without poisoning (\(5441.5 \pm 289.4\)), fails to keep its place in a noisy environment, being outperformed by the BA, BK, FA, FK, FH configuration (\(4923.9 \pm 156.8\)). In the reversal attack, even the BA, FA configuration, with a lower maximum return (\(4687.7 \pm 588.0\)), outperforms it. Interestingly, the fully reincarnated configuration stands robust against the scaling attack with a return of \(4352.9 \pm 164.8\). These results suggest that the agents’ adaptability depends heavily on the nature of the attack, emphasizing the need for versatile combinations.

An analysis of the average returns, detailed in Table 2, showed that some agent combinations maintained more consistent performance while others exhibited high variability, as is apparent from the associated standard deviations. Table 2 reveals that the BA, FA pairing yields the highest average returns under the ‘Reversal Attack’, albeit with the highest variance at a standard deviation of 566.4. Meanwhile, the FK, FH pairing under the ‘Base Case’ offers high average returns with less volatility, indicated by a standard deviation of 297.1. This suggests the superior stability of the ‘Base Case’ FK, FH pairing, despite BA, FA’s high returns under the ‘Reversal Attack’, highlighting the trade-off between performance and consistency across configurations.

The fully reincarnated configuration, representing the scenario of all agents reincarnated simultaneously, provided an intriguing benchmark. It performed well under the base case and noise addition attacks but struggled against reversal and random shuffling attacks. It suggests that these latter types of attacks particularly disrupt inter-agent cooperation and coordination.

This data provides comprehensive insights into the resilience of each agent to observation poisoning and their ability to recover through reincarnation; by examining both the peak and average performance of different agent configurations under attack scenarios, we highlight the potential vulnerability of reincarnation decisions in HalfCheetah MARL systems to observation poisoning.

Table 1 Maximum return values for best and worst runs of reincarnated agents with & without observation poisoning
Table 2 Average return values for best and worst runs of reincarnated agents with & without observation poisoning

The Overall Vulnerability chart serves as a supplementary guide to the tables, providing a quick, at-a-glance view of which reincarnating agent combinations are most and least vulnerable to poisoning attacks. This added layer of analysis aids in the strategic decision-making process for configuring agents in different attack scenarios.

Table 3 Comparison of poisoning techniques based on evaluation metrics for different numbers of reincarnating agents

Table 3 offers a comparative analysis of various poisoning techniques using Kendall’s Tau correlation coefficient, measuring their impact on performance rankings across diverse agent reincarnations. The initial ranking positions ‘FH’ first, followed by ‘FK’, ‘FA’, ‘BK’, ‘BH’, and ‘BA’. After the shuffling attack, the order alters significantly, with ‘BH’ leading, then ‘BA’, ‘BK’, ‘FA’, ‘FH’, and ‘FK’, a shift evidenced by a Kendall’s Tau value of \(-0.733\), signaling a substantial ranking reshuffle.

These trends are also evident in Table 3, where techniques like noise addition and reversal significantly impact rankings, especially with a single reincarnating agent. However, as the number of reincarnating agents increases, the impact lessens. Nevertheless, the scaling technique consistently influences rankings regardless of the number of reincarnating agents. This drastic ranking effect underscores poisoning’s disruptive potential in reincarnating agent selection, reinforcing the need for robust strategies to counter such disturbances.

Table 3 crucially informs reincarnating agent selection decisions by illuminating poisoning techniques’ impacts on rankings, helping identify susceptible and sturdy agents. The table thus becomes a vital strategic resource, providing extensive data on the varied poisoning techniques’ efficacy and impact across diverse agent combinations.

In summary, our findings reveal that selective reincarnation remains vulnerable, even in the presence of subtle changes in observations that affect all agents uniformly. Notably, certain agent combinations exhibit high susceptibility to poisoning attacks, while others display remarkable resilience. This empirical data provides valuable insights essential for informed decision-making regarding reincarnation strategies.

6 Conclusion and Future Work

This study has uncovered the critical role of agent combinations and reincarnation scenarios in determining the resilience of HalfCheetah MARL systems to different observation poisoning attacks. This highlights the importance of incorporating adaptive strategies and maintaining a balance between performance and consistency in system design.

Our research specifically focuses on applying basic, environment-independent triggers in the context of selective reincarnation in MARL. We have demonstrated that selective reincarnation, when faced with observation poisoning, exhibits varying levels of vulnerability depending on the agent combination, underscoring the necessity for targeted defenses. Through rigorous experimentation, we have quantitatively measured the impact of poisoning using Kendall’s Tau metric, thereby providing a statistical foundation for assessing the robustness of MARL systems against adversarial threats. Our findings revealed that certain agent combinations are inherently more resilient, offering pathways to enhance system security and stability.

Our future research will focus on broadening the study of observation poisoning attacks in the selective reincarnation of MARL by testing various poisoning methods and advanced triggers in a variety of environments, including multi-agent Humanoid and HumanoidStandup. We aim to understand how selective reincarnation performs in cooperative, competitive, and mixed environments, assess the resilience of multi-agent systems to advanced poisoning attacks, identify and understand the effects of poisoning, and gain a comprehensive understanding of the strengths and limitations of current methods. This effort will inspire the creation of more adaptable and resilient multi-agent system strategies. Through this thorough analysis, we will highlight areas for improvement and contribute to the development of more effective defense strategies in the field.