
Autonomous Agents and Multi-Agent Systems, Volume 33, Issue 1–2, pp 216–274

A probabilistic argumentation framework for reinforcement learning agents

Towards a mentalistic approach to agent profiles
  • Régis Riveret
  • Yang Gao
  • Guido Governatori
  • Antonino Rotolo
  • Jeremy Pitt
  • Giovanni Sartor

Abstract

A bounded-reasoning agent may face two dimensions of uncertainty: firstly, the uncertainty arising from partial information and conflicting reasons, and secondly, the uncertainty arising from the stochastic nature of its actions and the environment. This paper attempts to address both dimensions within a single unified framework, by bringing together probabilistic argumentation and reinforcement learning. We show how a probabilistic rule-based argumentation framework can capture Markov decision processes and reinforcement learning agents; and how the framework allows us to characterise agents and their argument-based motivations from both a logic-based perspective and a probabilistic perspective. We advocate and illustrate the use of our approach to capture models of agency and norms, and argue that, in addition to providing a novel method for investigating agent types, the unified framework offers a sound basis for taking a mentalistic approach to agent profiles.

Keywords

Probabilistic argumentation · Markov decision process · Reinforcement learning · Norms

1 Introduction

Probabilistic argumentation (PA) and reinforcement learning (RL) address, from different angles, issues pertaining to bounded rationality.

Formal argumentation addresses bounded rationality by modelling defeasible reasoning. Formal argumentation frameworks represent partial knowledge as arguments and relations (e.g. attack and support) amongst them, resolve conflicts arising from competing arguments by assessing their comparative strengths, derive defeasible conclusions and update such conclusions in the light of new information. Recently, formal argumentation has also been studied in probabilistic settings, leading to PA frameworks, so that the alternative statuses of arguments are events having probability values. Such probabilistic investigations endow formal argumentation with the ability to address bounded rationality from a qualitative and quantitative perspective.

On the other hand, RL addresses bounded rationality by modelling agents interacting with their environment in a trial-and-error style: agents receive ‘rewards’ (i.e. positive reinforcements) if their actions lead to desirable outcomes in the long run, and receive ‘punishments’ (i.e. negative reinforcements) otherwise. The goal of agents is to maximise their long-term reward. RL agents learn to achieve this goal either by selecting the actions that appear preferable according to the agents’ previous experiences (exploitation), or by trying other actions that have the potential to bring higher long-term rewards (exploration). RL is widely used as a technique for sequential decision making in stochastic environments (‘stochastic’ means here that the outcome of an action is not deterministic), so that agents can learn the optimal behaviour without being explicitly taught.

This paper brings together PA and RL: we quantitatively measure argument values by a utility function, and automatically learn utility values by RL. Arguments with higher utilities have a higher chance of being used to back certain behaviours. We ensure that potentially useful arguments are drawn with some non-zero probability, so as to prevent RL from getting stuck in a local optimum.

By integrating PA and RL in a computational framework, we aim at providing a fresh comprehensive tool to model intelligent adaptive behaviour, in two research domains:
  • argument-based theoretical models of natural agency, that is, models that explain and forecast how diverse kinds of natural agents will behave or would behave under particular contexts or circumstances;

  • argument-based operational models of agent-based systems, that is, computer systems of adaptive agents using argumentation and RL to make rational choices in highly stochastic environments, with noisy (i.e. imperfect and uncertain) information.

As to argument-based models of natural agency, our intention is to provide an alternative to procedural platforms, which are traditionally used to model or simulate intelligent behaviour. The distinctive advantage of the proposed approach is that it provides formal, declarative, argument-based specifications of the agents under study, and that these specifications are directly executable. Since the execution process strictly follows the specifications, the system designer no longer needs to manually check that the executed behaviour conforms to the specifications. Because the specifications are declarative and argument-based, the framework can facilitate the exchange of arguments amongst scientists from different perspectives (e.g. social scientists, economists, jurists or computer scientists), and help them to build formal, executable and rich interdisciplinary theories of natural agency.

As to argument-based models for agent-based systems, our intention is to investigate a possible combination of PA and RL, as complementary ways to endow bounded rational agents with the ability to cope with uncertain environments, towards smarter agent-based applications. Though a large amount of theoretical work has focused on argumentation, we must acknowledge that argumentation has not yet found many applications in ‘real-life’ computer systems. RL is certainly much more successful in this regard, as it has been widely applied in real-life applications. Hence, we hope that argumentation will find useful applications by being associated with RL.

Our framework supports the fine-grained characterisation of cognitive profiles of RL agents. So far, logic-based frameworks have been used to characterise agent types, i.e. to analyse the reasoning of agents and in particular various ways to approach conflicts between varied motivations (e.g. self-interest vs. norm compliance). By combining PA and RL, we will characterise agents and their argument-based motivations from a logical as well as a probabilistic perspective, thus providing a novel method to investigate agent types. In particular, since formal argumentation supports formal accounts of normative and legal reasoning, we hope that this framework will ease the analysis and construction of models where both norms and uncertainty play an important role.

This paper is organised as follows. In Sect. 2, we further motivate the combination of PA and RL for our purposes. The RL framework is formally presented in Sect. 3. Our argumentation setting is introduced in Sect. 4, and its PA development is detailed in Sect. 5. We discuss in Sect. 6 how to build argument-based RL agents on the basis of our combination of PA and RL. In Sect. 7, we illustrate how the framework can be exploited to characterise profiles of argument-based RL agents from a logic-based and probabilistic perspective, before concluding in Sect. 8.

2 Motivations

In this section, we further motivate our approach to bounded agents by combining PA and RL, and we do so from three perspectives. Firstly, we introduce reasons for combining these two computational frameworks to build our agents (Sect. 2.1). Secondly, we motivate our proposal with respect to possible applications (Sect. 2.2). Finally, the approach is motivated by limitations of related work (Sect. 2.3).

2.1 Probabilistic argumentation and reinforcement learning

When modelling, specifying or building an agent reasoning with some knowledge within a formal framework, we face the problems of how to represent the knowledge and how to reason with it. We adopt a declarative approach, rather than a procedural one, as we think that declarative models can be more easily updated and used to test alternative hypotheses, which is essential for both a reasoning agent and scientists investigating such an agent. Declarative models can also be viewed as executable specifications [3, 4]. As a declarative language, we employ a logic-based formalism to capture the cognitive states of an agent and to enable reasoning on the basis of such states. So, statements of this language represent an agent’s beliefs, internalised obligations, desires and actions. Defeasible inferences are applied to such cognitive statements to generate further statements and so determine the agent’s actions.

The choice of a declarative representation of knowledge coupled with defeasible inferences leads us to adopt a non-monotonic logic-based framework. As our non-monotonic logic-based framework, we endorse formal argumentation. This allows us to take advantage of work on argumentation in knowledge representation, non-monotonic reasoning and decision making [5, 43], and in normative reasoning [42, 50], through the modular integration of different aspects of reasoning, such as the construction of arguments and the acceptance of arguments and statements [7, 12]. By adopting an argument-based approach, declarative models and hypotheses can be more naturally updated and argued upon.

Whilst argumentation is well adapted to reason with partial information and conflicts, it does not deal with measures of uncertainty as conceived in studies of randomness or stochasticity. To deal with uncertainty as related to the degree of credibility of arguments and statements, we supplement argumentation theory with probability theory. In particular, we adopt an energy-based model [29] for argumentation [44, 45, 46]. Energy-based models attach a scalar quantity called an energy to any assignment of the variables in the model. Then, on the basis of energies, each assignment is given a probability. Making inferences with an energy-based model consists of comparing the energies associated with various assignments of the variables, and choosing the one with the smallest energy. In this view, the energy reflects the utility or the credit one puts into a possible assignment of the argumentation system; an assignment associates every argument with an acceptance status. On this basis we determine the probability that an argument has a certain status, and compute the probability distribution of arguments’ statuses. The energies of these configurations can be fixed by human operators or learnt from data or experience by means of machine learning.
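
To make this energy-based reading concrete, the following minimal Python sketch (with purely illustrative argument names and energies) maps energies attached to complete assignments of argument statuses to probabilities, using the common convention \(P(x) \propto e^{-E(x)/T}\), so that the assignment with the smallest energy is the most probable one.

```python
import math

# Hypothetical energies attached to three complete assignments of statuses
# for two arguments A and B (lower energy = more credible assignment).
energies = {
    ("A:IN", "B:OUT"): 1.0,
    ("A:OUT", "B:IN"): 2.5,
    ("A:UND", "B:UND"): 4.0,
}

def assignment_distribution(energies, temperature=1.0):
    """Turn energies into probabilities with P(x) proportional to exp(-E(x)/T)."""
    weights = {x: math.exp(-e / temperature) for x, e in energies.items()}
    z = sum(weights.values())  # normalisation constant (partition function)
    return {x: w / z for x, w in weights.items()}

probabilities = assignment_distribution(energies)

# Probability that argument A has status IN: sum over assignments where A is IN.
p_a_in = sum(p for x, p in probabilities.items() if "A:IN" in x)
print(probabilities, p_a_in)
```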

To use RL to learn the energies of argumentation configurations, we formulate the energy-based model of probabilistic argumentation as Markov decision processes (MDPs). MDPs are widely used to formulate sequential decision making (SDM) problems: rather than making a ‘one-shot’ decision to maximise the immediate utility, agents in SDM problems need to select sequences of actions covering manifold situations to maximise the long-term accumulated utility. In stochastic environments, in which an action a in a state s may trigger transitions to different states along with different utility values, a learning agent does not know a priori the distributions over the possible utilities and states. Through RL, an agent aims at achieving, or at least approximating, a policy that maximises the rewards in the long run, where a policy maps perceived states of the environment to actions to be performed when in those states. In our PA framework for RL argument-based agents, a policy maps perceived states of the environment to possible statuses of its arguments, and each argument in turn is for or against some executable actions. By trial-and-error, RL learns the best policy and its corresponding distribution of arguments’ statuses, hence the energies.

While RL is used to learn the energies of particular argumentation configurations of an agent, the use of PA can be helpful to address problems of common RL settings. In particular, a common problem pertains to documenting decisions [32], i.e. the problem of explaining why some actions are good or bad in a state, and of documenting the development of corresponding policies. To address this problem, mental statements, such as beliefs, desires and actions, can be justified or discarded by the interplay of arguments to back decision-making. Another common problem pertains to the size of the state space, whose expansion involves an exponential growth of the computational requirements for obtaining optimal policies. This is the famous curse of dimensionality faced by common RL algorithms [54]. In this regard, a combination of PA and RL can be useful to apply arguments to new situations similar to previously encountered ones [55]. So, by representing (mental) states with statements supported by arguments, the decision-making process of our RL agents may become more informative and understandable, and thus possibly more easily reusable.

In many real applications, agents have only partial knowledge of the environment, and SDM problems in such scenarios can be modelled as partially observable MDPs (POMDPs). In POMDPs, an agent partially observes its current state, makes decisions by deriving defeasible conclusions, and updates these conclusions in light of new observations. If the agent has full knowledge of the current state, then a POMDP boils down to an MDP. In this paper, and as a necessary first step, we work with argument-based MDPs, i.e. an argument-based knowledge representation and reasoning setting for MDPs, paving the way to POMDP counterparts.

We will not provide our agents with the ability to ‘observe’ norms or normative rules (this is an issue left to future research), but we will endow them with some initial normative knowledge. By doing so, we can also explore the significance of providing an RL agent with such normative knowledge to select arguments leading to useful decisions.

2.2 Applications

The probabilistic argument-based setting with RL paves the way for two kinds of applications. The first kind concerns ways to animate an agent, e.g. to provide an agent with a mechanism for reasoning and learning. The second kind deals with modelling the agent, i.e. describing this agent with conciseness and precision. These applications can respond to two types of research investigations:
  • argument-based theoretical models about natural agents,

  • argument-based operational models for computer systems.

When investigating theoretical models describing people’s behaviours, these theoretical models may inspire the design of operational models of computer systems (as in bio- and socio-inspired computing), while the exercise of operational models may shed light on people’s behaviours.

Logic-based frameworks are well established for investigating operational models of computer systems. Yet logic-based investigations are sometimes criticised with regard to their applications in social sciences, and in particular agent-based social simulation (ABSS). Experimental insights are possible from social simulations, and logic-based investigations can be employed to study some relevant concepts and their mathematical properties (axiomatisation, decidability, complexity, etc.). Nevertheless, such investigations are sometimes deemed to add nothing interesting to the understanding of social phenomena, and might lead to “empty formal logic papers without any results” [20].

In contrast to recurrent criticisms of logic-based approaches in ABSS, these approaches are defended by other researchers. For example, [21] argues that logic can be useful in ABSS because a logical analysis based on (a) a philosophical or sociological theory, (b) observations and data about a particular social phenomenon, and (c) intuitions or a blend of them can provide the requirements and the specification for an ABSS system and more generally MAS. Moreover, a logic-based system might help to check the validity of the ABSS model and to adjust it by way of having a clear understanding of the formal model underpinning it.

As far as we are concerned, and for both argument-based investigations of theoretical models and operational models, since formal argumentation is particularly suitable to provide formal accounts of legal reasoning, our framework is meant to support the analysis and construction of models where norms play an important role. In particular, norm-governed computer systems could profit from argument-based analyses and techniques developed for modelling legal reasoning, in order to formalise normative systems tailored to govern computer systems. This view is in line with Conte et al. [17], who enquire how to fill the gap between the frameworks of autonomous agents and legal theory. In this view, if the interaction between models of normative systems and cognitive agents requires integrating works in the legal and MAS domains, then a common formalism, for instance argumentation, could significantly facilitate such an integration.

2.3 Related work

Since desires, goals, plans and actions often conflict or can be justified in alternative ways, computational models of argument have been set up to model decision making and practical reasoning, i.e. reasoning about what it is best for a particular agent to do in a given situation.

An early line of investigation was initiated by J. Fox and S. Parsons, where argumentation frameworks were proposed for practical reasoning [36], for instance for making decisions about the expected value of actions [23]. Another early line of research was carried out by Pollock [38] at the crossroads of artificial intelligence and philosophy, proposing a general theory of rationality and its implementation in OSCAR, an architecture for an autonomous rational agent using arguments. More recent works are varied, although they often employ (adaptations of) Dung’s argumentation frameworks to capture arguments for (in)compatible goals or desires, and plans to achieve them. For example, Amgoud [2] proposed an argument-based model for decision-making in two phases: in the first inference phase, arguments in favour of/against each option (action) are built and evaluated in relation to some semantics; in the second comparison phase, pairs of alternative options are compared using a given criterion. Other work took a scheme-based approach where argument schemes and critical questions are applied to practical reasoning. For instance, Atkinson and Bench-Capon [6] described an approach to practical reasoning based on the Action-based Alternating Transition System (AATS) [56] to reason rigorously about actions and their effects, and to presumptively justify actions through the instantiation of argument schemes, which are then examined through a series of critical questions. Practical reasoning can also be considered with norms, leading to ‘normative practical reasoning’. For example, Oren [35], inspired by [6], describes a formal normative model based on AATS along with argument schemes and critical questions to reason about how goals and obligations lead to preferences over the possible executions of the system, while Shams et al. [51] propose another formal framework for normative practical reasoning that is able to generate consistent plans for a set of conflicting goals and norms. The above-mentioned approaches to argument-based practical reasoning have the advantage of determining decisions with rich, argued explanations meant to be easy to understand, but they incorporate no (reinforcement) learning.

There has been some work devoted to integrating argumentation frameworks and RL, so as to increase learning speed. For example, Gao et al. [24, 25] proposed the argumentation-accelerated reinforcement learning (AARL) framework, which can be applied to both single-agent and cooperative multi-agent problems. They built a variant of the value-based argumentation framework [10] to represent people’s (possibly conflicting) domain-specific beliefs and to derive the ‘good’ action for each agent using some argumentation semantics. The framework uses a potential-based reward shaping technique to recommend ‘good’ actions to agents, and the authors showed that this approach can improve the learning speed in single-agent and cooperative multi-agent problems. However, the AARL framework does not take into account the degree of uncertainty attached to arguments.

It is often assumed that reasoning about beliefs should be sceptical while reasoning about actions should be credulous. These assumptions lead to logic-based systems that combine different semantic accounts for epistemic and practical reasoning, see e.g. [39]. Beyond computational models of argument, the use of different semantics is also backed by the idea that actions have no truth values, unlike the statements that underlie beliefs. Besides, when degrees of uncertainty are attached to arguments and their conclusions, it is interesting to investigate to what extent a unique semantics could be devised to cover both epistemic and practical reasoning patterns, and we will do so.

To address the combination of epistemic and practical reasoning with learning abilities, we propose a PA framework allowing us to seamlessly integrate an argument-based knowledge representation setting, probability theory and reinforcement learning. Computational models of argumentation and probabilistic considerations can be combined in various ways (see e.g. [28, 44, 52]). In this paper, we reappraise the setting of [47] for multi-agent systems (MAS) with the approach of ‘probabilistic labellings’ as developed in [44]. Instead of exposing diverse inference rules for epistemic and practical reasoning [39], our PA framework uses a unique labelling specification for both types of reasoning. Instead of planning [2, 6, 35, 51], we embrace RL. As a consequence, instead of an AATS to reason about actions and their effects (as in [6, 35, 51]), we adopt the standard MDP setting for RL, so that we can capture the uncertainty on actions, their effects and the environment. Instead of using an AATS along with argument schemes and critical questions to generate arguments, we represent MDPs and RL agents with arguments, leading to what we call argument-based MDPs and RL. By doing so, we obtain an argument-based knowledge representation framework combined with a classical decision setting that directly uses a probability distribution and a utility (value) function over alternative decisions. This allows us to fill a gap between argument-based practical reasoning, where arguments are meant to qualitatively explain decisions but where the valuation of decisions is elusive, and classical decision theory, where choices are typically attached some values but can be difficult to explain in detail from a qualitative perspective with arguments. The gap between argument-based practical reasoning and classical decision making has not been left unexplored, see e.g. work going back to [23], and thus our intention is also to provide an RL-based alternative to address it.

As to argument-based MDPs and RL, it turns out that Y. Gao et al.’s AARL framework [24, 25] can be viewed as a special case of our probabilistic argumentation framework. In their approach, they selected the ‘applicable’ arguments, i.e. the arguments whose premises are satisfied, in each state, and only used these applicable arguments to derive the ‘good’ actions for agents in this state. As we will see, this can be implemented in our setting by assigning probability zero to all labellings in which these arguments are inapplicable.

Besides the above works, which are rather oriented towards operational models for computer systems, a model which includes argumentation, probability and learning may provide a fresh approach to normative social psychology and cognition, and facilitate the interaction and integration of manifold studies on law and norms. Empiricist approaches to the law view single norms as socio-psychological entities, i.e. as beliefs, accompanied by conative states, entertained by individual agents [37, 48]. Theorists of legal logic and argumentation [1] prefer to focus on ideal patterns for normative reasoning and rational practical interaction. Cognitive scientists focus on norms as specific instances of socio-cognitive content, and on learning processes of imitation and adaptation through which agents detect social norms and endorse them [15, 16]. Some attempts to integrate psychological and logical-argumentative aspects exist (see e.g. [50]), but they fail to take learning into account. As our normative agents can endorse norms within a probabilistic argument-based setting, reason and act according to arguments, and learn through experience the reliability and utility of arguments and combinations of them, we are able to start merging psychological, logic-argumentative and cognitive insights into a new synthesis.

In the rest of the paper, we thus investigate a rule-based PA framework to study and animate argument-based executable specifications of MDPs, where argument-based RL agents can cope with the uncertainty arising from partial information and conflicting reasons, as well as the uncertainty arising from the stochastic nature of their actions and their environment.

3 Reinforcement learning setting

In this section, we outline Markov decision processes (MDPs) (Sect. 3.1), a mathematical framework widely used to model reinforcement learning problems. Then, we introduce a popular RL algorithm, SARSA, and briefly illustrate its use in an MDP (Sect. 3.2).

3.1 Markov decision processes

An MDP is a mathematical framework for modelling decision making when the outcomes of actions are partly unknown. In this work, we focus on finite-horizon MDPs with discrete states and actions. In addition, we assume the next state is only determined by the current state and action, i.e. we work under the first-order Markovian assumption (higher-order Markovian assumptions can be reduced to the first-order case by including historical information in the state). We study these MDPs because they have been widely used in RL research, and they provide a simple yet generic mathematical foundation for our RL framework.

Definition 3.1

(Markov decision process, MDP) An MDP is a tuple \((S, A, P, R)\), where
  • S is the set of states,

  • A is the set of actions,

  • \(P(s' \mid s,a)\) is the transition probability of moving from state s to \(s'\) by performing action a,

  • \(R(s' \mid s,a)\) gives the immediate reward received when action a is executed in state s, moving to state \(s'\).

While the above definition is a common characterisation of MDPs featuring abstract atomic actions, these actions may be replaced by structured concepts such as complex ‘attitudes’ or ‘behaviours’. We will do so later when building argument-based MDPs. In the remainder of this section, and for the sake of clarity, MDPs remain characterised in terms of such abstract atomic actions.

Example 3.1

We illustrate our discourse with the small MDP shown in Fig. 1. Its four components are as follows.
  • States. There are two states in this MDP: safe and danger. Formally, the state set is \(S = \{\mathrm{safe}, \mathrm{danger}\}\).

  • Actions. There are two actions in this MDP: care and neglect; both actions are available in each state. Formally, we have \(A = \{\mathrm{care}, \mathrm{neglect}\}\).

  • Transition probability. From Fig. 1 we can see that when the agent is in state safe and performs action care, it has a 0.99 probability of remaining in state safe, and a 0.01 probability of moving to state danger. Formally, this transition dynamic can be represented by \(P(\mathrm{safe}|\mathrm{safe}, \mathrm{care}) = 0.99\) and \(P(\mathrm{danger}|\mathrm{safe}, \mathrm{care}) = 0.01\). Other transition probabilities can be represented similarly. Of course, in many real applications (e.g. Robot Soccer games [53] and helicopter control [34]), the transition probabilities are unknown.

  • Rewards. We give two rewards below; other reward values can be read off similarly. From Fig. 1, when the agent is in state safe and performs action care, if the agent remains in state safe, it receives a reward of 1; otherwise, if the agent moves to state danger, it receives a reward of \(-11\). These rewards can be denoted as \(R(\mathrm{safe}|\mathrm{safe},\mathrm{care}) = 1\) and \(R(\mathrm{danger}|\mathrm{safe},\mathrm{care}) = -11\).

Fig. 1

An MDP graph. Each transition from an action to a state is represented by an arrow labelled with its probability and associated material payoff
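
For concreteness, the components of Example 3.1 can be written down directly as tables; the Python sketch below encodes only the probabilities and rewards stated above (the remaining entries would be read off Fig. 1 in the same way) and samples one environment step from them.

```python
import random

# States and actions of Example 3.1.
S = ["safe", "danger"]
A = ["care", "neglect"]

# Transition probabilities P[(s, a)][s'] and rewards R[(s, a)][s'].
# Only the entries spelled out in the text are filled in; the other
# (state, action) pairs would be completed from Fig. 1.
P = {("safe", "care"): {"safe": 0.99, "danger": 0.01}}
R = {("safe", "care"): {"safe": 1.0, "danger": -11.0}}

def step(state, action):
    """Sample the next state according to P and return it with its reward."""
    outcomes = P[(state, action)]
    next_state = random.choices(list(outcomes), weights=list(outcomes.values()))[0]
    return next_state, R[(state, action)][next_state]

print(step("safe", "care"))  # ('safe', 1.0) with probability 0.99
```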

MDPs provide a simple framework to model RL problems because of their memoryless property (also known as the Markovian property): the next state is determined only by the current state and the current action, not by earlier states or actions. Formally,
$$\begin{aligned} P(s_{t+1}|s_t,a_t) = P(s_{t+1} | s_t, a_t,\ldots , s_0, a_0), \end{aligned}$$
(3.1)
where \(s_t\) and \(a_t\) are the state and action at time step t, respectively. Similarly, we can see that the reward functions in MDPs also have this memoryless property. Due to this property, when making decisions in each state, instead of storing and taking into account all earlier states and actions, we only need to look at the current state and action, thus significantly simplifying the decision making problem [54].
The goal of planning in an MDP is to find a policy \(\pi : S \rightarrow A\), specifying for each state the action to take, so as to maximise the discounted sum of future rewards. The Q-value \(Q^\pi (s,a)\) is the expected discounted long-term return for executing action a in state s and following a policy \(\pi \) thereafter:
$$\begin{aligned} Q^\pi (s, a) = E[r_t+\gamma r_{t+1} + \gamma ^2 r_{t+2} \ldots |s_t=s, a_t = a , \pi ], \end{aligned}$$
(3.2)
where \(E(\cdot )\) denotes the expected value and \(\gamma \in \mathbb {R}, \gamma \in [0,1]\) is the discount factor, indicating how ‘short-sighted’ the agent is (the smaller the value of \(\gamma \), the more short-sighted the agent is, because it puts less weight on later received rewards). The Q-values satisfy the Bellman equation for a fixed policy \(\pi \):
$$\begin{aligned} Q^\pi (s,a) = \sum _{s'}P(s'|s,a)[R(s'|s,a) + \gamma Q^\pi (s', \pi (s'))]. \end{aligned}$$
(3.3)
For an optimal policy \(\pi ^*\), the Q-value is:
$$\begin{aligned} Q^*(s,a) = \sum _{s'}P(s'|s,a)[R(s'|s,a) + \gamma \max _{a'} Q^*(s',a')]. \end{aligned}$$
(3.4)
From Eqs. (3.3) and (3.4), we can see that the Q-value of the current state-action pair, \(Q(s,a)\), can be recursively defined by the Q-value of the ensuing state-action pair \(Q(s', a')\). By doing this, the problem of obtaining the current Q-value can be solved by obtaining the next Q-value and the immediate reward (the latter is trivial, because immediate rewards can be directly observed). Since the next state is ‘closer’ to the termination state, computing the next state-action pair’s Q-value is slightly simpler than computing the current one. From a computer science perspective, Bellman’s equation breaks the decision problem in MDPs into smaller subproblems, and suggests that the decision making problems in MDPs have optimal substructure, i.e. an optimal solution can be constructed efficiently from optimal solutions of its subproblems. As a result, dynamic programming (DP) can be applied to find the optimal policies in an MDP [9, 18].
From a more mathematical perspective, Bellman’s equation guarantees the convergence property of the Q-value functions. To be more specific, Eq. (3.3) can be rewritten as:
$$\begin{aligned} T^{\pi } Q^\pi (s,a) = Q^\pi (s,a) \end{aligned}$$
(3.5)
where \(T^{\pi }\) is the Bellman operator underlying \(\pi \) such that
$$\begin{aligned} (T^{\pi }Q)(s,a) = \sum _{s'}P(s'|s,a)[R(s'|s,a) + \gamma Q(s',\pi (s'))]. \end{aligned}$$
(3.6)
This is a linear system of equations in \(Q^\pi \), and \(T^{\pi }\) is an affine linear operator [30]; if \(0< \gamma < 1\) then \(T^{\pi }\) is a maximum-norm contraction and the fixed-point equation \(T^{\pi } Q^\pi (s,a) = Q^\pi (s,a)\) has a unique solution for any \(s \in S\) [11].
Given \(Q^*\), the optimal policy \(\pi ^*\) at state s can be obtained as follows:
$$\begin{aligned} \pi ^*(s) = \mathop {{{\,\mathrm{\arg \!max}\,}}}\limits _a Q^*(s,a). \end{aligned}$$
(3.7)
Since the computational requirements for obtaining the optimal policies grow exponentially with the state size [54], directly applying DP techniques can hardly solve any real MDP problem. Furthermore, in most real applications the transition probabilities are not available a priori, so DP cannot be directly applied. RL algorithms are techniques for finding the optimal policies in this setting. In the next subsection, we introduce a popular RL algorithm called SARSA, which is (under certain conditions) guaranteed to find the optimal policy without prior knowledge of the transition probabilities.
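
As an illustration of this dynamic-programming route (and not of the RL algorithm introduced next), the sketch below performs standard Q-value iteration, repeatedly applying the Bellman optimality backup of Eq. (3.4); it assumes complete transition and reward tables P and R shaped like those sketched after Example 3.1.

```python
def q_value_iteration(S, A, P, R, gamma=0.9, sweeps=1000):
    """Repeatedly apply the Bellman optimality backup of Eq. (3.4) to a
    tabular Q-function until it (approximately) converges."""
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(sweeps):
        Q = {
            (s, a): sum(
                p * (R[(s, a)][s2] + gamma * max(Q[(s2, a2)] for a2 in A))
                for s2, p in P[(s, a)].items()
            )
            for s in S
            for a in A
        }
    return Q

def greedy_policy(Q, S, A):
    """Read off the optimal policy of Eq. (3.7) from the converged Q-table."""
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
```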

3.2 SARSA algorithm

State-Action-Reward-State-Action (SARSA) [49] is a widely used RL algorithm for solving finite-horizon MDPs with discrete states and actions. SARSA is known for its simplicity and convergence property: under certain conditions, SARSA is guaranteed to converge to optimal policies with probability 1. The pseudo-code of SARSA is given in Algorithm 1. Note that an ‘episode’ in this algorithm (line 2) is the procedure that runs from placing the agent at the initial state to the agent reaching a terminal state. The initial and terminal states should be given by the system designer.
In Algorithm 1, \(\alpha \in \mathbb {R}, \alpha \in [0,1]\) (line 8) is the learning rate; when selecting actions (lines 4 and 7), a greedy in the limit with infinite exploration (GLIE) policy should be used to guarantee that SARSA converges to the optimal policies after an infinitely long time of learning [49]. In particular, a GLIE policy requires that (i) each action is executed infinitely often in every state that is visited infinitely often, and (ii) in the limit (i.e. after an infinitely long time of learning), the policy is greedy with respect to the Q-value function with probability 1. In practice, softmax [54] is widely used as an approximate GLIE policy. By using softmax, in each state s, the agent chooses to perform action a with probability:
$$\begin{aligned} \frac{e^{Q(s, a)/\tau }}{ \sum _{a_{j} \in A(s)} e^{Q(s, a_{j})/\tau }}, \end{aligned}$$
(3.8)
where \(A(s) \subseteq A\) is the set of available actions in state s, and \(\tau \in \mathbb {R}, \tau > 0\) is used to balance exploration (i.e. choosing some random action) and exploitation (i.e. choosing the greedy action with respect to the current Q-values). As learning proceeds, \(\tau \) should asymptotically approach 0, so as to ensure that softmax meets the requirements of GLIE policies. Each iteration of the loop between lines 5 and 9 is referred to as a learning step. Each learning step is a round of interaction between the SARSA-based agent and the environment.
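
Because Algorithm 1 is only described line by line here, the following minimal Python sketch of a SARSA loop with softmax action selection (Eq. 3.8) may help; `step` is an assumed environment function returning the next state and reward (like the one sketched after Example 3.1), `is_terminal` an assumed terminal-state test, and the temperature is kept fixed for readability (a GLIE policy would decay it over time).

```python
import math
import random

def softmax_action(Q, s, actions, tau):
    """Choose an action in state s with the probabilities of Eq. (3.8)."""
    weights = [math.exp(Q.get((s, a), 0.0) / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]

def sarsa(step, is_terminal, start_state, actions,
          episodes=100, max_steps=1000, alpha=0.1, gamma=0.9, tau=1.0):
    """Tabular SARSA: on-policy temporal-difference control."""
    Q = {}  # unseen state-action pairs default to a Q-value of 0
    for _ in range(episodes):
        s = start_state
        a = softmax_action(Q, s, actions, tau)
        for _ in range(max_steps):                     # one learning episode
            s2, r = step(s, a)                         # act and observe the outcome
            a2 = softmax_action(Q, s2, actions, tau)   # choose the next action
            td_error = r + gamma * Q.get((s2, a2), 0.0) - Q.get((s, a), 0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error  # SARSA update
            s, a = s2, a2
            if is_terminal(s):
                break
    return Q
```

For an MDP without terminal states, such as that of Example 3.1, `is_terminal` can simply always return False, so that each episode is cut off after `max_steps` learning steps.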

Example 3.2

To illustrate how the SARSA algorithm works, we apply SARSA to the MDP illustrated in Fig. 1. We let the learning rate \(\alpha = 0.1\), the discount rate \(\gamma = 0.9\) and the temperature \(\tau = 1\). Also, we initialise all Q-values to 0 (line 1 in Algorithm 1).

Suppose the agent is currently in state safe; thus, s is safe (line 3). Then we choose an action in this state. Since \(Q(\mathrm{safe},\mathrm{care}) = Q(\mathrm{safe},\mathrm{neglect}) = 0\), according to the softmax policy (Eq. 3.8), the probabilities of choosing care and neglect are both 0.5. Suppose we choose care, i.e. we let \(a = \mathrm{care}\) (line 4). Then we perform this action and observe its outcome (line 6). Suppose that, by performing care, the agent remains in state \(\mathrm{safe}\) and receives a reward of 1 (note that the agent does not know the transition probability a priori). Thus, \(s' = \mathrm{safe}\) and \(r = 1\). We use softmax to choose action \(a'\) in state \(s'\) (line 7). Again, since all Q-values retain their initial values (i.e. 0), both actions have the same probability of being chosen. Suppose \(a' = \mathrm{neglect}\). Then we can update the Q-value of the state-action pair \((s, a)\) (line 8): the new \(Q(s, a)\) value is 0.1, i.e. \(Q(\mathrm{safe}, \mathrm{care}) = 0.1\). After s and a are updated (line 9), the first learning step finishes.

In the second learning step (starting from line 5), recall that \(s= \mathrm{safe}\) and \(a = \mathrm{neglect}\). By performing a in s, suppose that the agent moves to state safe and receives a reward of 2. Thus, \(s' = \mathrm{safe}\) and \(r = 2\). Then we have to choose an action in \(s'\), using the softmax policy (line 7). Since \(Q(\mathrm{safe}, \mathrm{care}) = 0.1\) and \(Q(\mathrm{safe},\mathrm{neglect}) = 0\), by Eq. (3.8) the probabilities of choosing care and neglect are approximately 0.52 and 0.48, respectively. Thus, we see that after the first learning step, care has a higher probability of being chosen. Suppose \(a' = \mathrm{care}\); then we can update \(Q(\mathrm{safe},\mathrm{neglect})\) (line 8). Recall that \(Q(s',a') = Q(\mathrm{safe}, \mathrm{care}) = 0.1\). We can easily obtain that the new \(Q(\mathrm{safe}, \mathrm{neglect})\) value is 0.209. After we update s and a (line 9), the second learning step finishes.
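
The two updates above can be checked by substituting the numbers into the standard SARSA update (applied at line 8 of Algorithm 1), \(Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s',a') - Q(s,a)]\):

```python
alpha, gamma = 0.1, 0.9

# First learning step: s = safe, a = care, r = 1, Q(s', a') = Q(safe, neglect) = 0.
q_safe_care = 0.0 + alpha * (1 + gamma * 0.0 - 0.0)             # -> 0.1

# Second learning step: s = safe, a = neglect, r = 2, Q(s', a') = Q(safe, care) = 0.1.
q_safe_neglect = 0.0 + alpha * (2 + gamma * q_safe_care - 0.0)  # -> 0.209
```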

Since there is no terminal state in this MDP, in theory the learning loop continues infinitely. In practice, we can exit the loop after a certain number of learning episodes. \(\square \)

As Example 3.2 illustrates, to learn a good policy, SARSA only needs to store a Q-value table over state-action pairs and to perform very simple Q-value updates (line 8) in every learning step; after enough learning time and under certain conditions (roughly speaking, the \(\tau \) and \(\alpha \) values should approach 0 at certain rates; see [49] for the detailed conditions), the Q-value of each state-action pair is guaranteed to converge to its optimal value (defined in Eq. 3.4) with probability 1, and the optimal policies can thus be derived. If the numbers of states and actions are huge, we may use function approximation techniques to approximate the values of \(Q(s,a)\), so as to avoid the huge expense of storing the \(Q(s,a)\) matrix.

4 Argumentation setting

In this section, we present a minimalist rule-based argumentation framework and its abstract account (Sect. 4.1). Then, we move on to the specification of the acceptance labelling of arguments (Sect. 4.2) and statements (Sect. 4.3).

4.1 Argumentation framework

We first present a rule-based argumentation framework and its abstract account, thereby specifying the structures on which we will develop our probabilistic setting. This rule-based argumentation setting is ‘minimalist’, so that we avoid the discussion of features which are unnecessary to our goal of providing a basic probabilistic account of argumentation. The argumentation setting is inspired by [19, 40], and uses some (adaptations of) their definitions.

The language of the framework is built from literals. A literal is either an atomic formula (atom) or its negation, and we usually denote a literal as \(\varphi \). For any literal \(\varphi \), the complementary literal is a literal corresponding to the negation of \(\varphi \).

Notation 4.1

We write \(- \varphi \) to denote the complementary literal of \(\varphi \): if \(\varphi \) is an atom p then \(-\varphi \) is \(\lnot p\), and if \(\varphi \) is \(\lnot p\) then \(-\varphi \) is p.

We adopt a discrete temporal setting: the timeline is discretised into a set of ‘time slices’ or ‘time steps’ or ‘instants (of time)’ or ‘timestamps’ etc., so that the studied system is described by temporal modal literals, that we also call statements, taken at intervals that are regularly spaced with a predetermined time granularity \(\varDelta \).

Definition 4.1

(Temporal modal literals, a.k.a. statements) Let
  • \(Atoms \) be a set of propositional atoms,

  • \(Lit = Atoms \cup \{ \lnot p \mid p \in Atoms \}\) a set of literals,

  • \(Mod = \{ \square _1, \ldots , \square _n \}\) a set of modal operators, and

  • \(Times \) a discrete totally ordered set of timestamps \(\{t_1, t_2, \ldots \}\), such that \(t_{i+1}-t_i = \varDelta \), \(\varDelta \in \mathbb {R}^{+}\).

The set of temporal modal literals (statements) with respect to the above sets is the set \(\{ \square \varphi \mathrm {\,at\,} {t} \mid \square \in Mod,\, \varphi \in Lit,\, t \in Times \}.\)

Notation 4.2

  1. A set of statements will be usually denoted \(\varPhi \), and the set of statements holding at time t as \(\varPhi ^t\), such that \(\varPhi ^t = \{ (\square \varphi \mathrm {\,at\,} t) \mid (\square \varphi \mathrm {\,at\,} t)\in \varPhi \} \).

  2. A statement may be denoted \(\varphi \) when the modal and temporal information has no importance.

Given a statement, we may be interested only in its modal literal part without the timestamp, i.e. its ‘atemporalised’ statement.

Definition 4.2

(Atemporalised statements)
  • Given a statement \((\square \varphi \mathrm {\,at\,} t)\), its atemporalised statement is \((\square \varphi )\).

  • Given a set of statements \(\varPhi ^t\), its atemporalised set of statements is \(\varPhi = \{ \square \varphi \mid (\square \varphi \mathrm {\,at\,} t) \in \varPhi ^t \}.\)

Definition 4.1 can be ‘instantiated’ to specify a particular set of statements (including its constituents such as its set of instants of time), but we will often omit mentioning such a set to avoid overloading the presentation.

Given a set of statements, we can build defeasible rules, so that some statements (defeasibly) support a particular statement.

Definition 4.3

(Defeasible rule) Let \(\textsf {LabRules}\) denote a set of arbitrary labels. A defeasible rule over a set of statements \(\varPhi \) has the form \(r : \square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots , \square _n \varphi _n \mathrm {\,at\,} t_n, \sim \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_{1}, \ldots , \sim \square '_m \varphi '_{m} \mathrm {\,at\,} t'_{m} \Rightarrow \square \varphi \mathrm {\,at\,} t\) where
  1. \(r \in \textsf {LabRules}\) is the unique identifier of the rule;

  2. \(\square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots , \square _n \varphi _n \mathrm {\,at\,} t_n, \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_{1}, \ldots , \square '_m \varphi '_{m} \mathrm {\,at\,} t'_{m} \in \varPhi \) (\(0 \le n\) and \(0 \le m\)) are statements. \(\square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots ,\square _n \varphi _n \mathrm {\,at\,} t_n\) is the antecedent of the rule, which is the conjunction of statements. The set \(\mathrm {Body}(r) = \{\square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots ,\square _n \varphi _n \mathrm {\,at\,} t_n, \sim \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_{1}, \ldots , \sim \square '_m \varphi '_{m} \mathrm {\,at\,} t'_{m}\}\) is the body of the rule r;

  3. \((\square \varphi \mathrm {\,at\,} t) \in \varPhi \) is the consequent (or head) of the rule, which is a single statement. The consequent of the rule r is denoted \(\mathrm {Head}(r)\), \(\mathrm {Head}(r) = (\square \varphi \mathrm {\,at\,} t)\).
Note that we put no constraints on the timestamps of statements in the body and the head of a defeasible rule, because that is not necessary at this stage. However, some constraints are assumed later (to discard ‘retroactive’ rules for example) when operating on the Markov setting of classic MDPs.

Notation 4.3

Given a set of rules \(Rul\), we will use a notational shortcut \(Rul[\square \varphi \mathrm {\,at\,} t ]\) to denote the set of defeasible rules whose head is \((\square \varphi \mathrm {\,at\,} t)\):
$$\begin{aligned} Rul[\square \varphi \mathrm {\,at\,} t] = \{ r \mid r\in Rul, \mathrm {Head}(r)=(\square \varphi \mathrm {\,at\,} t)\}. \end{aligned}$$
Moreover, the identifier r of a defeasible rule may have a superscript t to indicate the time t associated with the consequent of this rule.

In the remainder, we may simply say ‘rule’ instead of ‘defeasible rule’. For the sake of simplicity, we discard the use of ‘strict’ rules in our setting, because they do not appear essential for our purposes.
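
To fix ideas, a defeasible rule of Definition 4.3 can be encoded as a plain record with a label, a body whose members are either ordinary statements or negation-as-failure premises (marked here with a hypothetical `naf` flag), and a head. The Python sketch below is one possible encoding of this definition and of Notation 4.3, not notation taken from the framework itself.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Statement:
    """A temporal modal literal, e.g. Statement('Bel', '-wet', 1) for Bel(not wet) at t1."""
    modality: str   # an element of Mod
    literal: str    # a literal; '-p' stands for the negation of atom p
    time: int       # a timestamp in Times

@dataclass(frozen=True)
class BodyMember:
    statement: Statement
    naf: bool = False   # True encodes a '~' (negation-as-failure) premise

@dataclass(frozen=True)
class Rule:
    label: str                    # unique identifier in LabRules
    body: Tuple[BodyMember, ...]  # Body(r)
    head: Statement               # Head(r)

def rules_with_head(rules, statement):
    """Notation 4.3: the subset of rules whose head is the given statement."""
    return {r for r in rules if r.head == statement}
```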

Rules may lead to conflicting statements (we ensure later that two conflicting statements cannot both be accepted). For this reason, we assume that a conflict relation is defined over statements to express conflicts in addition to those corresponding to negation.

Definition 4.4

(Conflict relation) A conflict relation over a set of statements \(\varPhi \), denoted \(\textit{Conflict} \), is a binary relation over \(\varPhi \), i.e. \(\textit{Conflict} \subseteq \varPhi \times \varPhi \), such that for any \((\square \varphi _1\mathrm {\,at\,} t), (\square \varphi _2 \mathrm {\,at\,} t)\in \varPhi \), if \(\varphi _1\) and \(\varphi _2\) are complementary (i.e. \(\varphi _1 = -\varphi _2\)), then \((\square \varphi _1\mathrm {\,at\,} t)\) and \((\square \varphi _2 \mathrm {\,at\,} t)\) conflict, i.e. \((\square \varphi _1\mathrm {\,at\,} t, \square \varphi _2\mathrm {\,at\,} t) \in \textit{Conflict} \).

In addition, if a statement \((\square \varphi _1\mathrm {\,at\,} t)\) conflicts with \((\square \varphi _2 \mathrm {\,at\,} t)\) for any modality \(\square \), i.e. \((\square \varphi _1\mathrm {\,at\,} t, \square \varphi _2\mathrm {\,at\,} t) \in \textit{Conflict} \), then we say that literal \(\varphi _1\) conflicts with \(\varphi _2\), denoted \(\textit{Conflict} (\varphi _1, \varphi _2)\) in the remainder.

Two rules may have conflicting heads, and in this case one rule may prevail over the other one. To express this, we use a rule superiority relation \(\succ \) over rules, so that \(r_1\succ r_2\) states that rule \(r_1\) is superior to rule \(r_2\).

Definition 4.5

(Superiority relation over defeasible rules) A superiority relation over a set of rules \(Rul\), denoted \(\succ \), is an antireflexive and antisymmetric binary relation over \(Rul\), i.e. \( \succ \subseteq Rul\times Rul\).

A defeasible theory is built from a set of rules, conflicts over statements and a superiority relation over rules.

Definition 4.6

(Defeasible theory) A defeasible theory is a tuple \(T = \langle Rul, \textit{Conflict}, \succ \rangle \) where \(Rul\) is a set of defeasible rules, \(\textit{Conflict} \) is a conflict relation, and \(\succ \) is a superiority relation over defeasible rules.

One may argue that the conflict relation over statements should be antireflexive or symmetric. Similarly, one may argue that the superiority relation should be transitive. However, we are aiming at a minimalist rule-based argumentation framework for reinforcement learning agents, and the addition of such properties would overload it, possibly without bringing any significant advantage. Hence, instead of overloading our approach with such properties, further specification of such relations is delegated to the designer of a defeasible theory.

By combining the defeasible rules in a theory, we can build arguments. We use the definition of arguments as given in [40], adapted to our setting.

Definition 4.7

(Argument) Let \(\textsf {LabArgs}\) denote a set of arbitrary labels. An argument A constructed (or built) from a defeasible theory \(T =\langle Rul, \textit{Conflict}, \succ \rangle \) is a finite rule-based construct of the form \(A:\, A_1, \ldots A_n, \sim \square _1 \varphi _{1} \mathrm {\,at\,} t_1, \ldots , \sim \square _m \varphi _m \mathrm {\,at\,} t_m \Rightarrow _r \square \varphi \mathrm {\,at\,} t\), where
  • \(A \in \textsf {LabArgs}\) is the unique identifier of the argument; and

  • \(0 \le n\) and \(0 \le m\), and \(A_1, \ldots , A_n\) are arguments constructed from T; and

  • \(r: \mathrm {Conc} (A_1), \ldots , \mathrm {Conc} (A_n), \sim \square _1 \varphi _{1} \mathrm {\,at\,} t_1, \ldots , \sim \square _m \varphi _m \mathrm {\,at\,} t_m \Rightarrow \square \varphi \mathrm {\,at\,} t\) is a rule in \(Rul\).

For any argument A, the function \(\mathrm {Conc} (A)\) returns its conclusion, \(\mathrm {Sub} (A)\) all its subarguments, \(\mathrm {DirectSub} (A)\) its direct subarguments, \(\mathrm {Rules} (A)\) all the rules in the argument and, finally, \(\mathrm {TopRule} (A)\) the top defeasible rule in the argument:
$$\begin{aligned} \begin{array}{l} \mathrm {Conc} (A) = \square \varphi \mathrm {\,at\,} t\\ \mathrm {Sub} (A) = \mathrm {Sub} (A_1) \cup \ldots \cup \mathrm {Sub} (A_n) \cup \{A\}\\ \mathrm {DirectSub} (A) = \{A_1, \ldots , A_n \}\\ \mathrm {TopRule} (A) = r\\ \mathrm {Rules} (A) = \mathrm {Rules} (A_1) \cup \ldots \cup \mathrm {Rules} (A_n) \cup \{\mathrm {TopRule} (A)\}\\ \end{array} \end{aligned}$$

An argument without subarguments has the form \(A:\quad {\sim \square _1 \varphi _{1} \mathrm {\,at\,} t_1, \ldots , \sim \square _m \varphi _m \mathrm {\,at\,} t_m} \Rightarrow _r \square \varphi \mathrm {\,at\,} t\), and it is called an assumptive argument. In the remainder, arguments are finite, and any argument bottoms out in assumptive arguments.
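
Arguments in the sense of Definition 4.7 can be built bottom-up by chaining rules whose non-assumption body statements are concluded by already-built arguments. The sketch below reuses the hypothetical Statement/BodyMember/Rule encoding given earlier, identifies an argument by its structure rather than by a separate label, assumes that the theory only generates finitely many arguments, and exposes Conc, DirectSub, Sub and Rules.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Argument:
    rule: "Rule"                  # TopRule(A)
    subs: Tuple["Argument", ...]  # DirectSub(A), one per non-naf premise of the rule

    def conc(self):               # Conc(A)
        return self.rule.head

    def sub(self):                # Sub(A): all subarguments, including A itself
        out = {self}
        for s in self.subs:
            out |= s.sub()
        return out

    def rules(self):              # Rules(A)
        return {self.rule} | {r for s in self.subs for r in s.rules()}

def _support_tuples(premises, args):
    """All tuples of arguments concluding, in order, the given premises."""
    if not premises:
        yield ()
        return
    first, rest = premises[0], premises[1:]
    for a in (x for x in args if x.conc() == first):
        for tail in _support_tuples(rest, args):
            yield (a,) + tail

def build_arguments(rules):
    """Saturate: keep adding arguments whose non-naf premises are all concluded
    by already-built arguments, until no new argument can be constructed."""
    args = set()
    while True:
        snapshot, new = list(args), set()
        for r in rules:
            premises = [m.statement for m in r.body if not m.naf]
            for subs in _support_tuples(premises, snapshot):
                a = Argument(rule=r, subs=subs)
                if a not in args:
                    new.add(a)
        if not new:
            return args
        args |= new
```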

Definition 4.8

(Assumptive argument) An argument A is an assumptive argument if, and only if, its set of direct subarguments is empty, i.e. \(\mathrm {DirectSub} (A)= \emptyset \).

In this minimalist rule-based setting, the conclusion of any assumptive argument is an assumption. Any assumption is thus a statement which is the head of a rule with no antecedent.

Notation 4.4

 
  1. Given a defeasible theory T, the set of assumptions is denoted \({\mathrm {Assum}}(T)\).

  2. The set of assumptive arguments whose conclusions are in a set of assumptions Assum is denoted \({\mathrm {AssumArg}}(Assum)\).

Arguments may conflict and thus attacks between arguments may appear. We consider two types of attacks: rebuttals (clashes of incompatible conclusions) and undercuttings (attacks on negation-as-failure premises). In regard to rebuttals, we assume that there is a preference relation over arguments determining whether two rebutting arguments mutually attack each other or only one of them (being preferred) attacks the other. The preference relation over arguments can be defined in various ways on the basis of the preference over rules. We adopt a simple last-link ordering according to which an argument A is preferred over another argument B, denoted as \(A \succ B\), if, and only if, the rule \(\mathrm {TopRule} (A)\) is superior to the rule \(\mathrm {TopRule} (B)\), i.e. \(\mathrm {TopRule} (A) \succ \mathrm {TopRule} (B)\).

Definition 4.9

(Attack) An argument B attacks an argument A (denoted \(B \leadsto A\)), if, and only if, B rebuts or undercuts A, where
  • B rebuts A (on \(A'\)) if, and only if, \(\exists A' \in \mathrm {Sub} (A)\) such that \(\mathrm {Conc} (B)\) and \(\mathrm {Conc} (A')\) are in conflict, i.e. \(\textit{Conflict} (\mathrm {Conc} (B),\mathrm {Conc} (A'))\), and \(A' \not \succ B\),

  • B undercuts A (on \(A'\)) if, and only if, \(\exists A' \in \mathrm {Sub} (A)\) such that \(\sim \mathrm {Conc} (B)\) belongs to the body of \(\mathrm {TopRule} (A')\), i.e. \((\sim \mathrm {Conc} (B)) \in \mathrm {Body}(\mathrm {TopRule} (A'))\).
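
Continuing the hypothetical encoding above, the attack relation of Definition 4.9 under the last-link ordering can be computed directly from argument structures; `conflict` and `superior` are assumed functions implementing the Conflict relation over statements and the rule superiority relation, respectively.

```python
def attacks(B, A, conflict, superior):
    """Does B attack A in the sense of Definition 4.9?"""
    for A1 in A.sub():
        # Rebuttal: B's conclusion conflicts with A1's conclusion, and A1 is
        # not preferred to B under the last-link ordering (TopRule comparison).
        if conflict(B.conc(), A1.conc()) and not superior(A1.rule, B.rule):
            return True
        # Undercut: B concludes a statement that TopRule(A1) assumes absent (~).
        if any(m.naf and m.statement == B.conc() for m in A1.rule.body):
            return True
    return False

def attack_relation(args, conflict, superior):
    """The attack relation of the argumentation graph over a set of arguments."""
    return {(B, A) for B in args for A in args if attacks(B, A, conflict, superior)}
```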

On the basis of arguments and attacks between arguments, we use the common definition of an argumentation framework [19].

Definition 4.10

(Argumentation graph) An argumentation graph constructed from a defeasible theory T is a tuple \(\left\langle \mathscr {A}, \leadsto \right\rangle \) where \(\mathscr {A}\) is the set of all arguments constructed from T, and \(\leadsto \subseteq \mathscr {A}\times \mathscr {A}\) is a binary relation of attack.

Notation 4.5

Given a graph \(G = \left\langle \mathscr {A}, \leadsto \right\rangle \), the set of arguments \(\mathscr {A}\) may be denoted \(\mathscr {A}_{G}\). Similarly, the set of statements appearing in the rules of the defeasible theory underlying a graph G may be denoted \(\varPhi _G\).

Fig. 2

An argumentation graph, where direct subargument relations are represented by double arrows (these relations are not formally part of an argumentation graph, but it is useful to picture them as we will see soon). Argument \(\textsf {B}\) attacks \(\textsf {C}\), arguments \(\textsf {C}\) and \(\textsf {D}\) attack each other. Arguments \(\textsf {B1}\) and \(\textsf {B2}\) are subarguments of argument \(\textsf {B}\)

An example of an argumentation graph is illustrated in Fig. 2. As exposed later, the relation of subargument is primordial in a probabilistic setting, since the ‘belief’ in an argument necessarily implies the beliefs in its subarguments. For this reason, we now define subtheories and associated legal subgraphs.

Definition 4.11

(Defeasible subtheory) Let T denote a defeasible theory \(\langle Rul, \textit{Conflict}, \succ \rangle \). A defeasible subtheory U of T is a defeasible theory \(\langle Rul', \textit{Conflict}, \succ \rangle \) such that \(Rul' \subseteq Rul\).

Definition 4.12

(Legal subgraph) A legal subgraph H of (or induced by) an argumentation graph \(G = (\mathscr {A}, \leadsto )\) constructed from a defeasible theory T is a graph \((\mathscr {A}_H, \leadsto _H )\) such that:
  • \(\mathscr {A}_H\) is the set of all arguments constructed from a defeasible subtheory U of T, and

  • \(\leadsto _H = \leadsto \cap (\mathscr {A}_H \times \mathscr {A}_H)\).

In other words, H is an induced legal subgraph of G if any argument of H appears with all its subarguments and H has exactly the attacks that appear in G over the same set of arguments (see Fig. 3 for an illustration).
Fig. 3

The graph a is a legal subgraph of the graph in Fig. 2, while b is not

To recap, our argumentation framework is based on defeasible theories featuring defeasible rules, conflicts amongst statements and a superiority relation over these rules to resolve conflicts. Given a defeasible theory, we can build arguments and attacks between arguments, leading to an argumentation graph and its legal subgraphs. Next, we will see how to label any argument with an argumentation status.

4.2 Labelling of arguments

Given an argumentation graph, we can compute sets of acceptable or discarded arguments, i.e. arguments that do or do not survive attacks and counter-attacks. To do that, we label arguments as reviewed in [7], but slightly adapted to our upcoming probabilistic setting. Accordingly, we distinguish three labellings: \(\{{{{\small {\textsf {ON}}}}}, {{\small {\textsf {OFF}}}}\}\)-labelling, \(\{{{\small {\textsf {IN}}}},{{\small {\textsf {OUT}}}},{{\small {\textsf {UND}}}}\}\)-labelling and \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}},{\small {\textsf {OFF}}}\}\)-labelling. In a \(\{{\small {\textsf {ON}}},{\small {\textsf {OFF}}}\}\)-labelling, each argument is associated with one label, which is either \({\small {\textsf {ON}}}\) or \({\small {\textsf {OFF}}}\), to indicate whether an argument is expressed or not (i.e. whether the event of the argument occurs or not). In a \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}\}\)-labelling, each argument is associated with one label which is either \({\small {\textsf {IN}}}\), \({\small {\textsf {OUT}}}\), or \({\small {\textsf {UND}}}\): the label ‘\({\small {\textsf {IN}}}\)’ means the argument is accepted, while the label ‘\({\small {\textsf {OUT}}}\)’ indicates that it is rejected. The label ‘\({\small {\textsf {UND}}}\)’ marks the status of the argument as undecided. The \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}},{\small {\textsf {OFF}}}\}\)-labelling extends a \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}\}\)-labelling with the \({\small {\textsf {OFF}}}\) label to indicate that an argument is omitted, that is, depending on the context, not believed or unexpressed.

Definition 4.13

(Argument labelling) Let G be an argumentation graph, and \(\textsf {ArgLabels}\) a set of labels for arguments. An \(\textsf {ArgLabels}\)-labelling of a set \(\mathscr {A} \subseteq \mathscr {A}_G\) is a total function \(\mathrm {L}: \mathscr {A} \rightarrow \textsf {ArgLabels}\).

For instance, given an argumentation graph G:
  • a \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of G is a total function \(\mathrm {L}: \mathscr {A}_G \rightarrow \{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\);

  • a \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling of G is a total function \(\mathrm {L}: \mathscr {A}_G \rightarrow \{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\);

  • a \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling of G is a total function \(\mathrm {L}: \mathscr {A}_G \rightarrow \{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\).

If the domain of a labelling \(\mathrm {L}\) is the set of arguments of an argumentation graph G, then we say that \(\mathrm {L}\) is a labelling of the argumentation graph G.

Notation 4.6

  1. The set of all possible \(\textsf {ArgLabels}\)-labelling assignments of a set of arguments \(\mathscr {A}\) is denoted as \(\mathscr {L}_{\textsf {ArgLabels}}(\mathscr {A})\), and, given an argumentation graph G, we may write \(\mathscr {L}_{\textsf {ArgLabels}}(G)\) instead of \(\mathscr {L}_{\textsf {ArgLabels}}(\mathscr {A}_G)\).

  2. Given an \(\textsf {ArgLabels}\)-labelling \(\mathrm {L}\), the set of arguments with a label l may be denoted \(l(\mathrm {L})\), i.e. \(l(\mathrm {L}) = \{A \mid \mathrm {L}(A) = l \}\). For example, \({\small {\textsf {ON}}}(\mathrm {L}) = \{A\mid \mathrm {L}(A) = {\small {\textsf {ON}}}\}\).

  3. A \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}\}\)-labelling \(\mathrm {L}\) may be represented as a tuple \(\langle {\small {\textsf {IN}}}(\mathrm {L}), {\small {\textsf {OUT}}}(\mathrm {L}), {\small {\textsf {UND}}}(\mathrm {L}) \rangle \), and a \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling \(\mathrm {L}\) as a tuple \(\langle {\small {\textsf {IN}}}(\mathrm {L}), {\small {\textsf {OUT}}}(\mathrm {L}),{\small {\textsf {UND}}}(\mathrm {L}), {\small {\textsf {OFF}}}(\mathrm {L}) \rangle \).

Generally, not all labellings in \(\mathscr {L}_{\textsf {ArgLabels}}(G)\) are meaningful or have satisfactory properties. An X-\(\textsf {ArgLabels}\)-labelling specification identifies for any argument graph G a subset of \(\mathscr {L}_{\textsf {ArgLabels}}(G)\).

Definition 4.14

(Argument labelling specification) Let G denote an argumentation graph, and \(\textsf {ArgLabels}\) a set of labels. An X-\(\textsf {ArgLabels}\)-labelling specification of a set of arguments \(\mathscr {A} \subseteq \mathscr {A}_G\) identifies a set of \(\textsf {ArgLabels}\)-labellings of \(\mathscr {A}\), denoted as \(\mathscr {L}^X_{\textsf {ArgLabels}}(\mathscr {A})\), such that \(\mathscr {L}^X_{\textsf {ArgLabels}}(\mathscr {A}) \subseteq \mathscr {L}_{\textsf {ArgLabels}}(\mathscr {A})\).

We will focus on specifications of \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labellings and \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings based on complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labellings [7].

Definition 4.15

(Complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling) A complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling of an argumentation graph G is a \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling such that for every argument A in \(\mathscr {A}_G\) it holds that:
  • A is labelled \({\small {\textsf {IN}}}\) if, and only if, all attackers of A are \({\small {\textsf {OUT}}}\),

  • A is labelled \({\small {\textsf {OUT}}}\) if, and only if, A has an attacker \({\small {\textsf {IN}}}\).

Since a labelling is a total function, if an argument is not labelled \({\small {\textsf {IN}}}\) or \({\small {\textsf {OUT}}}\), then it is \({\small {\textsf {UND}}}\).

An argumentation graph may have several complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labellings: we will focus on the unique complete labelling with the smallest set of labels \({\small {\textsf {IN}}}\) (or equivalently with the largest set of labels \({\small {\textsf {UND}}}\)) called the grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling [7, 19].

Definition 4.16

(Grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling) A grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling \(\mathrm {L}\) of an argumentation graph G is a complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling of G such that \({\small {\textsf {IN}}}(\mathrm {L})\) is minimal (w.r.t. set inclusion) amongst all complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labellings of G.

A standard algorithm for generating the grounded labelling of an argumentation graph is given in Algorithm 2 [31]. It begins by labelling \({\small {\textsf {IN}}}\) all arguments that are not attacked or whose attackers are all \({\small {\textsf {OUT}}}\) (line 4), and then it iteratively labels \({\small {\textsf {OUT}}}\) any argument attacked by an argument labelled \({\small {\textsf {IN}}}\) (line 5). The iteration continues until no more arguments can be labelled \({\small {\textsf {IN}}}\) or \({\small {\textsf {OUT}}}\), and the algorithm terminates by labelling \({\small {\textsf {UND}}}\) any argument that remains unlabelled (line 7).
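
For concreteness, the following Python sketch mirrors the iterative computation just described (Algorithm 2 itself is not reproduced here); the representation of the graph as a mapping from each argument to the set of its attackers is an assumption of this illustration.

    def grounded_labelling(attackers):
        """Grounded {IN, OUT, UND}-labelling of an argumentation graph.

        `attackers` maps every argument to the set of its attackers
        (a hypothetical representation of an argumentation graph)."""
        labels = {}
        changed = True
        while changed:
            changed = False
            for arg, atts in attackers.items():
                if arg in labels:
                    continue
                # label IN any argument whose attackers (if any) are all OUT
                if all(labels.get(b) == 'OUT' for b in atts):
                    labels[arg] = 'IN'
                    changed = True
                # label OUT any argument with an attacker labelled IN
                elif any(labels.get(b) == 'IN' for b in atts):
                    labels[arg] = 'OUT'
                    changed = True
        # any argument left unlabelled is undecided
        for arg in attackers:
            labels.setdefault(arg, 'UND')
        return labels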

Moving towards the forthcoming probabilistic setting, these standard labellings can be extended with a label \({\small {\textsf {OFF}}}\) to indicate arguments excluded from reasoning. The idea is to match any legal subgraph with a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling by ‘switching off’ the arguments outside the considered subgraph, and we perform a similar operation to define grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings.

Definition 4.17

(Legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling) Let H denote a legal subgraph of an argumentation graph G. A legal \(\{{\small {\textsf {ON}}},{\small {\textsf {OFF}}}\}\)-labelling of G with respect to H is a \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of G such that
  • every argument in \(\mathscr {A}_H\) is labelled \({\small {\textsf {ON}}}\),

  • every argument in \(\mathscr {A}_G \backslash \mathscr {A}_H\) is labelled \({\small {\textsf {OFF}}}\).

Definition 4.18

(Grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling) Let H denote a legal subgraph of an argumentation graph G. A grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling of G with respect to H is a \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling such that
  • every argument in \(\mathscr {A}_H\) is labelled as in the grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling of H,

  • every argument in \(\mathscr {A}_G \backslash \mathscr {A}_H \) is labelled \({\small {\textsf {OFF}}}\).

See Fig. 4 for an illustration of a \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling. An argumentation graph has a unique grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling, but it has as many grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings as the number of legal subgraphs.
Fig. 4

A grounded \(\{\textsf {IN}, \textsf {OUT}, \textsf {UND}, \textsf {OFF}\}\)-labelling

Algorithm 2 can be slightly adapted to compute a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling given an arbitrary set of arguments labelled \({\small {\textsf {OFF}}}\), see Algorithm 3.
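
As a minimal illustration of this adaptation (Algorithm 3 itself is not reproduced here), the arguments labelled OFF can simply be removed before the grounded computation and re-attached afterwards; the sketch below reuses the grounded_labelling function sketched above.

    def grounded_off_labelling(attackers, off_args):
        """Grounded {IN, OUT, UND, OFF}-labelling given a set of OFF arguments."""
        # restrict the graph to the arguments that are not switched off
        restricted = {a: {b for b in atts if b not in off_args}
                      for a, atts in attackers.items() if a not in off_args}
        labels = grounded_labelling(restricted)
        # re-attach the omitted arguments with the label OFF
        labels.update({a: 'OFF' for a in off_args})
        return labels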

We can note that a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling of an argumentation graph G can be computed in a time that is polynomial in the number of arguments of G.

Lemma 4.1

(Time complexity of a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling) The time complexity of computing a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling of an argumentation graph G is \(O(|\mathscr {A}_G|^c)\) for some constant c.

To recap, given an argumentation graph, we can label arguments following a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling and its grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling counterpart, using a polynomial algorithm. Next, we look at the labelling of statements.

4.3 Labelling of statements

Up to this point, we have worked with the labelling of arguments with no consideration for the labelling of statements, and in particular no consideration for the labelling of the conclusions supported by arguments.

Labellings of statements can be performed in different manners, see e.g. [8]. From an abstract point of view, given a set of statements, a labelling of this set is a total function associating any statement with a label.

Definition 4.19

(Statement labelling) Let \(\varPhi \) be a set of statements, and \(\textsf {LitLabels}\) a set of labels on statements. A \(\textsf {LitLabels}\)-labelling of \(\varPhi \) is a total function \(K: \varPhi \rightarrow \textsf {LitLabels}\).

Per se, a labelling of literals is just a function mapping a set of statements to a set of labels, but such a labelling may rely on an acceptance labelling of arguments. For our purposes, we will use acceptance statement labellings [8]: given an argumentation graph built from a defeasible theory, a statement labelling is built with respect to any argument labelling taken from a specified set of argument labellings of the graph.

Definition 4.20

(Acceptance statement labelling) Let
  • G denote an argumentation graph built from a defeasible theory,

  • \(\mathscr {L}=\mathscr {L}^{X}_{\textsf {ArgLabels}}(G)\) the set of X-\(\textsf {ArgLabels}\)-labellings of G,

  • \(\varPhi \) a set of statements, and

  • \(\textsf {LitLabels}\) a set of labels on statements.

A \(\textsf {LitLabels}\)-labelling of \(\varPhi \) and from \(\mathscr {L}\) is a total function \(\mathrm {K}: \mathscr {L}, \varPhi \rightarrow \textsf {LitLabels}\).

For example, if we write \(\mathrm {K}(\mathrm {L},\varphi ) = \textsf {in}\), then it means that, given the argument labelling \(\mathrm {L}\), the statement \(\varphi \) is labelled \(\textsf {in}\).

Various acceptance statement labellings can be specified [8]. We will focus on the simplest labelling that serves our purposes, the bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling, according to which a statement is either accepted or not, without further sophistication. If a statement is accepted then it is labelled ‘\(\textsf {in}\)’, otherwise it is labelled ‘\(\textsf {no}\)’.

Definition 4.21

(Bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling of statements) Let
  • G denote an argumentation graph built from a defeasible theory,

  • \(\mathscr {L}=\mathscr {L}^{\textsf {grounded}}_{\{\textsf {IN}, \textsf {OUT}, \textsf {UND}, \textsf {OFF} \}}(G)\) the set of grounded \(\{ {\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings of G, and

  • \(\varPhi \) a set of statements.

A bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling of \(\varPhi \) and from \(\mathscr {L}\) is a total function \(\mathrm {K}: {\mathscr {L}}, \varPhi \rightarrow \{\textsf {in}, \textsf {no}\}\), such that, for any statement \(\varphi \in \varPhi \) and any labelling \(\mathrm {L} \in \mathscr {L}\), \(\varphi \) is labelled \(\textsf {in}\), i.e. \(\mathrm {K}(\mathrm {L}, \varphi ) = \textsf {in}\), if and only if \(\exists A \in {\small {\textsf {IN}}}(\mathrm {L}) : \mathrm {Conc} (A) = \varphi \).

Referring to Definition 4.21, given an argument labelling \(\mathrm {L}\), a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling assignment \(\mathrm {K}(\mathrm {L}, \cdot )\) can be straightforwardly associated with a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling \(\mathrm {K}'\) of \(\varPhi \) such that, for every \(\phi \in \varPhi \), \(\mathrm {K}(\mathrm {L}, \phi ) = \mathrm {K}'(\phi )\). Accordingly, such an assignment may be called a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling of \(\varPhi \) and from \(\mathrm {L}\), or simply a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling, leaving the argument labelling \(\mathrm {L}\) implicit.

An algorithm for computing the bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling of a set of statements and from a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling is presented in Algorithm 4. We can note that, given a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling of an argumentation graph G, the bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling of a set of statements \(\varPhi \) can be computed in a time that is linear in the number of arguments of G times the number of statements in \(\varPhi \) (more efficient algorithms may exist).
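
A minimal sketch of this computation follows (Algorithm 4 itself is not reproduced here); the mapping named conclusion, from arguments to their top conclusions, is an assumed representation.

    def bivalent_labelling(arg_labels, conclusion, statements):
        """Bivalent {in, no}-labelling of `statements` from an argument labelling.

        `arg_labels` maps arguments to IN/OUT/UND/OFF labels, and
        `conclusion` maps every argument to its top conclusion."""
        accepted = {conclusion[a] for a, label in arg_labels.items() if label == 'IN'}
        return {phi: ('in' if phi in accepted else 'no') for phi in statements}

Collecting the accepted conclusions in a set first makes this sketch linear in the number of arguments plus the number of statements, in line with the remark that more efficient algorithms may exist.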

Lemma 4.2

(Time complexity of a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling) Given a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling \(\mathrm {L}\) and a finite set of statements \(\varPhi \), the time complexity of computing the bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling assignment \(\mathrm {K}(\mathrm {L}, \phi )\) for every \(\phi \in \varPhi \) is \(O(|\varPhi | \times |{\small {\textsf {IN}}}(\mathrm {L})|)\).

We can adapt Definition 4.21 so that \(\mathscr {L}=\mathscr {L}^{\textsf {legal}}_{ {\{\textsf {ON}, \textsf {OFF} \}} }(G)\), i.e. \(\mathscr {L}\) is the set of legal \(\{ {\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings of the argumentation graph G. In this case, we can straightforwardly consider the corresponding set of grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings, so that, with a slight notational shortcut, we have the bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling \(\mathrm {K}(\mathrm {L}, \varPhi )\) where \(\mathrm {L}\) is a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling.

Finally, since we considered atemporalised statements, we also consider atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labellings.

Definition 4.22

(Atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling) Let \(\varPhi ^t\) denote a set of statements holding at time t, and \(\varPhi \) its set of atemporalised statements. Given a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling \(K^t\) of \(\varPhi ^t\), its atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling is a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling K of \(\varPhi \) such that, for any modal literal \(\square \varphi \in \varPhi \), \(\square \varphi \) is labelled \(\textsf {in}\), i.e. \(\mathrm {K}(\square \varphi ) = \textsf {in}\), if and only if \(\mathrm {K}^t(\square \varphi \mathrm {\,at\,} t) = \textsf {in}\).

In summary, given a labelling of arguments (and in particular a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling), a set of statements can be labelled according to a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling, with an efficient algorithm. And from a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling, we can obtain its atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling.

5 Probabilistic argumentation setting

In this section, we endorse probabilistic labellings of arguments as developed in [44] (Sect. 5.1), compact representations (Sect. 5.2) and ‘memoryless’ properties to deal with the complexity of the temporal setting (Sect. 5.3).

5.1 Probabilistic labellings

Having laid out an approach encompassing legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings as well as grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labellings and grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings, we are now ready to introduce the treatment of probabilistic uncertainty as proposed in the approach of probabilistic labellings [44]. Given an X-\(\textsf {ArgLabels}\)-labelling specification for an argumentation graph G, a probability value is assigned to each element of the set \(\mathscr {L}^{X}_{\textsf {ArgLabels}}(G)\) of labellings, which represents the sample space. Intuitively, each labelling is viewed as a possible outcome, having a certain probability to occur.

Given an X-\(\textsf {ArgLabels}\)-labelling specification for an argumentation graph G, there is a choice between defining the sample space as either (i) the set \(\mathscr {L}^{X}_{\textsf {ArgLabels}}(G)\) of labellings of G, or (ii) the set \(\smash {{\mathscr {L}_{\textsf {ArgLabels}}(G)}}\) (where the probability of any labelling in \(\smash {{\mathscr {L}_{\textsf {ArgLabels}}(G)} \backslash \mathscr {L}^{X}_{\textsf {ArgLabels}}(G)}\) is set to 0). In practice this distinction does not really matter, but the second definition of the sample space better fits the use of factors (as we will conceive them soon), which is why it is favoured here.

Notation 5.1

Whilst our notational nomenclature refers to X-\(\textsf {ArgLabels}\)-labellings (e.g. one may refer to \(\textsf {legal}\)-\(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings, or \(\textsf {grounded}\)-\(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings and so on) where X and \(\textsf {ArgLabels}\) refer to an X-\(\textsf {ArgLabels}\)-labelling specification, for the sake of notational conciseness we will sometimes speak of \(\mathscr {S}\)-labellings, using a single symbol \(\mathscr {S}\) to synthesise the pair of symbols X-\(\textsf {ArgLabels}\), i.e. \(\mathscr {S} = X\)-\(\textsf {ArgLabels}\).

Definition 5.1

(Probabilistic labelling frame) A probabilistic labelling frame is a tuple \(\langle G, \mathscr {S}, \langle \varOmega ,F,P \rangle \rangle \) where G is an argumentation graph, \(\mathscr {S}\) denotes an X-\(\textsf {ArgLabels}\)-labelling specification, and \(\langle \varOmega , F, P \rangle \) is a probability space such that:
  • the sample space \(\varOmega \) is the set of labellings of G, \(\varOmega = {\mathscr {L}_{\textsf {ArgLabels}}(G)}\),

  • the \(\sigma \)-algebra F is the power set of \(\varOmega \),

  • the function P from the \(\sigma \)-algebra F to [0, 1] is a probability distribution satisfying the Kolmogorov axioms, such that for any labelling \(\mathrm {L}\) not in the set \(\mathscr {L}^{X}_{\textsf {ArgLabels}}(G)\) of labellings of G, the probability of \(\mathrm {L}\) is 0, i.e. \(\smash {\forall \mathrm {L}\in \varOmega \backslash \mathscr {L}^{X}_{\textsf {ArgLabels}}(G),~P(\{\mathrm {L}\})=0 }\).

Fig. 5

Illustration of a probabilistic argumentation frame, based on the argumentation graph given in Fig. 2, and with respect to the legal \(\{\textsf {ON}, \textsf {OFF}\}\)-labelling (left) and the grounded counterpart (right). The probability distribution is here arbitrary

Since any legal \(\{{\small {\textsf {ON}}},{\small {\textsf {OFF}}}\}\)-labelling can be trivially mapped to one and only one grounded \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling (as we can visualise in Fig. 5), we can straightforwardly map a probabilistic labelling frame \(\langle G, \mathrm {legal}\text{- }\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}, \langle \varOmega ,F,P \rangle \rangle \) into a probabilistic labelling frame \(\langle G, \mathrm {grounded}\text{- }\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}, \langle \varOmega ', F', P' \rangle \rangle \): let \(\mathrm {L}\) denote a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling in \(\varOmega \), and let \(\mathrm {L}'\) denote its grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling counterpart in \(\varOmega '\), such that \(\mathrm {L}(A) = {\small {\textsf {ON}}}\) if, and only if, \(\mathrm {L}'(A) = {\small {\textsf {IN}}}\) or \(\mathrm {L}'(A) = {\small {\textsf {OUT}}}\) or \(\mathrm {L}'(A) = {\small {\textsf {UND}}}\); we may state \(P(\mathrm {L}) = P'(\mathrm {L}')\) so that the probability of a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling equals the probability of its legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling counterpart. Therefore, using this mapping, we may compute the probability of a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling by computing the probability of a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling, and vice versa. In the rest of the paper, we will focus on legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings and grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings for our probabilistic setting, and we will often use the above-mentioned mapping, leaving the labelling specification \(\mathscr {S}\) of a probabilistic labelling frame \(\langle G, \mathscr {S}, \langle \varOmega ,F,P \rangle \rangle \) possibly holding for \(\textsf {legal}\)-\(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\) or \(\textsf {grounded}\)-\(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\).

As we have defined our probability space, we can work with random variables, i.e. functions (traditionally denoted by an upper-case letter such as X, Y or Z) from the sample space \(\varOmega \) into another set of elements. Accordingly, for every argument A, we introduce a categorical random variable, called a random labelling and denoted \(L_A\), from \(\varOmega \) into a set of labels, presumably \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\) or \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\). So the event \(L_A = {\small {\textsf {ON}}}\) is a shorthand for the set of outcomes \(\{ \mathrm {L}\in \varOmega \mid \mathrm {L}(A) = {\small {\textsf {ON}}}\}\), or \(\{ \mathrm {L}\in \varOmega \mid \mathrm {L}(A) = {\small {\textsf {IN}}} \text{ or } \mathrm {L}(A) = {\small {\textsf {OUT}}} \text{ or } \mathrm {L}(A) = {\small {\textsf {UND}}}\}\) in the case of \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings.

We also introduce random variables for the labelling of statements. Accordingly, for any statement \(\varphi \), we introduce a categorical random variable which is denoted \(K_\varphi \) and which can take value in the set \(\textsf {LitLabels}\) of labels of a specified \(\textsf {LitLabels}\)-labelling of statements. These random variables for statements are also called random labellings.

Notation 5.2

  1. When the context does not give rise to any ambiguity, we will abbreviate the random labelling \(L_A\) by simply A (in italics).

  2. We denote by Val(X) the set of values that a random labelling X can take. For example, Val\((L_A) = \{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\).

  3. We use upper boldface type to denote sets of random labellings. So \(\mathbf L \) denotes a set of random labellings \(\{L_{A1}, \ldots , L_{An}\}\), and \(\mathbf {K}\) denotes a set of random labellings \(\{K_{\varphi 1}, \ldots , K_{\varphi n}\}\).

  4. We use lower boldface type to denote assignments to a set of random labellings, i.e. assignments of values to the variables in this set. So given \(\mathbf {L} = \{L_{A_1}, \ldots , L_{A_n}\}\), a possible assignment is \(\mathbf {l} = \{L_{A_1} = {\small {\textsf {ON}}}, \ldots , L_{A_n}= {\small {\textsf {OFF}}}\}\).

  5. A labelling assignment \(\mathbf {l} = \{L_{A_1} = l_1, \ldots , L_{A_n} = l_n\}\) may be used to denote the assignment corresponding to a labelling \(\mathrm {L}\) (and vice versa), such that \(L_{A_i} = l_i\) if, and only if, \(\mathrm {L}(A_i) = l_i\). The same shortcut applies to the labelling of statements.

  6. The joint distribution over a set \(\mathbf L = \{L_{A1}, L_{A2}, \ldots , L_{An} \}\) of random labellings is formally denoted \(P(\{L_{A1}, L_{A2}, \ldots , L_{An} \})\), but we will often write it \(P(L_{A1}, L_{A2}, \ldots , L_{An})\).

Example 5.1

Referring to the probabilistic labelling frame illustrated in Fig. 5, we can assert \(P(\{ L_{\textsf {B1}} = {\small {\textsf {IN}}}, L_{\textsf {B2}} = {\small {\textsf {IN}}}, L_{\textsf {B}} = {\small {\textsf {IN}}}\} ) = 1/4 \). As an alternative notation, given the assignment \(\mathbf {l} = \{ L_{\textsf {B1}} = {\small {\textsf {IN}}}, L_{\textsf {B2}} = {\small {\textsf {IN}}}, L_{\textsf {B}} = {\small {\textsf {IN}}}\}\), we have \(P(\mathbf {l}) = 1/4 \). \(\square \)

Ultimately, we are interested in the probability of statements’ statuses. In that regard, the marginal probability of a statement labelled k is the sum of the probabilities of the labellings in the sample space in which the statement is labelled as such:
$$\begin{aligned} P(K_\varphi = k) = \sum _{\mathrm {L}\in \varOmega : \mathrm {K}(\mathrm {L}, \varphi )= k} P(\mathrm {L}). \end{aligned}$$
(5.1)
Of course, since the number of labellings in a sample space is exponential in the number of arguments, computing the probability in Eq. 5.1 directly is not efficient, and we do not employ it in the remainder. To address this computational complexity, and as a workaround, we can use Monte-Carlo methods to estimate probabilities. To do so, we may conceive a binary random variable for every possible status of a statement \(\varphi \) (we denote such a random variable for instance \(K^{\textsf {on}}_\varphi \), \(K^{\textsf {in}}_\varphi \), etc.), such that the set of values of these variables is 0 or 1, i.e. Val\(\smash {(K^{k}_\varphi ) = \{0, 1 \}}\). Given a labelling \(\mathrm {L}\) of arguments, if the statement \(\varphi \) is labelled k in \(\mathrm {L}\), i.e. if \(\mathrm {K}(\mathrm {L}, \varphi )= k\), then \(K^{k}_\varphi = 1\), otherwise \(K^{k}_\varphi = 0\). Then the crudest method to compute an estimate \(\smash {\hat{P}(K_{\varphi } = k)}\) of the probability of the labelling of a statement \(\varphi \) is to sample N labellings \(\{ \mathrm {L}_1, \ldots , \mathrm {L}_N\}\) and, for each labelling, to compute its corresponding \(\textsf {LitLabels}\)-labelling:
$$\begin{aligned} \hat{P}(K_{\varphi } = k) = \frac{1}{N}\cdot \sum ^{N}_{i=1} K^{k}_{\varphi , i}. \end{aligned}$$
(5.2)
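
The estimator of Eq. 5.2 can be sketched as follows; the sampler sample_labelling, assumed to draw one grounded {IN, OUT, UND, OFF}-labelling per call from the probabilistic labelling frame, is a placeholder, and bivalent_labelling is the function sketched in Sect. 4.3.

    def estimate_statement_probability(sample_labelling, conclusion, phi, k, n_samples=1000):
        """Crude Monte-Carlo estimate of P(K_phi = k), cf. Eq. 5.2."""
        hits = 0
        for _ in range(n_samples):
            arg_labels = sample_labelling()                    # one sampled labelling L_i
            k_i = bivalent_labelling(arg_labels, conclusion, [phi])[phi]
            hits += 1 if k_i == k else 0                       # the indicator K^k_{phi, i}
        return hits / n_samples
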
Unfortunately, even such a simple Monte-Carlo method may be hindered by the space complexity of sample spaces in our probabilistic argumentation setting. For this reason, we lay out compact representations and memoryless settings for probabilistic argumentation, as proposed next.

5.2 Compact representation

In our probabilistic settings for argumentation, instead of specifying the probability of every possible labelling, we can resort to compact representations of the sample space; we will do so by considering factors, which allow us to break down a joint probability into a product of manageable parts.

So, given a set \(\mathbf L \) of random labellings, we employ (positive) factors (see [29]): a factor is a function, denoted \(\phi \), from the set of possible assignments Val\((\mathbf L )\) to the positive real numbers \(\mathbb {R}^+\). The set \(\mathbf L \) is called the scope of the factor \(\phi \). On this basis, we can write the joint distribution of random labellings as a Gibbs distribution parametrised by a set of factors.

Definition 5.2

(Gibbs distribution) A distribution P is a Gibbs distribution parametrised by a set of factors \(\varPhi = \{ \phi _1(\mathbf L _1), \ldots , \phi _n(\mathbf L _n) \}\) if
$$\begin{aligned} P(L_{A1}, L_{A2}, \ldots , L_{An}) = \frac{\prod _i \phi _i(\mathbf L _i)}{Z_{\varPhi }} \end{aligned}$$
where \(Z_{\varPhi }\) is the normalising constant of the distribution:
$$\begin{aligned} Z_{\varPhi } = \sum _{ L_{A1}, \ldots , L_{An} } \prod _i \phi _i(\mathbf L _i). \end{aligned}$$

Example 5.2

Referring to the graph given in Fig.  2, we may (arbitrarily) decompose the joint distribution of random labellings into two factors:
$$\begin{aligned} P(L_\textsf {B1}, L_\textsf {B2}, L_\textsf {B}, L_\textsf {C}, L_\textsf {D} ) = \phi _1(L_\textsf {B1}, L_\textsf {B2}, L_\textsf {B}) \cdot \phi _2(L_\textsf {C}, L_\textsf {D} ). \end{aligned}$$
(5.3)
The factors can be represented as the tables given in Fig. 6. From the given factors, we can compute the joint probability of any \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of the argumentation graph. For example, let us look at the probability of the \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling where all the arguments are labelled \({\small {\textsf {ON}}}\). Since \(\phi _1(L_\textsf {B1} = {\small {\textsf {ON}}}, L_\textsf {B2} = {\small {\textsf {ON}}}, L_\textsf {B} = {\small {\textsf {ON}}}) = 1\) and \(\phi _2(L_\textsf {C} = {\small {\textsf {ON}}}, L_\textsf {D} ={\small {\textsf {ON}}}) = 100\), we can determine the following joint probability (amongst others):
$$\begin{aligned} P(L_\textsf {B1} = {\small {\textsf {ON}}}, L_\textsf {B2} = {\small {\textsf {ON}}}, L_\textsf {B} = {\small {\textsf {ON}}}, L_\textsf {C} = {\small {\textsf {ON}}}, L_\textsf {D} ={\small {\textsf {ON}}}) \approx 0. \end{aligned}$$
(5.4)
\(\square \)
Fig. 6

Tables corresponding to the factors \(\phi _1(L_\textsf {B1}, L_\textsf {B2}, L_\textsf {B})\) and \(\phi _2(L_\textsf {C}, L_\textsf {D} )\)

We interpret the Gibbs distribution in a log-linear model, and thus we can write factors under an exponential form:
$$\begin{aligned} \phi _i(\mathbf L _i) = e^{ - E_i(\mathbf L _i) } \end{aligned}$$
(5.5)
where \(E_i(\mathbf L _i)\) is called the energy function of the set \(\mathbf L _i\) of random labellings. A possible justification of such an energy-based model in argumentation is given in [45]. Using this exponential form of factors, we can rewrite a Gibbs distribution as follows:
$$\begin{aligned} P(L_{A1}, L_{A2}, \ldots , L_{An}) = \frac{ e^{ - \sum _i E_i(\mathbf L _i) } }{Z_{\varPhi }}. \end{aligned}$$
(5.6)
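
The following sketch makes the energy-based factorisation of Eqs. 5.5 and 5.6 concrete by enumerating all {ON, OFF}-assignments; it is only feasible for small graphs, and the energy functions passed to it are hypothetical placeholders that would in practice be supplied or learnt.

    import itertools
    import math

    def gibbs_distribution(variables, factors):
        """Joint distribution over {ON, OFF}-assignments of `variables`.

        `factors` is a list of (scope, energy) pairs, where `scope` is a tuple of
        variable names and `energy` maps an assignment of the scope to a real."""
        weights = {}
        for values in itertools.product(['ON', 'OFF'], repeat=len(variables)):
            assignment = dict(zip(variables, values))
            energy = sum(E(tuple(assignment[v] for v in scope)) for scope, E in factors)
            weights[values] = math.exp(-energy)    # unnormalised product of factors
        Z = sum(weights.values())                  # partition function Z_Phi
        return {values: w / Z for values, w in weights.items()}
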
The energies might be given by a human operator, or they can be learnt via some machine learning techniques. In the next section, we propose that the energy of some factors can be learnt by RL.

Factors along with an energy-based model are useful to provide a compact representation of probability distributions. Yet another, more intuitive, compact representation consists in attaching a probability to each rule. The basic idea is that the probability of labelling \({\small {\textsf {ON}}}\) an argument \(A_1, \ldots , A_n, \sim \square _1 \varphi _{1} \mathrm {\,at\,} t_1, \ldots , \sim \square _m \varphi _m \mathrm {\,at\,} t_m \Rightarrow _r \square \varphi \mathrm {\,at\,} t\), given that the subarguments \(A_1, \ldots , A_n\) are labelled \({\small {\textsf {ON}}}\), corresponds to the probability of the ‘event’ of the rule r. This event can be captured by the random variable \(R_r: \varOmega \rightarrow \{0,1 \}\) such that:
  • \(R_r(\mathrm {L}) =1\) if, and only if, the rule r is included in the set of rules of any argument A labelled \({\small {\textsf {ON}}}\) (or \({\small {\textsf {IN}}}\) or \({\small {\textsf {OUT}}}\) or \({\small {\textsf {UND}}}\)) in labelling \(\mathrm {L}\),

  • \(R_r(\mathrm {L}) = 0\) otherwise.

In this view, it may appear intuitive and practical to specify the probability \(P(R_r = 1) = p\) next to the label of the rule r, leading us to the definition of probabilistic defeasible rules.

Definition 5.3

(Probabilistic defeasible rule) A probabilistic defeasible rule over a set of statements \(\varPhi \) has the form:
$$\begin{aligned} r, p :\ \square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots , \square _n \varphi _n \mathrm {\,at\,} t_n, \sim \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_1, \ldots , \sim \square '_m \varphi '_m \mathrm {\,at\,} t'_m \Rightarrow \square \varphi \mathrm {\,at\,} t \end{aligned}$$
such that
  • p is a real number in [0, 1] (the marginal probability of the rule);

  • \( r:\ \square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots , \square _n \varphi _n \mathrm {\,at\,} t_n, \sim \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_1, \ldots , \sim \square '_m \varphi '_m \mathrm {\,at\,} t'_m \Rightarrow \square \varphi \mathrm {\,at\,} \, t \) is a defeasible rule over \(\varPhi \).

Definition 5.4

(Probabilistic defeasible theory) A probabilistic defeasible theory is a tuple \(\langle Rul, \textit{Conflict}, \succ \rangle \) where \(Rul\) is a set of probabilistic defeasible rules, \(\textit{Conflict} \) is a conflict relation, and \(\succ \) is a superiority relation over defeasible rules.

A probabilistic defeasible theory is just a slight extension of a defeasible theory (Definition 4.6) in which every rule is given with its marginal probability. From a probabilistic defeasible theory, we can thus straightforwardly obtain the defeasible theory where the probabilities on rules are omitted, and from which we can build an argumentation graph and thus a probabilistic argumentation frame. Then, the probabilities on rules are simply constraints on the (Gibbs) probability distribution P of the frame, see [44].

Concerning computational complexity, a naive approach to draw a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling is to make a table recording all the possible legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings, then compute the probability of every \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling (through the Gibbs–Boltzmann distribution), and finally draw a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling from this distribution. Unfortunately, this approach is not efficient because the number of legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings of an argumentation graph G with \(n= |\mathscr {A}_G|\) arguments is \(2^n\) in the worst case. Probabilistic rules and probabilistic defeasible theories provide a more efficient means to draw legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings.

Lemma 5.1

(Time complexity of drawing a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling) Let \(T =\langle Rul, \textit{Conflict}, \succ \rangle \) denote a probabilistic defeasible theory. The time complexity of drawing a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of the argumentation graph G constructed from T is \(O( |Rul|\cdot |\mathscr {A}_G|^2 )\).

Proof

Firstly, the complexity of generating a random number is O(1), thus the complexity of drawing \(m \le |Rul|\) defeasible rules is \(O(|Rul|)\). Secondly, suppose that the argumentation graph G of the defeasible theory T is built ex ante, along with a mapping from rules to arguments associating every rule r with the set of arguments \(\mathscr {A}_r\) necessarily built with r. For every rule r, if rule r was not drawn, then every argument in \(\mathscr {A}_r\) is labelled \({\small {\textsf {OFF}}}\). Deleting the arguments labelled \({\small {\textsf {OFF}}}\) from \(\mathscr {A}_G\) is \(O(|Rul|\cdot |\mathscr {A}_G|\cdot |\mathscr {A}_G|)\). Every argument which is not labelled \({\small {\textsf {OFF}}}\) is labelled \({\small {\textsf {ON}}}\). Therefore, the time complexity of drawing a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of the argumentation graph G constructed from T is \(O(|Rul|\cdot |\mathscr {A}_G|^2)\). \(\square \)
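
The sampling procedure described in this proof can be sketched as follows; the mapping from rules to the arguments built with them is assumed to be precomputed ex ante, and, for simplicity, the rules are drawn independently here, whereas the frame may impose further constraints on their joint distribution.

    import random

    def draw_legal_on_off_labelling(rule_probs, args_of_rule, all_arguments):
        """Draw a legal {ON, OFF}-labelling from a probabilistic defeasible theory.

        `rule_probs` maps every rule to its marginal probability p, and
        `args_of_rule` maps every rule to the set of arguments built with it."""
        # draw which rules occur
        drawn = {r for r, p in rule_probs.items() if random.random() < p}
        # any argument built with an undrawn rule is switched off; since an argument
        # contains the rules of all its subarguments, subargument closure is preserved
        off = set()
        for r, args in args_of_rule.items():
            if r not in drawn:
                off.update(args)
        return {a: ('OFF' if a in off else 'ON') for a in all_arguments}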

So, factors (along with energy-based models) and probabilistic rules allow us to obtain compact representations of probabilistic frames. In practice, we can specify a probabilistic argumentation frame from a probabilistic defeasible theory; however, such a theory will often be left implicit in the remainder. We later use these constructs to account for MDPs and learning agents. We will restrict ourselves to a memoryless account of MDPs, and in this regard, we next endorse a ‘memoryless’ account of argumentation.

5.3 Memoryless argumentation

Let us recap. Given a defeasible theory (Definition 4.6), we first construct an argumentation graph by building arguments (Definition 4.7) and the attacks between them (Definition 4.9). This argumentation graph is used in a probabilistic labelling frame (Definition 5.1), and to deal with joint probabilities in this probabilistic labelling frame, we factorise arguments into several ‘groups’ scoped by factors so that we can write the joint distribution of random labellings as a Gibbs distribution (Definition 5.2). Such a distribution can have an exponential form (so that, as we will see soon, energies can be shaped by RL), but it can also account for probabilistic defeasible rules and probabilistic defeasible theories (Definition 5.4).

However, a major issue with the framework so far is that argumentation graphs may grow very large, especially in a temporal setting. Let us illustrate this point with the following example.

Example 5.3

Suppose the following rules:
$$\begin{aligned} \begin{array}{l@{\qquad }l} \textsf {r}^0_1: \quad \quad \Rightarrow \textsf {Hold}\mathrm{a} \mathrm {\,at\,} 0 & \textsf {r}^{t+1}_2: \textsf {Hold}\mathrm{a} \mathrm {\,at\,} t \Rightarrow \textsf {Hold}\mathrm{a} \mathrm {\,at\,} t+1\\ \textsf {r}^{t+1}_3:\textsf {Hold}\mathrm{a} \mathrm {\,at\,} t \Rightarrow \textsf {Hold}\mathrm{b} \mathrm {\,at\,} t+1 & \textsf {r}^{t+1}_4: \textsf {Hold}\mathrm{b} \mathrm {\,at\,} t \Rightarrow \textsf {Hold}\mathrm{a} \mathrm {\,at\,} t+1 \end{array} \end{aligned}$$
We can build the following argument heading at time 0,
$$\begin{aligned} \mathrm {A0}: \quad \Rightarrow _{\textsf {r}^0_1} \textsf {Hold}\mathrm{a}\mathrm {\,at\,} 0 \end{aligned}$$
And at time 1,
$$\begin{aligned}&{\mathrm {AA1}:} \quad \mathrm {A0} \Rightarrow _{\textsf {r}^1_2} \textsf {Hold}\mathrm{a}\mathrm {\,at\,} 1\\&{\mathrm {AB1}:} \quad \mathrm {A0} \Rightarrow _{\textsf {r}^1_3} \textsf {Hold}\mathrm{b}\mathrm {\,at\,} 1 \end{aligned}$$
At time 2,
$$\begin{aligned}&{\mathrm {AAA2}:} \quad {\mathrm {AA1}} \Rightarrow _{\textsf {r}^2_2} \textsf {Hold}\mathrm{a}\mathrm {\,at\,} 2\\&{\mathrm {AAB2}:} \quad {\mathrm {AA1}} \Rightarrow _{\textsf {r}^2_3} \textsf {Hold}\mathrm{b}\mathrm {\,at\,} 2\\&{\mathrm {ABA2}:} \quad {\mathrm {AB1}} \Rightarrow _{\textsf {r}^2_4} \textsf {Hold}\mathrm{a}\mathrm {\,at\,} 2 \end{aligned}$$
At time 3,
$$\begin{aligned}&{\mathrm {AAAA3}:} \quad {\mathrm {AAA2}} \Rightarrow _{\textsf {r}^3_2} \textsf {Hold}\mathrm{a}\mathrm {\,at\,} 3\\&{\mathrm {AAAB3}:} \quad {\mathrm {AAA2}} \Rightarrow _{\textsf {r}^3_3} \textsf {Hold}\mathrm{b}\mathrm {\,at\,} 3\\&{\mathrm {AABA3}:} \quad {\mathrm {AAB2}} \Rightarrow _{\textsf {r}^3_4} \textsf {Hold}\mathrm{a}\mathrm {\,at\,} 3\\&{\mathrm {ABAA3}:} \quad {\mathrm {ABA2}} \Rightarrow _{\textsf {r}^3_2} \textsf {Hold}\mathrm{a}\mathrm {\,at\,} 3\\&{\mathrm {ABAB3}:} \quad {\mathrm {ABA2}} \Rightarrow _{\textsf {r}^3_3} \textsf {Hold}\mathrm{b}\mathrm {\,at\,} 3 \end{aligned}$$
And so on. We can see that the number of arguments heading at time t grows with t. By contrast, the number of statements holding at t remains stable because multiple arguments have the same top conclusion. \(\square \)

To address the size of argumentation graphs in a temporal setting, we take advantage of the memoryless property in MDPs (see Sect. 3.1). In that regard, we differentiate a logic-based memoryless setting and a probabilistic-based memoryless setting.

With regard to the logic-based memoryless setting, we further specify the rules that we will retain in the remainder. Recall that, in Definition 4.3, all rules were defined to have the following form:
$$\begin{aligned} r :\square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots , \square _n \varphi _n \mathrm {\,at\,} t_n, \sim \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_{1}, \ldots , \sim \square '_m \varphi '_{m} \mathrm {\,at\,} t'_{m} \Rightarrow \square \varphi \mathrm {\,at\,} t \end{aligned}$$
with no constraints on the timestamps \(t_1,\ldots , t_n, t'_{1}, \ldots , t'_{m}\) and t. From now on, we require that rules determine statements holding at time t from statements holding at time t or \(t -\varDelta \), i.e. the timestamps of a rule are such that \(t_i = t - \varDelta \) or \(t_i = t\), and \(t'_i = t- \varDelta \) or \(t'_i = t\).
With regard to the probabilistic-based memoryless setting, we distinguish two different memoryless settings. In the first setting, the labelling of the arguments whose top conclusions hold at t only depends on the labelling of the arguments whose top conclusions hold at \(t-\varDelta \). Formally:
$$\begin{aligned} P(\mathbf {L}^t \mid \mathbf {L}^{t-\varDelta } \cup \ldots \cup \mathbf {L}^0 ) = P(\mathbf {L}^t \mid \mathbf {L}^{t-\varDelta } ), \end{aligned}$$
(5.7)
where \(\mathbf {L}^t\) is the set of random labellings of arguments whose top conclusions hold at t. An alternative setting is to let the labelling of conclusions holding at t only depend on the labelling of conclusions holding at \(t-\varDelta \):
$$\begin{aligned} P(\mathbf K ^t \mid \mathbf K ^{t-\varDelta } \cup \ldots \cup \mathbf K ^0 ) = P(\mathbf K ^t \mid \mathbf K ^{t-\varDelta } ). \end{aligned}$$
(5.8)
In the remainder of this paper, we adopt the logic-based memoryless setting in combination with the second probabilistic-based memoryless setting (Eq. 5.8), because the number of arguments is generally larger than the number of conclusions (as Example 5.3 illustrated), which may make the first probabilistic-based memoryless setting too demanding in terms of computational complexity.

Both the logic-based and probabilistic-based memoryless settings introduced above establish a Markovian model of transitions between labellings. We will use labellings of statements to represent states and, furthermore, use our PA framework to represent the MDPs and build RL algorithms. The details of our argumentation-based MDP and RL framework are presented in the next section. In the rest of the section, and as a preparation of our argument-based MDP setting, we first deal with the representation of states with labellings, and then present how we can model the (Markovian style) transition between states by using the memoryless setting we introduced above.

A state is a labelling of statements, depending on the labelling specification.

Definition 5.5

(\(\textsf {LitLabels}\)-state) Let \(\varPhi ^t\) be a set of statements holding at time t. A \(\textsf {LitLabels}\)-state at time t with respect to \(\varPhi ^t\) is a \(\textsf {LitLabels}\)-labelling of \(\varPhi ^t\).

For our purposes, only \(\{\textsf {in}, \textsf {no}\}\)-states will be considered, and a \(\{\textsf {in}, \textsf {no}\}\)-state at time t with respect to \(\varPhi ^t\) may be simply called a state, leaving the set \(\varPhi ^t\) implicit.

Notation 5.3

  1. A \(\{\textsf {in}, \textsf {no}\}\)-state at time t with respect to \(\varPhi ^t\) may be denoted \(\mathrm {K}^t\) or \(s_t\).

  2. The set of all \(\{\textsf {in}, \textsf {no}\}\)-states with respect to \(\varPhi ^t\) is denoted \(\mathscr {K}(\varPhi ^t)\).

In Definition 5.5, a state at time t includes timestamps because any statement representation includes a timestamp. However, one may prefer to work with atemporalised states (as in classic MDP settings). For this reason, and similarly as we did for atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labellings (Definition 4.22), we define atemporalised states.

Definition 5.6

(Atemporalised \(\{\textsf {in}, \textsf {no}\}\)-state) Let \(\mathrm {K}^t\) denote a \(\{\textsf {in}, \textsf {no}\}\)-state at time t; its atemporalised \(\{\textsf {in}, \textsf {no}\}\)-state is the atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling of \(\mathrm {K}^t\).

We now present how to specify the transition between states in line with the memoryless setting of labellings, as specified in Eq. (5.8). As Example 5.3 illustrates, the number of arguments heading at some time t may be unmanageable from a computational perspective. As a workaround, the transition from a state \(s_{t-\varDelta }\) to the state \(s_t\) will be specified by a probabilistic labelling frame \(\langle G^t, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \) such that the argumentation graph \(G^t\) is built from a transition defeasible theory \(T^t\).

Of course, given a time sequence \(0, \varDelta , 2\cdot \varDelta , \ldots , n\cdot \varDelta \), we will not write a transition defeasible theory by hand for every instant \(0, \varDelta , 2\cdot \varDelta , \ldots , n\cdot \varDelta \). Instead, we will have a template theory, denoted \(T^*\), where most variables (in particular temporal variables) are free to take any value of their domains. From a template theory \(T^*\), we can consider the ground theory, denoted \(T^{\sigma }\), of all the possible ground rules, conflicts, and superiority relations.

Consequently, any factor is constrained such that its scope only includes random labellings of arguments whose top conclusion holds at time t or \(t'\) with \(|t - t'| \le \varDelta \). As for notation, a factor with such a scope is denoted \(\phi ^t\).

Then, the transition from a state \(s_t\) to another state \(s_{t+\varDelta }\) will be performed through the ground theory relating the statements holding at time t and those statements presumably holding at time \(t+\varDelta \).

Accordingly, we take advantage of the memoryless setting so as to compute the state labellings ‘step by step’, and thereby avoid computing the labellings at all time steps at once: given the state labelling at t, we use the memoryless setting to obtain the labelling at time \(t+\varDelta \). To perform this ‘step by step’ computation, we formally define the transition theory heading at time t as follows.

Definition 5.7

(Transition theory heading at time t) Let \(T^{\sigma } = \langle Rul^{\sigma }, \textit{Conflict} ^\sigma , \succ ^{\sigma } \rangle \) denote a ground defeasible theory, and t a ground instant of time. The transition theory heading at time t is the defeasible theory \(T^{t} = \langle Rul^t \cup {AssumRul}^{t-\varDelta }, \textit{Conflict} ^{\sigma }, \succ ^{\sigma } \rangle \) where
$$\begin{aligned}&Rul^t = \{ r \mid r \in Rul^{\sigma }, \mathrm {Head}(r) = \square \varphi \mathrm {\,at\,} t \} \\&{AssumRul}^{t-\varDelta }= \{ \quad \Rightarrow \square \varphi \mathrm {\,at\,} t-\varDelta \mid (\square \varphi \mathrm {\,at\,} t-\varDelta ) \in \mathrm {Body}(r), r \in Rul^t \}. \end{aligned}$$
From Definition 5.7 we can see that rules in the transition theory heading at time t fall into two categories:
  • rules heading at time t, i.e. \(Rul^t\), and

  • rules leading to assumptions holding at time \(t-\varDelta \) appearing in the body of the rules in \(Rul^t\).
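
To make Definition 5.7 concrete, a minimal sketch of assembling the rules of a transition theory from a ground theory follows; the dictionary-based rule records (with their 'head' and 'body' fields) are an assumed representation, and negative (weak) premises are omitted for brevity.

    def transition_theory(ground_rules, t, delta):
        """Rules of the transition theory heading at time t (cf. Definition 5.7)."""
        # rules whose head holds at time t
        rules_t = [r for r in ground_rules if r['head'][1] == t]
        # assumption rules for the body statements holding at time t - delta
        assumptions = {(s, u) for r in rules_t for (s, u) in r['body'] if u == t - delta}
        assum_rules = [{'id': ('assume', s, u), 'head': (s, u), 'body': []}
                       for (s, u) in assumptions]
        return rules_t + assum_rules
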

The argumentation graph built from a transition theory \(T^{t}\), denoted by \(G^{t}\), is called the transition graph of the transition theory \(T^{t}\). Accordingly, we have two categories of arguments:
  • arguments whose conclusions hold at time t, and

  • assumptive arguments whose conclusions hold at time \(t-\varDelta \).

By the memoryless setting, we only need to consider these two categories of arguments in order to compute the state at \(t+\varDelta \) based on the state at t. More precisely, given a state \(s_t\) and a transition theory \(T^{t+\varDelta }\) heading at time \(t+\varDelta \), we will draw the next state \(s_{t+\varDelta }\). The next state is a labelling derived from a labelling \(\mathrm {L}^{t+\varDelta }\) of arguments of the transition graph \(G^{t+\varDelta }\), conditioned on a labelling of the assumptive arguments whose conclusions hold at time t according to the state \(s_t\).

Notation 5.4

A labelling of a graph \(G^{t+\varDelta }\) is denoted \((\mathrm {L}^{t+\varDelta }, \mathrm {L}^{t})\), where \(\mathrm {L}^{t+\varDelta }\) is the labelling of arguments whose conclusions hold at \(t+\varDelta \), and \(\mathrm {L}^{t}\) is the labelling of assumptive arguments whose conclusions hold at t.

Definition 5.8

(Argument-based transition) Let
  • \(\langle G^{t+\varDelta }, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \) denote a probabilistic labelling frame, where \(G^{t+\varDelta }\) is built from the transition theory \(T^{t+\varDelta }\) heading at time \(t+\varDelta \),

  • \({Assum}^{t}\) the assumptions at time t, \({Assum}^{t} = \{(\square \varphi \mathrm {\,at\,} t)\mid (\square \varphi \mathrm {\,at\,} t) \in {\mathrm {Assum}}(T^{t+\varDelta }) \}\),

  • \(\varPhi \supseteq \varPhi _G\) a set of statements,

  • \(\mathscr {K}(\varPhi ^t)\) the set of states at time t,

  • \(\mathscr {K}(\varPhi ^{t+\varDelta })\) the set of states at time \(t+\varDelta \),

  • \(s_t = \mathrm {K}^t\) a state at time t, \(\mathrm {K}^t\in \mathscr {K}(\varPhi ^t) \),

  • \(s_{t+\varDelta } = \mathrm {K}^{t+\varDelta }\) a state at time \(t+\varDelta \), \(\mathrm {K}^{t+\varDelta }\in \mathscr {K}(\varPhi ^{t+\varDelta }) \).

The argument-based transition function through \(\langle G^{t+\varDelta }, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \) is a probability function \(P: \mathscr {K}(\varPhi ^t) \times \mathscr {K}(\varPhi ^{t+\varDelta }) \rightarrow [0,1] \) such that:

\(P(s_{t+\varDelta } \mid s_{t}) = P( \mathbf {k}^{t+\varDelta } \mid \underset{A \in \textsf {ON}^t}{\bigcup \, \{ L_A = {\small {\textsf {ON}}}\}} \quad \underset{ A \in \textsf {OFF}^t }{\bigcup \, \{ L_A = {\small {\textsf {OFF}}}\} } )\)

where \({\small {\textsf {ON}}}^t = {\mathrm {AssumArg}}( {Assum}^{t}\cap \textsf {in}(\mathrm {K}^t))\), and \({\small {\textsf {OFF}}}^t = {\mathrm {AssumArg}}( {Assum}^{t}\backslash \textsf {in}(\mathrm {K}^t))\).

An argument-based transition through \(\langle G^{t+\varDelta }, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \) is a transition from the state \(s_t = \mathrm {K}^t\) to a state \(s_{t+\varDelta } = \mathrm {K}^{t+\varDelta }\), abbreviated \(\mathrm {K}^{t}, \langle G^{t+\varDelta }, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \rightarrow \mathrm {K}^{t+\varDelta }\), such that
$$\begin{aligned} s_{t+\varDelta } \sim P(s_{t+\varDelta } \mid s_{t}) \end{aligned}$$

From the state \(\smash {s_{t+\varDelta }}\) and the transition theory \(\smash {T^{t+2\cdot \varDelta }}\) we can make a transition to the state \(\smash {s_{t+2\cdot \varDelta }}\), and so on.

Theorem 5.1

Let \(P(s_{t+\varDelta } \mid s_{t})\) denote an argument-based transition function, and \(\smash {\mathrm {K}^{t}, \langle G^{t+\varDelta }, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \rightarrow \mathrm {K}^{t+\varDelta }}\) an argument-based transition from the state \(s_{t} = \mathrm {K}^{t}\) to the next state \(s_{t+\varDelta } = \mathrm {K}^{t+\varDelta }\). Then:
$$\begin{aligned} \sum _{s_{t+\varDelta }} P(s_{t+\varDelta } \mid s_{t}) = 1. \end{aligned}$$

Proof

By definition, an argument-based transition through \(\langle G^{t+\varDelta }, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \) is a transition from the state \(s_t = \mathrm {K}^t\) to a state \(s_{t+\varDelta } = \mathrm {K}^{t+\varDelta }\), such that \(\mathrm {K}^{t+\varDelta }\) is the labelling of statements holding at time \(t+\varDelta \) of a labelling \(\mathrm {L}^{t+\varDelta }\) of arguments drawn from \(\varOmega \):
$$\begin{aligned} \mathbf {l}^{t+\varDelta } \sim P\Bigg ( \mathbf {L}^{t+\varDelta } \mid \underset{A \in {\mathrm {AssumArg}}( {Assum}^{t}\cap \textsf {in}(\mathrm {K}^t))}{\bigcup \, \{ L_A = {\small {\textsf {ON}}}\}} \quad \underset{ A \in {\mathrm {AssumArg}}( {\mathrm {Assum}}^{t} \backslash \textsf {in}(\mathrm {K}^t)) }{\bigcup \, \{ L_A = {\small {\textsf {OFF}}}\} } \Bigg ). \end{aligned}$$
We have:
$$\begin{aligned} \sum _{ (\mathrm {L}^{t+\varDelta }, \mathrm {L}^t)\in \varOmega } P\Bigg ( \mathbf {l}^{t+\varDelta } \mid \underset{A \in {\mathrm {AssumArg}}( {Assum}^{t}\cap \textsf {in}(\mathrm {K}^t))}{\bigcup \, \{ L_A = {\small {\textsf {ON}}}\}} \quad \underset{ A \in {\mathrm {AssumArg}}( {\mathrm {Assum}}^{t} \backslash \textsf {in}(\mathrm {K}^t)) }{\bigcup \, \{ L_A = {\small {\textsf {OFF}}}\} } \Bigg )= 1. \end{aligned}$$
For any labelling \(\mathrm {L}^{t+\varDelta }\), we have one and only one state labelling \(\mathrm {K}(\mathrm {L}^{t+\varDelta },\cdot ) = \mathrm {K}^{t+\varDelta }\)\((= s_{t+\varDelta })\), therefore:
$$\begin{aligned} \sum _{s_{t+\varDelta }} P(s_{t+\varDelta } \mid s_{t}) = 1. \end{aligned}$$
\(\square \)

Our step-by-step computational setting thus defines a discrete-time Markov chain in terms of a graph of states over which any transition is ‘argued’ through an argumentation graph. Given a probabilistic labelling frame, we may consider all the possible states and the transition matrix, so that we can reuse common mathematical techniques to study such systems. It is also possible to run Monte-Carlo simulations of an argument-based Markov chain. In this regard, the argument-based transition to a state \(s_{t+\varDelta }\) can be computed in a time that is polynomial in the number of arguments of the argumentation graph \(G^{t+\varDelta }\) heading at \(t+\varDelta \).

Theorem 5.2

(Time complexity of an argument-based transition) Let \(\smash {\mathrm {K}^{t},} \langle G^{t+\varDelta }, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \smash {\rightarrow \mathrm {K}^{t+\varDelta }}\) denote an argument-based transition, where \(G^{t+\varDelta }\) is constructed from a probabilistic defeasible theory. Given the state \(\smash {\mathrm {K}^{t}}\), the time complexity of the argument-based transition is \(\smash {O(|Rul|\cdot |\mathscr {A}_G|^2 + |\mathscr {A}_G|^c)}\).

Proof

To draw a state, we proceed in three steps. In the first step, we draw a \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of \(\smash {G^{t+\varDelta }}\) built from the probabilistic theory \(\langle Rul, \textit{Conflict}, \succ \rangle \); by Lemma 5.1, this step is \(O(|Rul|\cdot |\mathscr {A}_G|^2)\). In the second step, we compute the grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling \(\mathrm {L}^{t+\varDelta }\) of the \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of \(\smash {G^{t+\varDelta }}\); by Lemma 4.1, this step is \(O(|\mathscr {A}_G|^c)\). In the third step, we compute the labelling \(\smash {\mathrm {K}^{t+\varDelta }}\) of a set \(\varPhi \) of statements from the labelling \(\mathrm {L}^{t+\varDelta }\), i.e. \(\mathrm {K}(\mathrm {L}^{t+\varDelta }, \varPhi ) = \mathrm {K}^{t+\varDelta } \); by Lemma 4.2, this step is \(\smash {O(|\varPhi | \times |{\small {\textsf {IN}}}(\mathrm {L}^{t+\varDelta })|)}\). Therefore, the time complexity of an argument-based transition is \(\smash {O(|Rul|\cdot |\mathscr {A}_G|^2 + |\mathscr {A}_G|^c)}\). \(\square \)
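
Putting the three steps of this proof together, one argument-based transition and a Monte-Carlo simulation of the resulting chain can be sketched as follows; the helpers reused from the earlier sketches, and the sampler sample_on_off conditioning the rule-based drawing on the assumptive arguments fixed by the current state, are illustrative assumptions rather than the paper's algorithms.

    def argument_based_transition(state, transition_graph, sample_on_off, conclusion, statements):
        """One argument-based transition from a state at t to a state at t + Delta."""
        # Step 1: draw a legal {ON, OFF}-labelling conditioned on the assumptions (Lemma 5.1)
        on_off = sample_on_off(state)
        off_args = {a for a, label in on_off.items() if label == 'OFF'}
        # Step 2: compute the grounded {IN, OUT, UND, OFF}-labelling (Lemma 4.1)
        arg_labels = grounded_off_labelling(transition_graph, off_args)
        # Step 3: derive the next state as a bivalent {in, no}-labelling (Lemma 4.2)
        return bivalent_labelling(arg_labels, conclusion, statements)

    def simulate_chain(initial_state, n_steps, inputs_at):
        """Monte-Carlo simulation of the argument-based Markov chain.

        `inputs_at(step)` is an assumed helper returning the transition graph,
        sampler, conclusion mapping and statements of the theory heading at that step."""
        state, trajectory = initial_state, [initial_state]
        for step in range(1, n_steps + 1):
            graph, sampler, conclusion, statements = inputs_at(step)
            state = argument_based_transition(state, graph, sampler, conclusion, statements)
            trajectory.append(state)
        return trajectory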

Given a time sequence \(0, \varDelta , \ldots , n\cdot \varDelta \), we have a sequence of transition theories \(\smash {T^0, T^\varDelta ,}\smash {\ldots , T^{n\cdot \varDelta }}\) and thus a sequence of probabilistic labelling frames \(\smash {\langle G^0, \mathscr {S}, \langle \varOmega , F, P^0 \rangle \rangle ,} \langle G^\varDelta , \mathscr {S}, \langle \varOmega , F, P^\varDelta \rangle \rangle , \ldots , \langle G^{n\cdot \varDelta }, \mathscr {S}, \langle \varOmega , F, P^{n\cdot \varDelta } \rangle \rangle \). Every probability distribution \(\smash {P^t}\) is parametrised by a set of factors \(\smash {\varPhi ^{t} }\), and for the sake of simplicity we posit that the probabilistic dependences amongst template arguments remain unchanged over time. So for any template argumentation graph \(\smash {G^t}\), the scope of factors remains unchanged. However, since we aim at capturing learning agents, we will see in the next section that the values of factors may change over time as they are updated by RL.

In summary, we have set up a logic-based and probabilistic memoryless argument-based transition framework to compute states ‘step by step’, so as to avoid computing the labellings of all time steps at once. By doing so, we have established an argument-based representation of Markov chains. This representation is our basis for building argument-based MDPs and reinforcement learning agents in the next section.

6 Argument-based Markov decision processes and learning agents

In this section, we show how an MDP setting for RL can be captured in our PA framework. We first specify argument-based constructs towards argument-based MDPs (Sect.  6.1), and possible factorisations of probability distributions (Sect. 6.2). Then we articulate these constructs to build argument-based MDPs and RL agents (Sect. 6.3).

6.1 Environment and agent representation

As a further instantiation of our argumentation setting, we now specify the statements (Definition 4.1) which are used to illustrate our approach in the remainder. At this stage, we do not overload our argumentation structure or labelling semantics with further particular relations or constraints based on these ‘modalised’ statements; by doing so we obtain considerable flexibility when modelling an agent. We give only the informal meaning of these statements, which is sufficient for our purposes.

\(\textsf {Hold}_{\textsf {obj}} \varphi \mathrm {\,at\,} t\)      

It holds, from an objective point of view, that \(\varphi \) at time t.

\(\textsf {Hold}_i \varphi \mathrm {\,at\,} t\)

It holds, from the point of view of agent i, that \(\varphi \) at time t.

\(\textsf {Des}_i \varphi \mathrm {\,at\,} t\)

The agent i desires \(\varphi \) at time t.

\(\textsf {Obl}_{i} \varphi \mathrm {\,at\,} t\)

It is obligatory for agent i to do \(\varphi \) at time t.

\(\textsf {Do}_i \varphi \mathrm {\,at\,} t\)

The agent i attempts to do \(\varphi \) at time t.

Epistemic information is indicated by the modalities \(\textsf {Hold}_{\textsf {obj}}\) and \(\textsf {Hold}_i\). The subscript \(\textsf {obj}\) indicates that an expression objectively holds, i.e. that it is the case (rather than being merely believed by an agent). We may say that \(\textsf {obj}\) embodies the objective point of view, in that it only accepts what is ‘true’.

Obligations and desires, expressed by the modalities \(\textsf {Obl}_i\) and \(\textsf {Des}_i\) respectively, are here to illustrate the ability of our argument-based framework, as we will see later, to explicitly investigate qualitative features and their interactions that are often left implicit in common MDP settings. Desires (and the reasoning leading to them) determine what is desirable or undesirable, and to what extent an agent wants to reach or avoid it (assuming that the environment does not interfere), while the compliance with or the infringement of an obligation determines environmental interferences such as punitive sanctions. In this paper, obligations and desires are not meant to be learnt through reinforcement. In that regard, we do not exploit here the view that an agent desires to perform an action in light of its superior utility, but we remark that the proposed framework may be extended to host such considerations.

Actions, expressed by the modality \(\textsf {Do}_i\), are not necessarily successful, because agents operate in a probabilistic non-monotonic framework and their behaviour is governed by defeasible arguments. In this framework, we do not cater for intentions. Thus, \((\textsf {Do}_i \varphi \mathrm {\,at\,} t)\) stands for i’s attempt to do \(\varphi \) at time t regardless of any intention. The content \(\varphi \) of the action \((\textsf {Do}_i \varphi \mathrm {\,at\,} t)\) may characterise a feature of a state, e.g. we may write \((\textsf {Do}_i \mathrm{safe}\mathrm {\,at\,} t)\), or it may characterise an action, e.g. we may write \((\textsf {Do}_i \mathrm{care}\mathrm {\,at\,} t)\) where ‘\(\mathrm{care}\)’ denotes a careful action eventually leading to a safe state. Actions \((\textsf {Do}_i \varphi \mathrm {\,at\,} t)\) are distinct from the abstract atomic actions conceived in MDPs in Sect. 3. MDP actions are reappraised later as ‘attitudes’, which will allow us to provide more fine-grained characterisations of agency, for example to characterise the absence or omission of any action.

Since we aim to model RL agents, for which reinforcement signals are essential, we also assume sanction statements of the form \(\textsf {Hold}_i \mathrm{util}(u, \alpha )\) to indicate a scalar utility u received by agent i, where \(\alpha \) is a unique identifier to distinguish the utility amongst others (we may omit the argument \(\alpha \) to avoid overloading the presentation). So if the statement (\(\textsf {Hold}_i \mathrm{util}(10) \mathrm {\,at\,} t\)) happens to be labelled \(\textsf {in}\), then the agent receives the reward 10 at time t. Such an expression of sanctions within the argumentation setting will allow us to build complex utility functions shaping the mental attitudes of an agent in interaction with a (normative) environment. For example, we may experiment with the case where the overall utility of something is the sum of its intrinsic utilities as well as extrinsic utilities resulting from positive and negative environmental interferences, such as sanctions from the infringement of an obligation.

Now, the above constructions are not enough to distinguish whether an expression (for example an obligation) holds in the mind of the agent i or from an objective point of view. This distinction is particularly important to model the internalisation of an obligation. To draw it, there are two technical alternatives: either
  • we build two different theories and each theory contains the rules associated with an agent i or the environment, or

  • we build only one theory and we prefix every rule to indicate ‘where’ the rule holds (i.e. in the mind of the agent or in the wild).

In this paper, the first alternative is employed because it eases the modelling of an agent perceiving its environment and internalising it. Accordingly, we have two theories:
  • one defeasible theory, denoted \(T_{\textsf {obj}},\) representing the environment; and

  • one defeasible theory, denoted \(T_{i},\) representing the agent i.

Example 6.1

Let us capture the MDP in Example 3.1 by means of
  • an environment theory \(\textsf {T}_{\textsf {obj}} = \langle Rul_{\textsf {obj}}, \textit{Conflict} _{\textsf {obj}}, \emptyset \rangle \), and

  • an agent theory \(\textsf {T}_{i} = \langle Rul_{i}, \textit{Conflict} _{i}, \emptyset \rangle \).

The two theories are specified as follows.
  • The set \(Rul_{\textsf {obj}}\) exactly contains the environment rules below, with their informal meaning. When no accident occurs at time t then the state is safe at time t, otherwise it is dangerous:
    $$\begin{aligned}&\textsf {s}^t_{\textsf {obj}}:\, \sim \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{safe}\,\mathrm {\,at\,} \, t \end{aligned}$$
    (6.1)
    $$\begin{aligned}&\textsf {d}^t_{\textsf {obj}}: \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{danger}\,\mathrm {\,at\,} \, t \end{aligned}$$
    (6.2)
    Each action leads to a reward specified as follows:
    $$\begin{aligned} \textsf {outc}^{t+1}_{\textsf {obj}}: \textsf {Do}_i \mathrm{care}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_i \mathrm{util}(1)\,\mathrm {\,at\,} \, t+1 \end{aligned}$$
    (6.3)
    $$\begin{aligned} \textsf {outn}^{t+1}_{\textsf {obj}}: \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_i \mathrm{util}(2)\,\mathrm {\,at\,} \, t+1 \end{aligned}$$
    (6.4)
These two rules have a consequent of the form \((\textsf {Hold}_{i} \mathrm{util}(u)\,\mathrm {\,at\,} \, t)\), thus the rewards hold from the point of view of agent i, but the rules defining them belong to the environment theory: it holds, from an objective point of view, that some utility holds for agent i at time t. Unfortunately, an accident may occur:
    $$\begin{aligned}&\textsf {accc}^{t+1}_{\textsf {obj}}: \textsf {Do}_{i} \mathrm{care}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t +1 \end{aligned}$$
    (6.5)
    $$\begin{aligned}&\textsf {accsn}^{t+1}_{\textsf {obj}} : \textsf {Hold}_{\textsf {obj}} \mathrm{safe}\,\mathrm {\,at\,} \, t , \textsf {Do}_{i} \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t +1 \end{aligned}$$
    (6.6)
    $$\begin{aligned}&\textsf {accdn}^{t+1}_{\textsf {obj}} : \textsf {Hold}_{\textsf {obj}} \mathrm{danger}\,\mathrm {\,at\,} \, t , \textsf {Do}_{i} \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t+1 \end{aligned}$$
    (6.7)
    and when an accident occurs then the agent is eventually harmed:
    $$\begin{aligned} \textsf {outacc}^{t}_{\textsf {obj}}: \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{i} \mathrm{util}(-12)\,\mathrm {\,at\,} \, t \end{aligned}$$
    (6.8)
  • Concerning the set of agent rules \(Rul_i\), we assume that the agent has complete information, so when the state is safe (dangerous resp.) then the agent holds that it is safe (dangerous resp.):
    $$\begin{aligned}&\textsf {s}^t_{i}: \textsf {Hold}_{\textsf {obj}} \mathrm{safe}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{i} \mathrm{safe}\,\mathrm {\,at\,} \, t \end{aligned}$$
    (6.9)
    $$\begin{aligned}&\textsf {d}^t_{i}: \textsf {Hold}_{\textsf {obj}} \mathrm{danger}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{i} \mathrm{danger}\,\mathrm {\,at\,} \, t \end{aligned}$$
    (6.10)
    Whatever the state, the agent can act with care or with negligence:
    $$\begin{aligned}&\textsf {sc}^t_{i}: \textsf {Hold}_i \mathrm{safe}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{care}\,\mathrm {\,at\,} \, t \end{aligned}$$
    (6.11)
    $$\begin{aligned}&\textsf {sn}^t_{i}: \textsf {Hold}_i \mathrm{safe}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \end{aligned}$$
    (6.12)
    $$\begin{aligned}&\textsf {dc}^t_{i}: \textsf {Hold}_i \mathrm{danger}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{care}\,\mathrm {\,at\,} \, t \end{aligned}$$
    (6.13)
    $$\begin{aligned}&\textsf {dn}^t_{i}: \textsf {Hold}_i \mathrm{danger}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \end{aligned}$$
    (6.14)
The sets of conflicts of the two theories are as follows:
$$\begin{aligned} \textit{Conflict} _{\textsf {obj}} = \{ (\square \mathrm{danger}\mathrm {\,at\,} t, \square \mathrm{safe}\mathrm {\,at\,} t), (\square \mathrm{safe}\mathrm {\,at\,} t, \square \mathrm{danger}\mathrm {\,at\,} t) \} \end{aligned}$$
(6.15)
$$\begin{aligned} \begin{aligned} \textit{Conflict} _{i} = \{&(\square \mathrm{danger}\mathrm {\,at\,} t, \square \mathrm{safe}\mathrm {\,at\,} t), (\square \mathrm{care}\mathrm {\,at\,} t, \square \mathrm{neglect}\mathrm {\,at\,} t) \\&(\square \mathrm{safe}\mathrm {\,at\,} t, \square \mathrm{danger}\mathrm {\,at\,} t), (\square \mathrm{neglect}\mathrm {\,at\,} t, \square \mathrm{care}\mathrm {\,at\,} t) \} \end{aligned} \end{aligned}$$
(6.16)
Therefore, and for instance, the statements \((\textsf {Hold}_i \mathrm{danger}\mathrm {\,at\,} t)\) and \((\textsf {Hold}_i \mathrm{safe}\mathrm {\,at\,} t)\) conflict, and the statements \((\textsf {Do}_i \mathrm{care}\mathrm {\,at\,} t)\) and \((\textsf {Do}_i \mathrm{neglect}\mathrm {\,at\,} t)\) conflict too. \(\square \)
As illustrated in the example above, we make the following assumptions on environment and agent theories.
  • every rule in an environment theory heading at time t has a consequent holding at time t and each antecedent holds at time t or \(t-\varDelta \).

  • every rule in an agent theory at time t has a consequent holding at time t, because an agent behaves at time t (in the MDP setting, an agent performs an action \(a_t\)) on the basis of a state at the same time t (denoted \(s_t\) in the MDP setting).

On the basis of the above assumptions, we can now define ground environment theories heading at time t (which echo transition theories heading at time t, see Definition 5.7).

Definition 6.1

(Environment theory heading at time t) Let \(T^{\sigma } = \langle Rul^{\sigma }, \textit{Conflict} ^{\sigma }, \succ ^{\sigma } \rangle \) denote a ground environment defeasible theory, and t a ground instant of time. The environment theory of \(T^{\sigma }\) heading at time t is the defeasible theory \(T^{t}_{\textsf {obj}}= \langle Rul^t \cup {AssumRul}^{t-\varDelta }, \textit{Conflict} ^{\sigma }, \succ ^{\sigma } \rangle \) where
$$\begin{aligned}&Rul^t = \{ r \mid r \in Rul^{\sigma },\mathrm {Head}(r) = \square \varphi \mathrm {\,at\,} t \}\\&{AssumRul}^{t-\varDelta } = \{ \quad \Rightarrow \square \varphi \mathrm {\,at\,} t-\varDelta \mid (\square \varphi \mathrm {\,at\,} t-\varDelta ) \in \mathrm {Body}(r), r \in Rul^t \}. \end{aligned}$$
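A minimal sketch of this construction, under representation choices of ours (rules as records with a timestamped head and timestamped body literals; the field and function names are not from the paper):

from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    body: list          # list of (literal, time) antecedents
    head: tuple         # (literal, time) consequent

def environment_theory_heading_at(rules, t, delta):
    """Keep the rules whose head holds at time t, and add an assumption rule
    (empty body) for every antecedent holding at time t - delta."""
    rul_t = [r for r in rules if r.head[1] == t]
    assumptions = {lit_time for r in rul_t for lit_time in r.body
                   if lit_time[1] == t - delta}
    assum_rules = [Rule(f"assume_{lit}_{tm}", [], (lit, tm))
                   for (lit, tm) in sorted(assumptions)]
    return rul_t + assum_rules

# Illustrative ground rules in the spirit of the running example, for t = 1.
rules = [
    Rule("s1_obj",     [("no_accident", 1)],              ("safe", 1)),
    Rule("accsn1_obj", [("safe", 0), ("do_neglect", 0)],  ("accident", 1)),
    Rule("s0_obj",     [("no_accident", 0)],              ("safe", 0)),
]
for r in environment_theory_heading_at(rules, t=1, delta=1):
    print(r.name, r.body, "=>", r.head)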

We assume that rules leading to a reinforcement signal (\(\textsf {Hold}_i \mathrm{util}(u) \mathrm {\,at\,} t\)) are part of the environment theory. Arguably, the utilities could be computed from a dedicated theory specifying both intrinsic and extrinsic utilities, but for the sake of simplicity, these reinforcement signals will be computed within the scope of the environment theory. By doing so, the utility of an agent’s attitude will be computed when making a transition from one state to another.

Alongside the environment theory heading at time t, an agent behaves at time t in line with the agent theory at time t.

Definition 6.2

(Agent theory at time t) Let \(T^{\sigma } = \langle Rul^{\sigma }, \textit{Conflict} ^{\sigma }, \succ ^{\sigma } \rangle \) denote a ground agent defeasible theory, and t a ground instant of time. The agent theory of \(T^{\sigma }\) at time t is the ground defeasible theory \(T^{t}_i= \langle Rul^t \cup {AssumRul}^{t}, \textit{Conflict} ^{\sigma }, \succ ^{\sigma } \rangle \) where
$$\begin{aligned} Rul^t = \{ r \mid r \in Rul^{\sigma }&~and~\mathrm {Head}(r) = \square \varphi \mathrm {\,at\,} t \\&~and~ \{ (\square ' \varphi ' \mathrm {\,at\,} t-\varDelta ) \mid (\square ' \varphi ' \mathrm {\,at\,} t-\varDelta ) \in \mathrm {Body}(r) \} = \emptyset \\&~and~ \{ \sim (\square ' \varphi ' \mathrm {\,at\,} t-\varDelta ) \mid \sim (\square ' \varphi ' \mathrm {\,at\,} t-\varDelta ) \in \mathrm {Body}(r) \} = \emptyset \}\\ {AssumRul}^{t} = \{ \quad \Rightarrow&\textsf {Hold}_{\textsf {obj}}\varphi \mathrm {\,at\,} t \mid (\textsf {Hold}_{\textsf {obj}}\varphi \mathrm {\,at\,} t ) \in \mathrm {Body}(r), r \in Rul^t\} \end{aligned}$$
On the basis of environment and agent theories, we set up two probabilistic labelling frames at any timestamp t:
  • an environment probabilistic labelling frame \(\smash {\langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle }\) where the argumentation graph \(G^{t}_\textsf {obj}\) is built from the environment theory \(\smash {T^{t}_\textsf {obj}}\) heading at time t;

  • an agent probabilistic labelling frame \(\smash {\langle G^{t}_i, \mathscr {S}, (\varOmega _i, F_i, P_i) \rangle }\) where the argumentation graph \(G^{t}_i\) is built from the agent theory \(T^{t}_i\) at time t.

In both frames, the labelling specification \(\mathscr {S}\) is the specification of legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings (Definition 4.17).
As we have previously defined a state as a labelling (Definition 5.5), the two types of theories induce two types of state labellings:
  • a labelling from the environment theory at time t: this labelling is the state of the environment at time t, and we may just call it the state at time t,

  • a labelling from the agent theory at time t: this labelling is the state of the agent at time t, and we call it the attitude of the agent at time t.

Since a labelling concerns arguments or statements, states can be understood in terms of arguments or in terms of statements.

Definition 6.3

(Environment state) Let \(\displaystyle \langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle \) be an environment probabilistic labelling frame, where argumentation graph \(G^{t}_\textsf {obj}\) is built from the environment theory \(\smash {T^{t}_\textsf {obj}}\) heading at time t.
  • An argument environment state at time t (often denoted \(\mathrm {L}^t_\textsf {obj}\)) is a labelling of the set of arguments \(\mathscr {A}^t \subseteq \mathscr {A}_G\) whose conclusions hold at time t, and such that every argument is labelled as within a labelling \(\mathrm {L}_\textsf {obj}\) in the sample space \(\varOmega _\textsf {obj}\), i.e. \(\forall A \in \mathscr {A}^t, \mathrm {L}^t_\textsf {obj}(A) = \mathrm {L}_\textsf {obj}(A)\).

  • A statement environment state at time t (often denoted \(\mathrm {K}^t_\textsf {obj}\)) is a labelling of statements at time t from an argument labelling of the sample space \(\varOmega _\textsf {obj}\).

Definition 6.4

(Agent state a.k.a. Attitude) Let \(\langle G^{t}_i, \mathscr {S}, (\varOmega _i, F_i, P_i) \rangle \) be an agent probabilistic labelling frame, where argumentation graph \(G^{t}_i\) is built from agent theory \(T^{t}_i\) at time t.
  • An argument attitude of agent i at time t (often denoted \(\mathrm {L}^t_i\)) is an argument labelling in the sample space \(\varOmega _i\).

  • A statement attitude of agent i at time t (often denoted \(\mathrm {K}^t_i\)) is a labelling of statements from an argument attitude of agent i at time t.

As we mentioned in the previous section, we can establish a template (environment or agent) theory \(\smash {T^{*}}\) (where variables are free to hold any value of their domains) to generate a sequence of ground (environment or agent) probabilistic labelling frames \(\mathfrak {F}_t = ( \displaystyle \langle G^{t}, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle )_t\). For every frame in this sequence, we can thus consider the set of (environment or agent) states \(\mathscr {K}^t\). The set of states of the sequence \(\mathfrak {F}_t\), denoted \(\mathscr {K}(\mathfrak {F}_t)\), is then the union of the states generated by the sequence over a set \(Times \) of instants of time:
$$\begin{aligned} \mathscr {K}(\mathfrak {F}_t) = \bigcup _{t \in Times} \mathscr {K}^t. \end{aligned}$$
(6.17)
An agent can observe a statement environment state. In this regard, amongst the agent’s rules, we say that a rule is a perception rule if at least one of its antecedents is of the form (\(\textsf {Hold}_\textsf {obj}\varphi \mathrm {\,at\,} t\)). Such an antecedent is called an observable statement.

Definition 6.5

(Agent observable statement) Let \(\langle G^{t}_i, \mathscr {S}, (\varOmega _i, F_i, P_i) \rangle \) be an agent probabilistic labelling frame, where the argumentation graph \(G^{t}_i\) is built from an agent theory \(T^{t}_i\) at time t. A statement is an agent observable statement if, and only if, it is a statement of the form \((\textsf {Hold}_{\textsf {obj}}\varphi \mathrm {\,at\,} t)\) and it is the conclusion of an argument in the set of arguments of \(G^{t}_i\).

Given any grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling \(\mathrm {L}_{i}\) of graph \(G^{t}_i\), an observable state is the set of agent observable statements labelled \(\textsf {in}\) within \(\mathrm {L}_{i}\). Since we focus on MDPs instead of POMDPs, an agent will always fully observe its environment state. Therefore the number of observable states equals the number of states in the underlying MDP.

Following folk psychology, the expression of an attitude is called a behaviour, which is the labelling of the actions of an attitude.

Definition 6.6

(Agent behaviour) Let \(\mathrm {K}^t_{i}\) be a statement attitude of agent i at time t. The behaviour of agent i at time t from statement attitude \(\mathrm {K}^t_{i}\) is a \(\{\textsf {in}, \textsf {no}\}\)-labelling \(\mathrm {B}^t_i\) such that
$$\begin{aligned}\begin{gathered} \textsf {in}(\mathrm {B}^t_i) = \{ (\textsf {Do}_i \varphi \mathrm {\,at\,} t) \mid (\textsf {Do}_i \varphi \mathrm {\,at\,} t) \in \textsf {in}(\mathrm {K}^{t}_{i}) \}, \\ \textsf {no}(\mathrm {B}^t_i) = \{ (\textsf {Do}_i \varphi \mathrm {\,at\,} t) \mid (\textsf {Do}_i \varphi \mathrm {\,at\,} t) \in \textsf {no}(\mathrm {K}^{t}_{i}) \}. \end{gathered}\end{aligned}$$

Thus a behaviour may be such that no actions are performed, and in this case we may say that the agent is inhibited. Given an agent behaviour, actions performed through this behaviour are defined next.

Definition 6.7

(Agent action) Let \(\mathrm {B}^t_i\) be a behaviour of agent i at time t, the set of actions of agent i at time t, denoted \(\mathrm {A}^t_i\), is the set of actions labelled \(\textsf {in}\) within the behaviour \(\mathrm {B}^t_i\), i.e. \(\mathrm {A}^t_i = \textsf {in}(\mathrm {B}^t_i)\).
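Behaviours and actions can be extracted from a statement attitude with a few lines of code; the sketch below assumes a statement attitude is stored as two sets of statement strings and that action statements are recognised by a 'Do_i' prefix (a convention of ours, not of the paper):

def behaviour(attitude_in, attitude_no):
    """Restrict a statement attitude to its action statements (cf. Definition 6.6)."""
    is_action = lambda s: s.startswith("Do_i ")
    return ({s for s in attitude_in if is_action(s)},
            {s for s in attitude_no if is_action(s)})

def actions(behaviour_in, behaviour_no):
    """The actions performed are the action statements labelled 'in' (cf. Definition 6.7)."""
    return set(behaviour_in)

# A possible statement attitude in the safe state, written as plain strings.
att_in = {"Hold_obj safe at 0", "Hold_i safe at 0", "Do_i care at 0"}
att_no = {"Do_i neglect at 0", "Hold_obj danger at 0", "Hold_i danger at 0"}

b_in, b_no = behaviour(att_in, att_no)
print(b_in)                 # {'Do_i care at 0'}
print(actions(b_in, b_no))  # {'Do_i care at 0'}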

Example 6.2

Consider the template agent theory \(\textsf {T}_i = \langle Rul_{i}, \textit{Conflict} _{i}, \emptyset \rangle \) given in Example 6.1, and let us build from it the template argumentation graph describing agent i, as shown in Fig. 7.

We may focus on the agent theory \(\textsf {T}^0_i\) describing the agent i at time 0. From theory \(\textsf {T}^0_i\), we can build the argumentation graph \(\textsf {G}_i^0\), as shown in Fig.  8.
  • Arguments \(\textsf {S}_{\textsf {obj}}^0\) and \(\textsf {D}_{\textsf {obj}}^0\) are assumptive arguments supporting practical arguments leading to careful or negligent actions.

  • The set \(\mathrm {O}^0_i\) of observable statements at time 0 (that can be labelled \(\textsf {on}\) or \(\textsf {off}\) with respect to the agent’s state) is the following:
    $$\begin{aligned} \begin{aligned} \mathrm {O}^0_i = \{ ( \textsf {Hold}_\textsf {obj}\mathrm{safe}\mathrm {\,at\,} 0), (\textsf {Hold}_\textsf {obj}\mathrm{danger}\mathrm {\,at\,} 0)\}. \end{aligned} \end{aligned}$$
    (6.18)
  • An argument attitude \(\mathrm {L}^0_i\) of the agent i at time 0 may be the following (as \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling):
    $$\begin{aligned} \mathrm {L}^0_i = \langle \{ \textsf {S}^0_{\textsf {obj}}, \textsf {S}^0, \textsf {SC}^0 \}, \{ \textsf {SN}^0, \textsf {D}^0_{\textsf {obj}}, \textsf {D}^0, \textsf {DC}^0, \textsf {DN}^0 \} \rangle . \end{aligned}$$
    (6.19)
  • The corresponding statement attitude \(\mathrm {K}^0_i\) is as follows:
    $$\begin{aligned} \begin{aligned} \mathrm {K}^0_i = \langle&\{ (\textsf {Hold}_{\textsf {obj}} \mathrm{safe}\mathrm {\,at\,} 0), (\textsf {Hold}_{i} \mathrm{safe}\mathrm {\,at\,} 0), (\textsf {Do}_{i} {\mathrm{care}} \mathrm {\,at\,} 0) \}, \\&\{ (\textsf {Do}_i\mathrm{neglect}\mathrm {\,at\,} 0), (\textsf {Hold}_{\textsf {obj}}\mathrm{danger}\mathrm {\,at\,} 0), (\textsf {Hold}_i \mathrm{danger}\mathrm {\,at\,} 0) \} \rangle . \end{aligned} \end{aligned}$$
    (6.20)
  • The behaviour \(\mathrm {B}^0_i\) is as follows:
    $$\begin{aligned} \begin{aligned} \mathrm {B}^0_i = \langle \{ (\textsf {Do}_{i} \mathrm{care}\mathrm {\,at\,} 0) \}, \{ (\textsf {Do}_i\mathrm{neglect}\mathrm {\,at\,} 0) \} \rangle . \end{aligned} \end{aligned}$$
    (6.21)
  • The set of actions is \(\mathrm {A}^0_i = \{ (\textsf {Do}_{i} \mathrm{care}\mathrm {\,at\,} 0) \}\). \(\square \)

Fig. 7

A template argumentation graph describing an agent i. For the sake of clarity, some induced attacks are not drawn (if an argument A is a subargument of B and C attacks A, then C attacks B)

Fig. 8

Argumentation graph describing agent i at time 0

To recap, we have defined argument-based constructs towards argument-based MDPs. From an environment theory, we build an environment probabilistic labelling frame, where any labelling of the sample space is defined as an environment state. Similarly, from an agent theory, we have an agent probabilistic labelling frame, where any labelling of the sample space is defined as an agent attitude. We have shown how behaviours and actions can be characterised in such a setting. As the probability distributions have been left unspecified, possible factorisations of environment and agent probabilistic labelling frames are discussed next.

6.2 Environment and agent factorisation

A major difficulty in using PA regards its computational complexity. To address this difficulty, we have proposed factors and probabilistic rules in Sect. 4. We now show how factorisations can be specified by means of probabilistic defeasible rules to describe an agent and its environment.

If the values of all the factors are fixed, then the agent is a non-learning agent. A non-learning agent cannot change factor values: we say that these factors are unreinforceable. To build an RL agent, we now consider reinforceable factors, and an RL agent is meant to learn the values of these reinforceable factors. So, factors are partitioned into two sets:
  • the set of unreinforceable factors whose energy values cannot be changed;

  • the set of reinforceable factors whose energy values can be changed by the agent.

Notation 6.1

A reinforceable factor is represented by the symbol \(\otimes \) (as a controllable steering wheel) instead of the usual notation \(\phi \), whereas an unreinforceable factor is specified by the symbol \(\odot \) (as an uncontrollable steering wheel).

Example 6.3

Suppose that an agent is described by the argumentation graph shown in Fig. 8. We may factorise the joint probability with the unreinforceable factors \(\odot _1(S^0_{\textsf {obj}}, S^0)\) and \(\odot _2(D^0_{\textsf {obj}}, D^0)\) and the reinforceable factor \(\otimes (S^0, SC^0, SN^0, D^0, DC^0, DN^0)\):
$$\begin{aligned} \begin{aligned}&P(S^0_{\textsf {obj}}, S^0, SC^0, SN^0, D^0_{\textsf {obj}}, D^0, DC^0, DN^0) \\&\quad = \odot _1(S^0_{\textsf {obj}}, S^0) \cdot \odot _2 (D^0_{\textsf {obj}}, D^0) \cdot \otimes (S^0, SC^0, SN^0, D^0, DC^0, DN^0). \end{aligned} \end{aligned}$$
(6.22)
Suppose that the agent is in the safe state at time 0, i.e. we have a \(\{\textsf {in}, \textsf {no}\}\)-state \(\mathbf {k}^0_{\textsf {obj}} = ( \{\textsf {Hold}_\textsf {obj}\mathrm{safe}\mathrm {\,at\,} 0\}, \{\textsf {Hold}_\textsf {obj}\mathrm{danger}\mathrm {\,at\,} 0\} )\). As we will see, the argument \(\textsf {S}_{\textsf {obj}}^{0}\) will be labelled \({\small {\textsf {ON}}}\), and the argument \(\textsf {D}_{\textsf {obj}}^{0}\) labelled \({\small {\textsf {OFF}}}\). Possible assignments of random labellings of the reinforceable factor \(\otimes \) conditioned to the safe state are given in Table 1 (where the argument \(\textsf {D}^0\) and all the arguments with the subargument \(\textsf {D}^0\) are not shown). Though the scopes of factors are \(\{ {{\small {\textsf {ON}}}}, {\small {\textsf {OFF}}}\}\)-labellings, we present here the grounded \(\{ {\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings, conditioned to the safe state. To retrieve the corresponding \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings, it is sufficient to relabel any argument labelled \({\small {\textsf {IN}}}\), \({\small {\textsf {OUT}}}\) or \({\small {\textsf {UND}}}\) as \({\small {\textsf {ON}}}\).
Table 1

View on the assignments of the reinforceable factor \(\otimes \) conditioned to the safe state (each column is one possible assignment)

\(L_{\textsf {S}^0}: \textsf {Hold}_i \mathrm{safe}\mathrm {\,at\,} 0\)      \({\small {\textsf {IN}}}\)   \({\small {\textsf {IN}}}\)    \({\small {\textsf {IN}}}\)    \({\small {\textsf {IN}}}\)

\(L_{\textsf {SC}^0}: \textsf {Do}_i \mathrm{care}\mathrm {\,at\,} 0\)      \({\small {\textsf {IN}}}\)   \({\small {\textsf {OFF}}}\)   \({\small {\textsf {UND}}}\)   \({\small {\textsf {OFF}}}\)

\(L_{\textsf {SN}^0}: \textsf {Do}_i \mathrm{neglect}\mathrm {\,at\,} 0\)   \({\small {\textsf {OFF}}}\)   \({\small {\textsf {IN}}}\)   \({\small {\textsf {UND}}}\)   \({\small {\textsf {OFF}}}\)

The agent, graphically described in Fig. 8 with its factorisation in Eq. 6.22, can change the values of the reinforceable factor in the pursuit of optimal attitudes. Another agent may be described by the graph given in Fig. 9, where there is an unreinforceable argument \(\textsf {C}^t\), modelling an unreinforceable tendency of the agent to act with care.
Fig. 9

Argumentation graph describing agent i characterised by an unreinforceable argument (\(C^t\)) modelling a tendency to act with care

In this case, we can define an unreinforceable factor \(\odot _3(C^0)\), and thus the joint distribution can be factorised as follows:
$$\begin{aligned} \begin{aligned}&P(S^0_{\textsf {obj}}, S^0, SC^0, SN^0, D^0_{\textsf {obj}}, D^0, DC^0, DN^0, C^0) \\&\quad = \odot _1(S^0_{\textsf {obj}}, S^0)\cdot \odot _2(D^0_{\textsf {obj}}, D^0)\cdot \otimes (S^0, SC^0, SN^0, D^0, DC^0, DN^0) \cdot \odot _3(C^0). \end{aligned} \end{aligned}$$
(6.23)
\(\square \)
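To see how such a factorisation can be evaluated, here is a small sketch in which factors are tables mapping tuples of labels to non-negative weights, and the joint probability of a full assignment is the normalised product of factor values. The scopes loosely follow Eq. 6.22 (restricted to the ‘safe’ branch), but the weights are invented for illustration and are not part of the example:

from itertools import product

# Each factor is (scope, table), the table mapping a tuple of labels to a weight.
# The weights of the reinforceable factor are the values RL would later adjust.
factor_unrf = (("S_obj", "S"),
               {("ON", "ON"): 1.0, ("OFF", "OFF"): 1.0,
                ("ON", "OFF"): 0.0, ("OFF", "ON"): 0.0})
factor_rf = (("S", "SC", "SN"),
             {labs: 1.0 for labs in product(["ON", "OFF"], repeat=3)})

factors = [factor_unrf, factor_rf]
variables = ["S_obj", "S", "SC", "SN"]

def weight(assignment):
    """Unnormalised weight of a full {ON, OFF}-assignment: product of factor values."""
    w = 1.0
    for scope, table in factors:
        w *= table[tuple(assignment[v] for v in scope)]
    return w

# Normalise over all assignments to obtain the joint distribution.
all_assignments = [dict(zip(variables, labs))
                   for labs in product(["ON", "OFF"], repeat=len(variables))]
Z = sum(weight(a) for a in all_assignments)
joint = {tuple(sorted(a.items())): weight(a) / Z for a in all_assignments}
print(round(max(joint.values()), 3), round(min(joint.values()), 3))  # 0.125 0.0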

To obtain compact representations of an agent and its environment, the factorisation with reinforceable and unreinforceable factors may be achieved in various ways. In the following, we propose a factorisation along with probabilistic rules.

First of all, we assume that the marginal probability p of a rule may be fixed, and such a rule is unreinforceable. The marginal probability may also be changed by RL, and in this case we replace the value of p with an underscore:
$$\begin{aligned} r, \_ :\ \square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots , \square _n \varphi _n \mathrm {\,at\,} t_n, \sim \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_1, \ldots , \sim \square '_m \varphi '_m \mathrm {\,at\,} t'_m \Rightarrow \square \varphi \mathrm {\,at\,} t \end{aligned}$$
Such rules are reinforceable rules. If a factor scopes an argument with a reinforceable rule, then this factor is a reinforceable factor; otherwise it is not reinforceable, i.e. it is ‘unreinforceable’.

Definition 6.8

(Environment probabilistic defeasible theory) An environment probabilistic defeasible theory is a probabilistic defeasible theory \(T = \langle Rul, \textit{Conflict}, \succ \rangle \) where every probabilistic defeasible rule in \(Rul\) is unreinforceable.

By contrast, an agent probabilistic defeasible theory is a probabilistic defeasible theory where a probabilistic rule may be reinforceable. If all the rules of an agent theory are unreinforceable, then this agent is a non-learning agent.

Example 6.4

Let us reappraise the rules given in Example 6.1. They can be transformed into probabilistic defeasible rules as follows.

When no accident occurs at time t then the state is necessarily safe at time t, otherwise it is dangerous:
$$\begin{aligned}&\textsf {s}^t_{\textsf {obj}}, 1:\, \sim \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{safe}\,\mathrm {\,at\,} \, t \end{aligned}$$
(6.24)
$$\begin{aligned}&\textsf {d}^t_{\textsf {obj}}, 1: \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{danger}\,\mathrm {\,at\,} \, t \end{aligned}$$
(6.25)
Each action necessarily leads to a reward specified as follows:
$$\begin{aligned}&\textsf {outc}^{t+1}_{\textsf {obj}}, 1: \textsf {Do}_i \mathrm{care}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_i \mathrm{util}(1)\,\mathrm {\,at\,} \, t+1 \end{aligned}$$
(6.26)
$$\begin{aligned}&\textsf {outn}^{t+1}_{\textsf {obj}}, 1: \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_i \mathrm{util}(2)\,\mathrm {\,at\,} \, t+1 \end{aligned}$$
(6.27)
Unfortunately, an accident may occur:
$$\begin{aligned}&\textsf {accc}^{t+1}_{\textsf {obj}}, 0.01: \textsf {Do}_{i} \mathrm{care}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t +1 \end{aligned}$$
(6.28)
$$\begin{aligned}&\textsf {accsn}^{t+1}_{\textsf {obj}}, 0.1 : \textsf {Hold}_{\textsf {obj}} \mathrm{safe}\,\mathrm {\,at\,} \, t , \textsf {Do}_{i} \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t +1 \end{aligned}$$
(6.29)
$$\begin{aligned}&\textsf {accdn}^{t+1}_{\textsf {obj}}, 0.2 : \textsf {Hold}_{\textsf {obj}} \mathrm{danger}\,\mathrm {\,at\,} \, t , \textsf {Do}_{i} \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t+1 \end{aligned}$$
(6.30)
When an accident occurs then the agent is necessarily harmed:
$$\begin{aligned} \textsf {outacc}^{t}_{\textsf {obj}}, 1 : \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{i} \mathrm{util}(-12)\,\mathrm {\,at\,} \, t \end{aligned}$$
(6.31)
Concerning the set of rules \(Rul_i\) of the agent theory, we assume that the agent is a learning agent described by the unreinforceable rules
$$\begin{aligned}&\textsf {s}^t_{i}, 1: \textsf {Hold}_{\textsf {obj}} \mathrm{safe}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{i} \mathrm{safe}\,\mathrm {\,at\,} \, t \end{aligned}$$
(6.32)
$$\begin{aligned}&\textsf {d}^t_{i}, 1: \textsf {Hold}_{\textsf {obj}} \mathrm{danger}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{i} \mathrm{danger}\,\mathrm {\,at\,} \, t \end{aligned}$$
(6.33)
and reinforceable rules
$$\begin{aligned}&\textsf {sc}^t_{i}, \_: \textsf {Hold}_i \mathrm{safe}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{care}\,\mathrm {\,at\,} \, t \end{aligned}$$
(6.34)
$$\begin{aligned}&\textsf {sn}^t_{i}, \_: \textsf {Hold}_i \mathrm{safe}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \end{aligned}$$
(6.35)
$$\begin{aligned}&\textsf {dc}^t_{i}, \_: \textsf {Hold}_i \mathrm{danger}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{care}\,\mathrm {\,at\,} \, t\end{aligned}$$
(6.36)
$$\begin{aligned}&\textsf {dn}^t_{i}, \_: \textsf {Hold}_i \mathrm{danger}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \end{aligned}$$
(6.37)
\(\square \)
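For simulation purposes, the probabilistic rules of this example can be written down as plain data; in the sketch below a marginal probability of None marks a reinforceable rule (an encoding choice of ours, not a notation from the paper):

# (name, probability, body, head); probability None marks a reinforceable rule.
ENV_RULES = [
    ("s_obj",      1.0,  ["~Hold_obj accident at t"],                   "Hold_obj safe at t"),
    ("d_obj",      1.0,  ["Hold_obj accident at t"],                    "Hold_obj danger at t"),
    ("outc_obj",   1.0,  ["Do_i care at t"],                            "Hold_i util(1) at t+1"),
    ("outn_obj",   1.0,  ["Do_i neglect at t"],                         "Hold_i util(2) at t+1"),
    ("accc_obj",   0.01, ["Do_i care at t"],                            "Hold_obj accident at t+1"),
    ("accsn_obj",  0.1,  ["Hold_obj safe at t", "Do_i neglect at t"],   "Hold_obj accident at t+1"),
    ("accdn_obj",  0.2,  ["Hold_obj danger at t", "Do_i neglect at t"], "Hold_obj accident at t+1"),
    ("outacc_obj", 1.0,  ["Hold_obj accident at t"],                    "Hold_i util(-12) at t"),
]

AGENT_RULES = [
    ("s_i",  1.0,  ["Hold_obj safe at t"],   "Hold_i safe at t"),
    ("d_i",  1.0,  ["Hold_obj danger at t"], "Hold_i danger at t"),
    ("sc_i", None, ["Hold_i safe at t"],     "Do_i care at t"),
    ("sn_i", None, ["Hold_i safe at t"],     "Do_i neglect at t"),
    ("dc_i", None, ["Hold_i danger at t"],   "Do_i care at t"),
    ("dn_i", None, ["Hold_i danger at t"],   "Do_i neglect at t"),
]

reinforceable = [name for name, p, _, _ in AGENT_RULES if p is None]
print(reinforceable)  # ['sc_i', 'sn_i', 'dc_i', 'dn_i']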

In practice, given a probabilistic theory where any rule is either unreinforceable or reinforceable, we can first draw the reinforceable rules to yield a theory whose rules are each either drawn or unreinforceable. In any case, arguments can be unreinforceable or reinforceable.

Definition 6.9

(Reinforceable argument) An argument is reinforceable if, and only if, at least one of its rules is reinforceable.

Lemma 6.1

(Unreinforceable argument) An argument is unreinforceable if, and only if, all its rules are unreinforceable.

Corollary 6.1

An argument is reinforceable if, and only if, at least one of its subarguments is reinforceable.

Corollary 6.2

An argument is unreinforceable if, and only if, all its subarguments are unreinforceable.

Definition 6.10

(Reinforceable factor) A factor is a reinforceable factor if, and only if, its scope includes a random labelling of a reinforceable argument.

Lemma 6.2

(Unreinforceable factor) A factor is an unreinforceable factor if, and only if, all random labellings of its scope are random labellings of unreinforceable arguments.

Definition 6.11

(Reinforceable assignment) An assignment of random labellings is reinforceable if, and only if, at least one random labelling is the random labelling of a reinforceable argument.

Lemma 6.3

(Unreinforceable assignment) An assignment of random labellings is unreinforceable if, and only if, every random labelling is the random labelling of an unreinforceable argument.

On the basis of an environment probabilistic defeasible theory (Definition 6.8), where all rules are unreinforceable, we can consider an environment probabilistic labelling frame.

Definition 6.12

(Environment probabilistic labelling frame) Given an environment probabilistic defeasible theory \(\smash {T^{t}_\textsf {obj}}\) heading at time t, the environment probabilistic labelling frame heading at time t is a probabilistic labelling frame \(\smash {\langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle }\) where the argumentation graph \(G^{t}_\textsf {obj}\) is built from the environment theory \(\smash {T^{t}_\textsf {obj}}\) heading at time t.

We note that the factorisation of the distribution of an environment probabilistic labelling frame is left unspecified. It can be anything as long as it reflects the marginal probabilities of the unreinforceable rules. In practice, we can and will work with a factorisation such that unreinforceable rules are independent. Hence, we first draw unreinforceable rules independently, and from these rules, we build environment arguments.

Lemma 6.4

(Environment argument) Let \(\langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle \) denote an environment probabilistic labelling frame. Every argument in \(\mathscr {A}_G\) is unreinforceable.

Proof

By definition, the argumentation graph \(G^{t}_\textsf {obj}\) is built from an environment probabilistic defeasible theory, which we denote \(\smash {T^{t}_\textsf {obj}}\). By definition, every rule of \(\smash {T^{t}_\textsf {obj}}\) is unreinforceable (Definition 6.8). Consequently, all the rules of every argument in \(\mathscr {A}_G\) are unreinforceable. Therefore, every argument in \(\mathscr {A}_G\) is unreinforceable (Lemma 6.1). \(\square \)

Proposition 6.1

(Environment factor) Given an environment probabilistic labelling frame \(\langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle \), the distribution \(P_\textsf {obj}\) is parametrised by a set of factors \(\varPhi \) where every factor in \(\varPhi \) is unreinforceable.

Proof

Every argument in \(\mathscr {A}_G\) is unreinforceable (Lemma 6.4). Consequently, every factor in \(\varPhi \) only scopes random labellings of unreinforceable arguments. Therefore, every factor in \(\varPhi \) is unreinforceable (Lemma 6.2). \(\square \)

Proposition 6.2

(Environment assignment) Given an environment probabilistic labelling frame \(\langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle \), every assignment of random labellings is unreinforceable.

Proof

Every argument in \(\mathscr {A}_G\) is unreinforceable (Lemma 6.4). Consequently, every assignment of random labellings of arguments in \(\mathscr {A}_G\) is unreinforceable. Therefore, every assignment is unreinforceable (Lemma 6.3). \(\square \)

Lemma 6.4 and Propositions 6.1 and 6.2 show the unreinforceable character of environment probabilistic labelling frames. By contrast, an agent may be reinforceable. Manifold factorisations may be proposed. A simple factorisation holds in practical probabilistic labelling frames.

Definition 6.13

(Agent practical probabilistic labelling frame) Given an agent probabilistic defeasible theory \(T^{t}_i\) at time t, the agent practical probabilistic labelling frame at time t is a probabilistic labelling frame \(\smash {\langle G^{t}_i, \mathscr {S}, (\varOmega _i, F_i, P_i) \rangle }\) where
  • the argumentation graph \(G^{t}_i\) is built from the agent theory \(T^{t}_i\) at time t;

  • the distribution \(P_i\) is a Gibbs distribution such that all random labellings of reinforceable arguments and all their direct subarguments in \(\mathscr {A}_G\) are the scope of one and only one reinforceable factor.

The reinforceable factor of a practical distribution may be further broken down. For example, we can have one reinforceable factor per observable state. We leave further factorisation for future developments, since it is not essential for our present purposes.

To recap, we have shown how an environment can be described by an environment probabilistic labelling frame built from a probabilistic defeasible theory, while an agent is described by an agent probabilistic labelling frame built from another probabilistic defeasible theory. All the arguments of an environment probabilistic labelling frame are unreinforceable, while some arguments of an agent probabilistic labelling frame may be reinforceable. If a factor scopes a reinforceable argument, then this factor is a reinforceable factor, otherwise it is unreinforceable. The values of unreinforceable factors remain unchanged, while an RL agent may change the values of (reinforceable) assignments of reinforceable factors to adapt to its environment, as we will see next.

6.3 Animating reinforcement learning agents

Having assumed that an agent and its environment can be characterised by an agent probabilistic labelling frame and an environment probabilistic labelling frame respectively, we are now prepared to reformulate the MDP setting of (SARSA) RL agents into an argument-based MDP setting for argument-based (SARSA) RL agents on the basis of our PA framework. Whilst a traditional RL agent performs an action drawn from a set of possible actions, we propose that an agent performs an action justified in an attitude drawn from a set of possible attitudes, i.e. mental states. In this view, we move from a behavioural approach to agent modelling to a mentalistic approach. In the remainder, we first propose argument-based MDPs formalised in our PA setting, and then we show how to animate an argument-based RL agent in such MDPs.

In an argument-based MDP, once an agent has observed the state \(s_t\) and has performed an argument-based deliberation leading to an attitude \(\mathrm {K}^t_{i}\), and thus to actions \(\mathrm {A}^{t}_{i}\), we can draw the next state \(s_{t+\varDelta }\). By doing so, we have an argument-based MDP transition, which is a development of an argument-based transition, see Definition 5.8.

Definition 6.14

(Argument-based MDP transition) Let
  • \(\langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \) denote the environment probabilistic labelling frame built from the environment theory \(T^{t+\varDelta }_{\textsf {obj}}\) heading at time \(t+\varDelta \),

  • \({Assum}^{t}_\textsf {obj}\) the assumptions at time t, \({\mathrm {Assum}}_\textsf {obj}^{t} = \{(\square \varphi \mathrm {\,at\,} t)\mid (\square \varphi \mathrm {\,at\,} t) \in {\mathrm {Assum}}(T^{t+\varDelta }_\textsf {obj}) \}\),

  • \(\varPhi _{\textsf {obj}}\) and \(\varPhi _{i}\) two sets of statements,

  • \(\mathscr {K}(\varPhi ^t_{\textsf {obj}})\) the set of environment states at time t,

  • \(\mathscr {K}(\varPhi ^{t+\varDelta }_{\textsf {obj}})\) the set of environment states at time \(t+\varDelta \),

  • \(\mathscr {K}(\varPhi ^{t}_{i})\) the set of statement attitudes of agent i at time t,

  • \(s_t = \mathrm {K}^t_{\textsf {obj}}\) an environment state at time t, \(\mathrm {K}^t_{\textsf {obj}}\in \mathscr {K}(\varPhi ^t_{\textsf {obj}}) \),

  • \({s_{t+\varDelta } = \mathrm {K}^{t+\varDelta }_{\textsf {obj}}}\) an environment state at time \(t+\varDelta \), \({\mathrm {K}^{t+\varDelta }_{\textsf {obj}}\in \mathscr {K}(\varPhi ^{t+\varDelta }_{\textsf {obj}})}\),

  • \(\mathrm {K}^t_{i}\) an attitude of agent i at time t, \(\mathrm {K}^t_{i}\in \mathscr {K}(\varPhi _i^t) \).

The argument-based MDP transition function through \(\langle G^{t+\varDelta }_{{\textsf {obj}}}, \mathscr {S}, \langle \varOmega _{{\textsf {obj}}}, F_{{\textsf {obj}}}, P_{{\textsf {obj}}} \rangle \rangle \) is a probability function \(P: \mathscr {K}(\varPhi ^t_\textsf {obj}) \times \mathscr {K}(\varPhi ^{t+\varDelta }_\textsf {obj}) \times {\mathscr {K}(\varPhi ^{t}_{i})} \rightarrow [0,1] \) such that:
$$\begin{aligned} P(s_{t+\varDelta } \mid s_{t}, \mathrm {K}^t_{i}) = P_{\textsf {obj}}( \mathbf {k}^{t+\varDelta }_{\textsf {obj}} \mid \underset{ A \in {\small {\textsf {ON}}}^t }{\bigcup \, \{L_{A} = {\small {\textsf {ON}}}\}}\,\,\, \underset{A \in {\small {\textsf {OFF}}}^t }{\bigcup \, \{L_{A} = {\small {\textsf {OFF}}}\} }) \end{aligned}$$
where\(^{3}\) \({\small {\textsf {ON}}}^t = assumpArg( {Assum}^{t}\cap \textsf {In} )\) and \({\small {\textsf {OFF}}}^t = assumpArg( {Assum}^{t}\backslash \textsf {In})\), with \(\textsf {In} = \textsf {in}(\mathrm {K}^t_{\textsf {obj}}) \cup \textsf {in}(\mathrm {K}^t_{ i})\).
An argument-based MDP transition through \(\smash {\langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle }\) is a transition from the state \(\smash {s_t = \mathrm {K}^t_{\textsf {obj}}}\) to a state \(\smash {s_{t+\varDelta } =\mathrm {K}^{t+\varDelta }_{\textsf {obj}}}\), abbreviated \(\mathrm {K}^{t}_{\textsf {obj}}, \mathrm {K}^t_{i}, \langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \rightarrow \mathrm {K}^{t+\varDelta }_{\textsf {obj}}\), such that
$$\begin{aligned} s_{t+\varDelta } \sim P(s_{t+\varDelta } \mid s_{t}, \mathrm {K}^t_{i}). \end{aligned}$$

In words, the transition probability of moving from a state \(s_t\) to \(s_{t+\varDelta }\) on the basis of the attitude \(\mathrm {K}^t_{ i}\) (i.e. \(P(s_{t+\varDelta } \mid s_t, \mathrm {K}^t_{i})\)) is the probability distribution \(P_\textsf {obj}\) over the set \(\smash {\mathbf {L}_\textsf {obj}^{t+\varDelta }}\) of random labellings of arguments in \(\smash {G^{t+\varDelta }_\textsf {obj}}\), conditioned to the labellings \(\mathrm {K}^t_{\textsf {obj}}\) and \(\mathrm {K}^t_{i}\): the random labelling of any assumptive argument of any statement in \(\textsf {in}(\mathrm {K}^t_{\textsf {obj}})\) or \(\textsf {in}(\mathrm {K}^t_{ i})\) is assigned the value \({\small {\textsf {ON}}}\), while the other assumptive arguments are assigned the value \({\small {\textsf {OFF}}}\).

An argument-based MDP transition is time-dependent, in the sense that timestamps are included in the representation of the states (since a temporal modal literal includes a timestamp). However, in the definition of a basic MDP, the representation of a state is not meant to include any timestamp (though a state can be associated with a timestamp). To address this incongruence, we can conceive an atemporalised argument-based MDP transition where states are atemporalised.

Definition 6.15

(Atemporalised argument-based MDP transition) Let
  • \(Tr = \mathrm {K}^{t}_{\textsf {obj}}, \mathrm {K}^t_{i}, \langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \rightarrow \mathrm {K}^{t+\varDelta }_{\textsf {obj}}\) denote an argument-based MDP transition,

  • \(\mathrm {K}_{\textsf {obj}}\) the atemporalised state of \(\mathrm {K}^{t}_{\textsf {obj}}\),

  • \(\mathrm {K}'_{\textsf {obj}}\) the atemporalised state of \(\mathrm {K}^{t+\varDelta }_{\textsf {obj}}\),

  • \(\mathrm {K}_{i}\) the atemporalised state of \(\mathrm {K}^{t}_{i}\).

The atemporalised argument-based MDP transition of Tr is a transition from the atemporalised state \(\smash {s_t = \mathrm {K}_{\textsf {obj}}}\) to an atemporalised state \(\smash {s_{t+\varDelta } =\mathrm {K}'_{\textsf {obj}}}\), abbreviated \(\mathrm {K}_{\textsf {obj}}, \mathrm {K}_{i}, \langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \rightarrow \mathrm {K}'_{\textsf {obj}}\), such that:
$$\begin{aligned} s_{t+\varDelta } \sim P(s_{t+\varDelta } \mid s_{t}, \mathrm {K}_{i}) \end{aligned}$$
where \(P(s_{t+\varDelta } \mid s_{t}, \mathrm {K}_{i}) = P(\mathrm {K}^{t+\varDelta }_{\textsf {obj}} \mid \mathrm {K}^t_{\textsf {obj}}, \mathrm {K}^t_{i})\) is the transition probability.

In the definition of an argument-based MDP transition, the probability distribution \(P(s_{t+\varDelta } \mid s_{t}, \mathrm {K}^t_{i})\) is conditioned to the attitudes of the agent instead of its actions. By doing so, we can build utility functions taking into account mental statements such as particular beliefs and desires. For example, and as we illustrate later, an agent may get an extra ‘self-reward’ if its desires are satisfied.

Definition 6.16

(Argument-based reward) Given a state \(s_{t+\varDelta } = \mathrm {K}^{t+\varDelta }_\textsf {obj}\), the immediate argument-based reward \(r_{t}\) at time t moving to state \(s_{t+\varDelta }\) is the sum of the utilities of the sanction statements labelled \(\textsf {in}\) within \(s_{t+\varDelta }\):
$$\begin{aligned} r_{t} = \sum _{ ({\textsf {Hold}_i \mathrm{util}(u, {\alpha }) \mathrm {\,at\,} t}) \in \textsf {in}(\mathrm {K}^{t+\varDelta }_\textsf {obj}) } u. \end{aligned}$$
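This definition translates directly into code; the sketch below assumes that each 'in'-labelled statement of the new state carries an optional utility value (the representation is ours):

def argument_based_reward(in_statements):
    """Sum the utilities of the sanction statements labelled 'in' in s_{t+Delta}.

    `in_statements` maps each in-labelled statement to its utility, or to None
    if the statement carries no utility."""
    return sum(u for u in in_statements.values() if u is not None)

state_in = {"Hold_obj accident at 1": None,
            "Hold_i util(2, a1) at 1": 2,
            "Hold_i util(-12, a2) at 1": -12}
print(argument_based_reward(state_in))  # -10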

Now that we have defined argument-based MDP transitions and argument-based rewards, we are prepared to propose a definition of argument-based MDPs, which echoes standard MDPs (Definition 3.1), so that an MDP can be specified with our PA framework in terms of a sequence \(\mathfrak {F}_t\) of probabilistic labelling frames (which is possibly specified by a template probabilistic argumentation frame) and thus associated states \(S = \mathscr {K}(\mathfrak {F}_t)\) (see Eq.  6.17).

Definition 6.17

(Argument-based MDP) Let \(\mathfrak {F}_t = (\smash {\langle G^{t}_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle })_t\) denote a sequence of environment probabilistic labelling frames heading at time t. An argument-based MDP specified by \(\mathfrak {F}_t\) is a tuple \(\langle S,A,P,R \rangle \), where
  • S is the set of atemporalised states of the sequence \(\mathfrak {F}_t\), i.e. \(S = \mathscr {K}(\mathfrak {F}_t)\);

  • A is the set of atemporalised attitudes;

  • \(P(s_{t+\varDelta } \mid s_{t}, \mathrm {K}_{i})\) is the transition probability of the atemporalised argument-based MDP transition from the atemporalised state \(s_t\) to \(s_{t+\varDelta }\) by adopting attitude \(\mathrm {K}_{i} \in A\);

  • \(R(s_{t+\varDelta } \mid s_{t}, \mathrm {K}_{i})\) is the immediate argument-based reward \(r_t\) received when attitude \(\mathrm {K}_{i}\) is adopted in state \(s_t\), moving to state \(s_{t + \varDelta }\) .

When running an argument-based MDP, e.g. in simulations, it is worth noting that, given an environment state \(\mathrm {K}^{t}_{\textsf {obj}}\) and an agent attitude \(\mathrm {K}^t_{i}\), an argument-based MDP transition \(\smash {\mathrm {K}^{t}_{\textsf {obj}}, \mathrm {K}^t_{i}, \langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \rightarrow \mathrm {K}^{t+\varDelta }_{\textsf {obj}}}\) can be computed in a time that is polynomial in the number of arguments of G.

Theorem 6.1

(Time complexity of an argument-based MDP transition) Let \(\smash {\mathrm {K}^{t}_{\textsf {obj}}, \mathrm {K}^t_{i},} \langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \smash {\rightarrow \mathrm {K}^{t+\varDelta }_{\textsf {obj}}}\) denote an argument-based MDP transition, where \(G^{t+\varDelta }_{\textsf {obj}}\) is constructed from an environment probabilistic defeasible theory. Given the environment state \(\smash {\mathrm {K}^{t}_{\textsf {obj}}}\) and the agent attitude \(\mathrm {K}^t_{i}\), the time complexity of the argument-based MDP transition is \(\smash {O(|\mathscr {A}_G|^c)}\).

Proof

The proof is similar to the proof of Theorem 5.2. To draw a state, we proceed in three steps. In the first step, we draw a \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of \(\smash {G^{t+\varDelta }_{\textsf {obj}}}\) built from the environment probabilistic theory \(\langle Rul, \textit{Conflict}, \succ \rangle \); this step is \(O(|Rul|\cdot |\mathscr {A}_G|^2)\), see Lemma 5.1. In the second step, we compute the grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling \(\mathrm {L}^{t+\varDelta }\) of the \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of \(\smash {G^{t+\varDelta }_{\textsf {obj}}}\); this step is \(O(|\mathscr {A}_G|^c)\), see Lemma 4.1. In the third step, we compute the labelling \(\smash {\mathrm {K}^{t+\varDelta }_{\textsf {obj}}}\) of a set \(\varPhi \) of statements from the labelling \(\mathrm {L}^{t+\varDelta }\), i.e. \(\mathrm {K}(\mathrm {L}^{t+\varDelta }, \varPhi ) = \mathrm {K}^{t+\varDelta }_{\textsf {obj}} \); this step is \(\smash {O(|\varPhi | \times |{\small {\textsf {IN}}}(\mathrm {L}^{t+\varDelta })|)}\), see Lemma 4.2. Therefore, the time complexity of an argument-based transition is \(\smash {O(|Rul|\cdot |\mathscr {A}_G|^2 + |\mathscr {A}_G|^c)}\). \(\square \)

To navigate in an argument-based MDP, we consider a simple argument-based RL agent: first, the agent observes the state \(s_t\) at time t, then, using an argument-based policy, it draws an attitude eventually leading to an MDP action \(a_t\).

Definition 6.18

(Argument-based policy) Let
  • \(s_t = \mathrm {K}^t_{\textsf {obj}}\) denote the environment state at time t,

  • \(\langle G^{t}_i, \mathscr {S}, \langle \varOmega _i, F_i, P_i \rangle \rangle \) the agent i’s probabilistic labelling frame built from the agent theory \(T_i^{t}\) at time t,

  • \(O^t_i\) the agent i’s set of observable statements at time t.

The argument-based policy \(\pi \) of agent i at time t is a mapping from states to probability distributions over attitudes:
$$\begin{aligned} \pi ( \mathbf {L}_i^{t} \mid \mathbf {k}^t_{\textsf {obj}} ) = P_i( \mathbf {L}_i^{t} \mid \underset{A \in \textsf {ON}^t }{\bigcup \, \{L_A = {\small {\textsf {ON}}}\}}\quad \underset{A \in \textsf {OFF}^t }{\bigcup \, \{L_A = {\small {\textsf {OFF}}}\} }) \end{aligned}$$
where \({\small {\textsf {ON}}}^t = assumpArg( O^t_i \cap \textsf {in}(\mathrm {K}^t_{\textsf {obj}}) )\), and \({\small {\textsf {OFF}}}^t = assumpArg( O^t_i \backslash \textsf {in}(\mathrm {K}^t_{\textsf {obj}}) )\).
According to Eq. 5.6 specifying the exponential forms of factors of a Gibbs distribution parametrised by a set of factors \(\varPhi \), we retrieve a softmax policy commonly used in RL (cf. Eq. 3.8):
$$\begin{aligned} \pi ( \mathbf {l}_i^{t} \mid \mathbf {k}^t_{\textsf {obj}} ) = \frac{1}{\mathscr {Z}_{\varPhi }} e^{ \frac{ Q\left( \mathbf {l}^t_i \mid \mathbf {k}^t_{\textsf {obj}} \right) }{\tau } } \end{aligned}$$
(6.38)
such that
$$\begin{aligned} \frac{ Q(\mathbf {l}^t_i \mid \mathbf {k}^t_{\textsf {obj}} ) }{\tau } = - E( \mathbf {l}^t_i \mid \mathbf {k}^t_{\textsf {obj}} ). \end{aligned}$$
(6.39)
In this regard, the above equations suggest that energy-based PA is a convenient framework to host argument-based RL.
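The correspondence can be checked with a few lines of code; in the sketch below the attitudes are mere identifiers and the energy values are placeholders, not values from the paper:

import math
import random

def softmax_policy(energies):
    """pi(l | k) proportional to exp(-E(l | k)); the temperature is already
    absorbed into the energy, since Q/tau = -E (Eqs. 6.38-6.39)."""
    weights = {l: math.exp(-e) for l, e in energies.items()}
    z = sum(weights.values())
    return {l: w / z for l, w in weights.items()}

# Energies of the legal attitudes compatible with the observed safe state
# (placeholder values: a lower energy yields a higher probability).
energies = {"attitude_care": 1.0, "attitude_neglect": 2.0, "attitude_inhibited": 5.0}
pi = softmax_policy(energies)
print({l: round(p, 3) for l, p in pi.items()})

# Argument-based deliberation: draw an attitude from the policy.
attitude = random.choices(list(pi), weights=list(pi.values()), k=1)[0]
print(attitude)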

If an agent at time t observes a state \(s_t\), and draws an attitude from an argument-based policy eventually leading to an action \(a_t\), then this agent makes an argument-based deliberation.

Definition 6.19

(Argument-based deliberation) Let \(\smash {s_t = \mathrm {K}^t_{\textsf {obj}}}\) denote an environment state at time t. An argument-based deliberation is the draw of an attitude \(\mathrm {L}^{t}_i\) from the argument-based policy \(\pi ( \mathbf {L}_i^{t} \mid s_t )\) such that \(\mathbf {l}^{t}_i \sim \pi ( \mathbf {L}_i^{t} \mid \mathbf {k}^t_{\textsf {obj}}).\)

Concerning the computational complexity of an argument-based deliberation, since such a deliberation is a draw amongst the legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings of an argumentation graph \(G^{t}_i\) and since we have to keep in memory the energy associated with every labelling, the space complexity may not be polynomial in the number of arguments. To address such complexity, compact representations may be considered, but we leave such considerations for future work. We may also discard some labellings of the sample space by associating them with an infinite energy. In particular, the \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling \(\mathrm {L}^{t}_i\) of an attitude may be set with an infinite energy (so that it has probability 0 of being drawn) when it corresponds to an ‘inconvenient’ grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling counterpart (denoted \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-\(\mathrm {L}^{t}_i\)). Accordingly, we may compute the statuses of some arguments of the \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of an attitude \(\mathrm {L}^{t}_i\) with respect to its grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling counterpart to check whether such an attitude is reinforceable. This option is further investigated in Sect. 7.

From the attitude, seen as a \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling \(\mathrm {L}^{t}_i\) of arguments, we derive the grounded \(\{\textsf {in}, \textsf {no}\}\)-labelling \(\mathrm {K}^{t}_{i}\) of statements, which is the agent's statement attitude at t; from this attitude we obtain a behaviour \(\mathrm {B}^{t}_{i}\), and thus the actions \(\mathrm {A}^{t}_{i}\) performed at t (see Definition 6.7).

Once the argument-based reward is computed, the agent reinforces the attitude which led to this reward. In traditional Q-learning-style algorithms such as SARSA, the Q-value of the pair \((s_t, a_t)\) is updated. In our setting, the traditional Q-value \(Q(s_t, a_t)\) is replaced by the energy of the reinforceable labelling assignment \(\mathbf {l}^t_{\otimes }\) within the assignment \(\mathbf {l}_i^t\) corresponding to the attitude \(\mathrm {L}_i^t\), i.e. \(\mathbf {l}^t_{\otimes }=\mathbf {l}_i^t(\mathbf {L}^t)\) (see footnote 4), where \(\mathbf {L}^t\) is the set of random labellings scoped by the reinforceable factor \(\otimes (\mathbf {L}^t)\) (which is unique, see Definition 6.13).

Definition 6.20

(Reinforcement of attitudes) Let \(\mathrm {L}^{t}_i\) denote an attitude drawn from an argument-based deliberation, and \(r_{t}\) the argument-based reward at time t. The reinforcement of the reinforceable assignment \(\mathbf {l}^t_{\otimes }=\mathbf {l}_i^t(\mathbf {L}^t)\) is an assignment of \(Q(\mathbf {l}^t_{\otimes })\) such that
$$\begin{aligned} Q(\mathbf {l}^t_{\otimes }) \leftarrow (1- \alpha )\cdot Q(\mathbf {l}^t_{\otimes }) + \alpha \cdot ( r_t + \gamma \cdot Q(\mathbf {l}^{t+1}_{\otimes }) ). \end{aligned}$$
Based on the above definitions, we are now prepared to adapt the original SARSA Algorithm  1 into an ‘argument-based SARSA’ RL algorithm, whose pseudocode is given in Algorithm 5.
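The following sketch shows how the update of Definition 6.20 could be coded, keeping the Q-values of reinforceable labelling assignments in a dictionary; it is a minimal illustration rather than the paper's Algorithm 5, and the function and variable names are our own.

```python
def reinforce_attitude(q_values, l_reinf_t, reward_t, l_reinf_next,
                       alpha=0.1, gamma=0.9):
    """SARSA-style reinforcement of the reinforceable assignment (Definition 6.20):
    Q(l_t) <- (1 - alpha) * Q(l_t) + alpha * (r_t + gamma * Q(l_{t+1})).

    q_values: dict mapping reinforceable labelling assignments to Q-values
    l_reinf_t, l_reinf_next: assignments drawn at times t and t+1
    reward_t: the argument-based reward r_t
    """
    q_t = q_values.get(l_reinf_t, 0.0)
    q_next = q_values.get(l_reinf_next, 0.0)
    q_values[l_reinf_t] = (1 - alpha) * q_t + alpha * (reward_t + gamma * q_next)
    return q_values

# Example: reinforcing the assignment drawn at t given the one drawn at t+1.
q = {("SC:ON", "SC:IN"): 0.0, ("SN:ON", "SN:IN"): 0.0}
print(reinforce_attitude(q, ("SC:ON", "SC:IN"), reward_t=1.0,
                         l_reinf_next=("SN:ON", "SN:IN")))
```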

As an agent’s attitude is a labelling of arguments or ‘mental’ statements, we call this approach the argument-based mentalistic approach to RL. Since the approach is based on a probabilistic energy-based argumentation framework, we may characterise an argument-based RL agent from a logic-based perspective or an energy-based perspective, as we will see in the next section.

7 Agent characterisation

In the previous section, we proposed to model and animate RL agents based on a PA framework. In this section, we take advantage of the PA framework to characterise agent profiles from a logic-based perspective (Sect. 7.1), and from a probabilistic perspective (Sect.  7.2), before illustrating such characterisations (Sect. 7.3).

7.1 Logic-based characterisation

When investigating human and social agency, it is common to study how certain cognitive or dispositional profiles influence agents’ behaviour. In experimental economics, for example, this is almost standard whenever scholars analyse cognitive profiles such as risk-aversion—which is a type of disposition—using agent-based simulations (cf., e.g. [14]), or when sundry cognitive emotions, states, or conceptions of fairness affect agents’ choices (cf., e.g. [26]).

As to the investigations of models for computer systems, a cognitive characterisation of agents’ profiles may be used to model artificial agents with a specific character affecting human-agent interactions, with applications for example in socio-technical systems, gaming industry or for serious games. In socio-technical systems for instance, a designer may prefer to focus on regimented agents compliant with the governing norms, while a game designer may seek to confront a player with an agent whose desires always override obligations.

Logic-based characterisations of agents’ profiles are related in the literature to the idea of agent type. This idea, from a qualitative perspective, has been introduced and extensively studied in [13, 27]. In those works, agent types were proposed first of all to resolve conflict between mental statements, such as desires and obligations. In other words, agent types were characterised by stating conflict resolution types in terms of orders of overruling between rules. For example, an agent is realistic when rules for beliefs override all other components; she is social when obligations are stronger than the other motivational components with the exception of beliefs, etc.

If we focus on agent types as conflict-resolution methods with respect to normative matters, we may state at design time that any agents’ deliberation is governed in such a way that obligations always prevail over, e.g., conflicting desires or actions, this being done with the purpose of exploring how the other components of agents evolve over time. Since agents’ behaviour dynamically depends on what mental states and conclusions are obtained from defeasible theories, enforcing obligations means imposing normative compliance by design, as for regimented agents.

However, one may rather be interested in exploring other strategies for implementing compliance. The regimentation of agents with complete information makes RL mechanisms useless for enforcing norms, but if the agents are not regimented then RL can be used to obtain compliance. Indeed, RL can be implemented as a dynamic model for enforcing norms, because there are good reasons to consider primarily the enforcement strategy where normative violations are possible but mechanisms are implemented “for enforcing obligations [...] by means of sanctions, that is, a treatment of the actions to be performed when a violation occurs, in order to deter agents from misbehaving” [22, sec. 1]. In this perspective, the idea of agent type (if applied to the relation between obligations and motivational attitudes) is no longer useful for imposing compliance by design, but it can be employed for interpreting and classifying the agents behaviour resulting from social interactions.

When modelling norm compliance, and depending on the application, we may enforce obligations at design time to obtain regimented agents, or we prefer to enforce obligations at run-time with RL agents. Hence, our purpose is not only to state logic-based criteria for conflict resolution, but also to use a logic-based characterisation to classify the attitudes of an agent. So, we identify two ideas of agent types:
  • Static agent types: As proposed in [13, 27], the notion of agent type is applied by design to agents and it is used to solve conflicts using preferences between arguments. For example, if obligations prevail over desires, then the agent at stake is social by design, i.e. it has an innate inclination to comply with obligations.

  • Dynamic agent types: This notion of agent type is applied to the dynamics of attitudes. It is based on the idea that an agent can acquire new cognitive profiles. For example, if obligations are accepted while desires are discarded for some time and the agent is not social by design, then the agent is nevertheless social at that time.

Static agent types can be intuitively characterised as follows. First of all, as done in [27], we establish what statements are in conflict. This allows for defining which types of rules are in conflict and which are the stronger ones. Suppose, for example, the following rules:
$$\begin{aligned}&r: \ldots \Rightarrow \textsf {Hold}_i \varphi \mathrm {\,at\,} {t} \\&s: \ldots \Rightarrow \textsf {Obl}_i \lnot \varphi \mathrm {\,at\,} {t} \\&t: \ldots \Rightarrow \textsf {Des}_i \varphi \mathrm {\,at\,} {t} \end{aligned}$$
If we only have \(\textit{Conflict} (\textsf {Hold}_i \varphi \mathrm {\,at\,} {t}, \textsf {Obl}_i \lnot \varphi \mathrm {\,at\,} {t} )\), then rule r is in conflict with rule s. If we state that all \(\textsf {Hold}\)-rules are stronger than conflicting \(\textsf {Obl}\)-rules, then r is stronger than s, and thus, if applicable, r will defeat s. Suppose now we drop r. There is no conflict between obligations and desires, and therefore none between rules s and t. This means that there is no incompatibility relation between \(\textsf {Des}_i\) and \(\textsf {Obl}_i\), and we are free to derive both \(\textsf {Des}_i \varphi \) and \(\textsf {Obl}_i \lnot \varphi \). These intuitions can be captured by the following definitions.

Firstly, we identify conflicts amongst modalities from conflicts amongst statements.

Definition 7.1

(Conflict relation over modalities) Let \(Mod \) denote a set of modalities. \(\square \text{- }Conflict \) is a binary relation over \(Mod \), i.e. \( \square \text{- }Conflict \subseteq Mod \times Mod \), such that a modality \(\square _1\) conflicts with \(\square _2\), i.e. \( \square \text{- }Conflict (\square _1, \square _2)\), if, and only if, for all literals \(\varphi _1\) and \(\varphi _2\) in conflict, the statements \((\square _1 \varphi _1 \mathrm {\,at\,} t)\) and \((\square _2 \varphi _2 \mathrm {\,at\,} t)\) conflict, i.e. if, and only if, for all conflicting \(\varphi _1, \varphi _2\), it holds that \(\textit{Conflict} ((\square _1 \varphi _1 \mathrm {\,at\,} t), (\square _2 \varphi _2 \mathrm {\,at\,} t))\) and \(\textit{Conflict} ((\square _2 \varphi _2 \mathrm {\,at\,} t), (\square _1 \varphi _1 \mathrm {\,at\,} t))\).

Secondly, from conflicts between modalities, we define conditions according to which a modality overrides another modality.

Definition 7.2

(Override relation over modalities) Let \(\langle Rul, \textit{Conflict}, \succ \rangle \) denote a ground agent theory. A modality \(\square _1\) overrides another modality \(\square _2\), denoted \({\square _1}\succ _S{\square _2}\) if, and only if, for all rules \(r\in Rul[\square _1\varphi _1\mathrm {\,at\,} t]\) and all \(s\in Rul[\square _2\varphi _2\mathrm {\,at\,} t]\)
  • the modalities \(\Box _1\) and \(\Box _2\) conflict, i.e. \(\square \text{- }Conflict (\Box _1, \Box _2)\), and

  • the rule r is superior to rule s, i.e. \(r\succ s\).
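As an illustration of Definitions 7.1 and 7.2, the sketch below checks, under simplifying assumptions (time indices dropped, statements represented as (modality, literal) pairs, hypothetical function names), whether two modalities conflict and whether one overrides the other given a rule base and a priority relation.

```python
from itertools import product

def modalities_conflict(conflict, mod1, mod2, conflicting_literals):
    """square-Conflict(mod1, mod2) (Definition 7.1): for every pair of conflicting
    literals, the corresponding statements conflict in both directions."""
    return all(conflict((mod1, p1), (mod2, p2)) and conflict((mod2, p2), (mod1, p1))
               for p1, p2 in conflicting_literals)

def overrides(rules, superior, conflict, mod1, mod2, conflicting_literals):
    """mod1 >_S mod2 (Definition 7.2): the modalities conflict and every rule whose
    head has modality mod1 is superior to every rule whose head has modality mod2.

    rules: dict mapping rule names to head statements (modality, literal)
    superior: predicate superior(r, s) encoding the rule priority r > s
    """
    if not modalities_conflict(conflict, mod1, mod2, conflicting_literals):
        return False
    rules1 = [r for r, (m, _) in rules.items() if m == mod1]
    rules2 = [s for s, (m, _) in rules.items() if m == mod2]
    return all(superior(r, s) for r, s in product(rules1, rules2))

# Toy example: obligations conflict with, and override, desires on the pair (p, ~p).
rules = {"r": ("Obl", "p"), "s": ("Des", "~p")}
conflict = lambda s1, s2: {s1, s2} == {("Obl", "p"), ("Des", "~p")}
superior = lambda r, s: (r, s) == ("r", "s")
print(overrides(rules, superior, conflict, "Obl", "Des", [("p", "~p")]))  # True
```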

Thirdly, from the conflict and override relations amongst modalities, we define static agent types. Static agent types ensure by design that agents solve conflicts between mental statements in a specific way. Definition 7.3 lists the possible types, grouped by pair of modalities: the first group concerns the pair \((\textsf {Obl}_i, \textsf {Des}_i)\), the second the pair \((\textsf {Obl}_i, \textsf {Do}_i)\), and so on. Each entry indicates whether the two modalities conflict, how the conflict is solved by design (if at all), and a possible name for the corresponding agent type.

Definition 7.3

(Static agent type) Given an agent theory \(\langle Rul, \textit{Conflict}, \succ \rangle \) describing an agent, this agent is of (static) type X if, and only if, the conditions for agent type X hold as established below, where ‘S’ means ‘static’.

Pair \(\textsf {Obl}_i, \textsf {Des}_i\):

  • no conflict: S-motivational-deontic independent

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Des}_i)\) with \({\textsf {Obl}_i}\succ _S{\textsf {Des}_i}\): S-motivational-deontic compliant

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Des}_i)\) with no override: S-motivational-deontic aporetic

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Des}_i)\) with \({\textsf {Des}_i}\succ _S{\textsf {Obl}_i}\): S-motivational-deontic deviant

Pair \(\textsf {Obl}_i, \textsf {Do}_i\):

  • no conflict: S-practical-deontic independent

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Do}_i)\) with \({\textsf {Obl}_i}\succ _S{\textsf {Do}_i}\): S-practical-deontic compliant

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Do}_i)\) with no override: S-practical-deontic aporetic

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Do}_i)\) with \({\textsf {Do}_i}\succ _S{\textsf {Obl}_i}\): S-practical-deontic deviant

Pair \(\textsf {Obl}_i, \textsf {Hold}_i\):

  • no conflict: S-epistemic-deontic independent

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Hold}_i)\) with \({\textsf {Obl}_i}\succ _S{\textsf {Hold}_i}\): S-epistemic-deontic compliant

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Hold}_i)\) with no override: S-epistemic-deontic aporetic

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Hold}_i)\) with \({\textsf {Hold}_i}\succ _S{\textsf {Obl}_i}\): S-epistemic-deontic deviant

Pair \(\textsf {Des}_i, \textsf {Do}_i\):

  • no conflict: S-practical-motivational independent

  • \(\square \text{- }Conflict (\textsf {Des}_i, \textsf {Do}_i)\) with \({\textsf {Des}_i}\succ _S{\textsf {Do}_i}\): S-practical-motivational compliant

  • \(\square \text{- }Conflict (\textsf {Des}_i, \textsf {Do}_i)\) with no override: S-practical-motivational aporetic

  • \(\square \text{- }Conflict (\textsf {Des}_i, \textsf {Do}_i)\) with \({\textsf {Do}_i}\succ _S{\textsf {Des}_i}\): S-practical-motivational deviant

Pair \(\textsf {Des}_i, \textsf {Hold}_i\):

  • no conflict: S-epistemic-motivational independent

  • \(\square \text{- }Conflict (\textsf {Des}_i, \textsf {Hold}_i)\) with \({\textsf {Des}_i}\succ _S{\textsf {Hold}_i}\): S-epistemic-motivational compliant

  • \(\square \text{- }Conflict (\textsf {Des}_i, \textsf {Hold}_i)\) with no override: S-epistemic-motivational aporetic

  • \(\square \text{- }Conflict (\textsf {Des}_i, \textsf {Hold}_i)\) with \({\textsf {Hold}_i}\succ _S{\textsf {Des}_i}\): S-epistemic-motivational deviant

Pair \(\textsf {Do}_i, \textsf {Hold}_i\):

  • no conflict: S-epistemic-practical independent

  • \(\square \text{- }Conflict (\textsf {Do}_i, \textsf {Hold}_i)\) with \({\textsf {Do}_i}\succ _S{\textsf {Hold}_i}\): S-epistemic-practical compliant

  • \(\square \text{- }Conflict (\textsf {Do}_i, \textsf {Hold}_i)\) with no override: S-epistemic-practical aporetic

  • \(\square \text{- }Conflict (\textsf {Do}_i, \textsf {Hold}_i)\) with \({\textsf {Hold}_i}\succ _S{\textsf {Do}_i}\): S-epistemic-practical deviant

Let us provide some brief comments. S-X-Y independent agents are free to adopt any mental statements, since there are no conflicts. S-X-deontic compliant agents are social: normative compliance is obtained by design, meaning that obligations always override conflicting desires, actions and even beliefs. At the opposite end are S-X-deontic deviant agents, which must be distinguished from the S-X-deontic aporetic ones, for which we simply do not solve conflicts between obligations and the other modalities. In the case of S-epistemic-Y compliant agents, beliefs are defeated by the other mental modalities, and we thus have classical examples of wishful (deontic or motivational) thinking.
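For concreteness, a possible reading of Definition 7.3 as a classification procedure is sketched below; the convention that the first modality of a pair overriding the second yields the 'compliant' type (and the converse the 'deviant' type) follows the listing above, while the function name and inputs are our own.

```python
def static_type(pair_name, has_conflict, override):
    """Classify the static agent type for one pair of modalities (Definition 7.3).

    pair_name: e.g. "motivational-deontic" for the pair (Obl_i, Des_i)
    has_conflict: True iff square-Conflict holds for the pair
    override: "first", "second" or None, saying which modality overrides the other
              (e.g. "first" for Obl_i >_S Des_i in the pair (Obl_i, Des_i))
    """
    if not has_conflict:
        return f"S-{pair_name} independent"
    if override == "first":
        return f"S-{pair_name} compliant"
    if override == "second":
        return f"S-{pair_name} deviant"
    return f"S-{pair_name} aporetic"  # conflict declared but not resolved by design

# Example: obligations conflict with desires and override them.
print(static_type("motivational-deontic", has_conflict=True, override="first"))
# -> "S-motivational-deontic compliant"
```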

The value of agent types depends on how they are employed, but it also depends on the computational setting where they are used. In argument-based MDPs, it is important to note that an agent observes a state at time t, has desires and performs an action at the same time t. We have constrained the logic framework in this regard. For example, an agent may behave with respect to practical arguments using rules of the form “\(\textsf {Hold}_i \varphi _1 \mathrm {\,at\,} t \Rightarrow \textsf {Des}_i \varphi _2\mathrm {\,at\,} t\)” or “\(\textsf {Hold}_i \varphi _1 \mathrm {\,at\,} t \Rightarrow \textsf {Do}_i \varphi _2 \mathrm {\,at\,} t\)”, where \(\varphi _1\) and \(\varphi _2\) conflict. Such arguments would typically be considered by an agent to change its world, for example to go from a dangerous state to a safe state. If there is a conflict between \(\textsf {Hold}_i\) and \(\textsf {Des}_i\), or between \(\textsf {Hold}_i\) and \(\textsf {Do}_i\), then arguments built from the above rules self-attack. Consequently, the agent may not be able to have a desire \((\textsf {Des}_i \varphi _2 \mathrm {\,at\,} t)\) or attempt the action \((\textsf {Do}_i \varphi _2 \mathrm {\,at\,} t)\), and hence the agent may be impeded from moving to some states. Conflicts between the modalities \(\textsf {Hold}_i\) and \(\textsf {Des}_i\), or between \(\textsf {Hold}_i\) and \(\textsf {Do}_i\), should thus be avoided to obtain meaningful argument-based MDPs. This discussion emphasises that the operational interpretation of the notion of conflict between modalities depends greatly on the computational setting.

Static agent types ensure by design that agents solve conflicts between mental statements in a specific way. Dynamic agent types, on the contrary, do not ensure any conflict-resolution mechanism by design; they simply provide a way of interpreting an agent's behaviour. Let us consider social agency and examine a scenario with the following rules:
$$\begin{aligned}&\textsf {r}: \ldots \Rightarrow \textsf {Hold}_i \mathrm {trustworthy\_state} \mathrm {\,at\,} {t} \end{aligned}$$
(7.1)
$$\begin{aligned}&\textsf {s}: \textsf {Hold}_i \lnot \mathrm {trustworthy\_state} \mathrm {\,at\,} {t} \Rightarrow \textsf {Des}_i \lnot \mathrm {pay\_taxes} \mathrm {\,at\,} {t} \end{aligned}$$
(7.2)
$$\begin{aligned}&\textsf {t}: \ldots \Rightarrow \textsf {Obl}_i \mathrm {pay\_taxes}\mathrm {\,at\,} {t} \end{aligned}$$
(7.3)
In such a scenario, compliance can be obtained by a mental mechanism that has not generated a desire conflicting with a derived obligation: indeed, no conflict between obligation and desire is specified. Nevertheless, we cannot exclude this as a potential case of compliance, as the agent's deliberation can support a compliant attitude. Hence an agent may be compliant at the level of statement labelling even if the obligation modality does not override other mental modalities.

Definition 7.4

(Dynamic agent type) Given a ground agent theory \(\langle Rul, \textit{Conflict}, \succ \rangle \) describing an agent, the agent is of (dynamic) type X at time t if, and only if, the relation \(\textit{Conflict} \) and the agent's attitude at time t satisfy the conditions established below, where \(\varphi _1\) and \(\varphi _2\) conflict with each other and ‘D’ means ‘dynamic’.

Pair \(\textsf {Obl}_i, \textsf {Des}_i\):

  • no conflict, \(K_{\textsf {Obl}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Des}_i \varphi _2 } = \textsf {in}\): D-motivational-deontic independent

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Des}_i)\), \(K_{\textsf {Obl}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Des}_i \varphi _2 } = \textsf {no}\): D-motivational-deontic compliant

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Des}_i)\), \(K_{\textsf {Obl}_i\varphi _1 } = \textsf {no}\), \(K_{\textsf {Des}_i \varphi _2 } = \textsf {in}\): D-motivational-deontic deviant

Pair \(\textsf {Obl}_i, \textsf {Do}_i\):

  • no conflict, \(K_{\textsf {Obl}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Do}_i\varphi _2 } = \textsf {in}\): D-practical-deontic independent

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Do}_i)\), \(K_{\textsf {Obl}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Do}_i\varphi _2 } = \textsf {no}\): D-practical-deontic compliant

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Do}_i)\), \(K_{\textsf {Obl}_i\varphi _1 } = \textsf {no}\), \(K_{\textsf {Do}_i\varphi _2 } = \textsf {in}\): D-practical-deontic deviant

Pair \(\textsf {Obl}_i, \textsf {Hold}_i\):

  • no conflict, \(K_{\textsf {Obl}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Hold}_i\varphi _2 } = \textsf {in}\): D-epistemic-deontic independent

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Hold}_i)\), \(K_{\textsf {Obl}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Hold}_i\varphi _2 } = \textsf {no}\): D-epistemic-deontic compliant

  • \(\square \text{- }Conflict (\textsf {Obl}_i, \textsf {Hold}_i)\), \(K_{\textsf {Obl}_i\varphi _1 } = \textsf {no}\), \(K_{\textsf {Hold}_i\varphi _2 } = \textsf {in}\): D-epistemic-deontic deviant

Pair \(\textsf {Des}_i, \textsf {Do}_i\):

  • no conflict, \(K_{\textsf {Des}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Do}_i\varphi _2 } = \textsf {in}\): D-practical-motivational independent

  • \(\square \text{- }Conflict (\textsf {Des}_i, \textsf {Do}_i)\), \(K_{\textsf {Des}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Do}_i\varphi _2 } = \textsf {no}\): D-practical-motivational compliant

  • \(\square \text{- }Conflict (\textsf {Des}_i, \textsf {Do}_i)\), \(K_{\textsf {Des}_i\varphi _1 } = \textsf {no}\), \(K_{\textsf {Do}_i\varphi _2 } = \textsf {in}\): D-practical-motivational deviant

Pair \(\textsf {Des}_i, \textsf {Hold}_i\):

  • no conflict, \(K_{\textsf {Des}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Hold}_i\varphi _2 } = \textsf {in}\): D-epistemic-motivational independent

  • \(\square \text{- }Conflict (\textsf {Des}_i, \textsf {Hold}_i)\), \(K_{\textsf {Des}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Hold}_i\varphi _2 } = \textsf {no}\): D-epistemic-motivational compliant

  • \(\square \text{- }Conflict (\textsf {Des}_i, \textsf {Hold}_i)\), \(K_{\textsf {Des}_i\varphi _1 } = \textsf {no}\), \(K_{\textsf {Hold}_i\varphi _2 } = \textsf {in}\): D-epistemic-motivational deviant

Pair \(\textsf {Do}_i, \textsf {Hold}_i\):

  • no conflict, \(K_{\textsf {Do}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Hold}_i\varphi _2 } = \textsf {in}\): D-epistemic-practical independent

  • \(\square \text{- }Conflict (\textsf {Do}_i, \textsf {Hold}_i)\), \(K_{\textsf {Do}_i\varphi _1 } = \textsf {in}\), \(K_{\textsf {Hold}_i\varphi _2 } = \textsf {no}\): D-epistemic-practical compliant

  • \(\square \text{- }Conflict (\textsf {Do}_i, \textsf {Hold}_i)\), \(K_{\textsf {Do}_i\varphi _1 } = \textsf {no}\), \(K_{\textsf {Hold}_i\varphi _2 } = \textsf {in}\): D-epistemic-practical deviant

We can remark that Definition 7.4 does not cater for cases where conflicting statements are both labelled \(\textsf {no}\), as it is not clear to us how to interpret such cases in a meaningful way in the absence of any mental elements labelled \(\textsf {in}\).
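Analogously, Definition 7.4 can be read as the following classification of an attitude at time t, given the \(\{\textsf {in}, \textsf {no}\}\) labels of the two conflicting statements; this is a minimal sketch with hypothetical names, returning None for the uncovered cases just mentioned.

```python
def dynamic_type(pair_name, has_conflict, label_first, label_second):
    """Classify the dynamic agent type at time t for one pair of conflicting
    statements (Definition 7.4), from their {in, no} labels in the attitude.

    pair_name: e.g. "practical-deontic" for the pair (Obl_i, Do_i)
    has_conflict: True iff square-Conflict holds for the pair of modalities
    label_first, label_second: 'in' or 'no', the statement labels K at time t
    """
    if not has_conflict and label_first == "in" and label_second == "in":
        return f"D-{pair_name} independent"
    if has_conflict and label_first == "in" and label_second == "no":
        return f"D-{pair_name} compliant"
    if has_conflict and label_first == "no" and label_second == "in":
        return f"D-{pair_name} deviant"
    return None  # cases not covered (e.g. both statements labelled 'no')

# Example: the obligation to act with care is accepted, the negligent action is not.
print(dynamic_type("practical-deontic", True, "in", "no"))
# -> "D-practical-deontic compliant"
```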

The dynamic type of an agent cannot influence any static type of the agent, whereas a static type can influence the dynamic type of the agent. For example, an S-motivational-deontic deviant agent may tend to become D-motivational-deontic deviant, but not necessarily so, since arguments for desires can be undercut. Formal investigation of the influence of static types on dynamic types is left to future research.

In summary, agents can have a logic-based characterisation in terms of static types and dynamic types. Static types are fixed at design time and they can influence dynamic types at run time, but not vice versa. We exemplify such logic-based characterisations later, in our illustration of the overall framework in Sect. 7.3.

7.2 Probabilistic characterisation

Since any argument-based RL agent may adapt its attitudes, resulting in a distribution of dynamic types, any agent can be characterised from a probabilistic perspective by a distribution of dynamic types. For example, in a particular environment state, an agent may be D-practical-deontic compliant with probability 0.1 and D-practical-deontic deviant with probability 0.9 (and thus D-practical-deontic independent with probability 0).

The distribution of attitudes, and thus of dynamic types, may be ‘shaped’ by specific static types. For example, an agent may not be able to take attitudes which are incompatible with its static type. From a probabilistic perspective, such attitudes will be taken with probability 0. Instead of using static types to profile an agent in terms of dynamic types, we may assign some attitudes an infinite energy to shape the agent in a similar manner. Nevertheless, a logic-based characterisation in terms of static types has the advantage of being a concise means of shaping the distribution of attitudes.

Besides attitudes which are incompatible with some static types, other attitudes may be discarded by associating them with an infinite energy. By doing so, the number of reinforceable attitudes can be reduced, thereby substantially reducing the computational complexity of an argument-based deliberation (Definition 6.19).

In particular, the computational complexity may be drastically reduced by assuming that an agent can perform at most one action at a time. Accordingly, we may discard all the attitudes where more than one reinforceable argument leading to an action \((\textsf {Do}_i \varphi \mathrm {\,at\,} t)\) is labelled \({\small {\textsf {ON}}}\) and \({\small {\textsf {IN}}}\), i.e. we may discard all the attitudes with more than one reinforceable action. If an attitude leads to no action, then it corresponds to a behaviour of inhibition.

To further reduce the complexity of a deliberation, we may also assume that only arguments supporting actions are reinforceable, i.e. arguments leading to statements of beliefs \((\textsf {Hold}_i \varphi \mathrm {\,at\,} t )\), desires \((\textsf {Des}_i \varphi \mathrm {\,at\,} t )\) and obligations \((\textsf {Obl}_i \varphi \mathrm {\,at\,} t )\) are unreinforceable, while arguments leading to actions \((\textsf {Do}_i \varphi \mathrm {\,at\,} t)\) are possibly reinforceable.
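A possible way to implement the first of these simplifications is sketched below: attitudes in which more than one action argument is labelled both \({\small {\textsf {ON}}}\) and \({\small {\textsf {IN}}}\) receive an infinite energy, so that they get probability 0; the data representation and names are hypothetical.

```python
import math

def prune_multi_action_attitudes(energies, action_labels):
    """Assign an infinite energy to every attitude with more than one action
    argument labelled both ON and IN (at most one action per time step).

    energies: dict mapping attitude identifiers to their energies
    action_labels: dict mapping attitude identifiers to the list of
                   (on_off_label, acceptance_label) pairs of its Do_i arguments
    """
    pruned = dict(energies)
    for attitude, labels in action_labels.items():
        n_active = sum(1 for on, acc in labels if on == "ON" and acc == "IN")
        if n_active > 1:
            pruned[attitude] = math.inf  # probability 0 under the Gibbs distribution
    return pruned

# Example: attitude 'a2' activates two actions and is therefore discarded.
energies = {"a1": 0.5, "a2": 0.2}
actions = {"a1": [("ON", "IN"), ("OFF", "OFF")],
           "a2": [("ON", "IN"), ("ON", "IN")]}
print(prune_multi_action_attitudes(energies, actions))  # {'a1': 0.5, 'a2': inf}
```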

Example 7.1

Suppose that there is an obligation to act with care, and that the agent holds that there is such an obligation, i.e. the agent has internalised it through an argument supporting the obligation to act with care and an argument supporting the action to act with care, as shown in Fig. 10.
Fig. 10

Argumentation graph describing agent i characterised by an internalised obligation to act with care

If the argument \(\textsf {O}^t\) is reinforceable, then the agent’s practical factorisation is as follows:
$$\begin{aligned} \begin{aligned}&P(S^t_{\textsf {obj}}, S^t, SC^t, SN^t, D^t_{\textsf {obj}}, D^t, DC^t, DN^t, C^t)\\&\quad = \odot _1(S^t_{\textsf {obj}}, S^t) \cdot \odot _2(D^t_{\textsf {obj}}, D^t) \cdot \otimes (S^t, SC^t, SN^t, D^t, DC^t, DN^t, O^t, OC^t) \end{aligned} \end{aligned}$$
(7.4)

However, such a factorisation leads to a computational burden due to the number of attitudes, see Table 2. For this reason, we may attach an infinite energy to every attitude where more than one argument supporting an action is labelled \({\small {\textsf {ON}}}\) and \({\small {\textsf {IN}}}\). In this case, the table of the reinforceable factor \(\otimes \) is reduced as in Table 3. \(\square \)

Table 2

View on the controlled factor \(\otimes \) conditioned on the safe state; each column of labels is one assignment of the factor

  • \(S^t:\,\textsf {Hold}_{i}\mathrm{safe}\mathrm {\,at\,} t\): IN, IN, IN, IN, IN, IN, IN, IN

  • \(O^t:\,\textsf {Obl}_{i} \mathrm{care}\mathrm {\,at\,} t\): IN, IN, IN, IN, IN, IN, IN, IN

  • \(SC^t:\,\textsf {Do}_i \mathrm{care}\mathrm {\,at\,} t\): IN, OFF, UND, OFF, IN, OFF, IN, OFF

  • \(SN^t:\, \textsf {Do}_i \mathrm{neglect}\mathrm {\,at\,} t\): OFF, IN, UND, OFF, OFF, OUT, OUT, OFF

  • \(OC^t:\, \textsf {Do}_i\mathrm{care}\mathrm {\,at\,} t\): OFF, OFF, OFF, OFF, IN, IN, IN, IN

Table 3

View on the reinforceable factor \(\otimes \) conditioned on the safe state, where assignments with infinite energy are not displayed; each column of labels is one assignment of the factor

  • \(S^t:\,\textsf {Hold}_{i}\mathrm{safe}\mathrm {\,at\,} t\): IN, IN, IN, IN

  • \(O^t:\,\textsf {Obl}_{i} \mathrm{care}\mathrm {\,at\,} t\): IN, IN, IN, IN

  • \(SC^t:\,\textsf {Do}_i \mathrm{care}\mathrm {\,at\,} t\): IN, OFF, OFF, OFF

  • \(SN^t:\, \textsf {Do}_i \mathrm{neglect}\mathrm {\,at\,} t\): OFF, IN, OFF, OFF

  • \(OC^t:\, \textsf {Do}_i\mathrm{care}\mathrm {\,at\,} t\): OFF, OFF, OFF, IN

As illustrated in Example 7.1, if the overall setting is such that every action can be supported by one and only one accepted argument, then an argument-based RL agent may boil down to a fine-grained RL agent in which state–action pairs are replaced by practical arguments, alongside the events of unreinforceable arguments. We will use this type of agent for our illustrations.

7.3 Illustrations

We now illustrate the framework with a few simple experiments on the basis of the MDP pictured in Fig. 1.

We will attempt to show how the declarative and defeasible features of the framework can ease the specification of well-argued models of an argument-based RL agent, amongst others. Each model of the agent and its environment was formally specified by a probabilistic defeasible theory. Then, each specification was executed, i.e. animated, using our argument-based SARSA algorithm (see Algorithm 5). As the framework is fully declarative, we just had to write the defeasible rules in a file and fix some parameters, such as the learning parameters and the length of a run, to obtain a batch of simulations from which we computed statistics.

For every experiment, the learning parameters used by the argument-based SARSA agent were as follows: learning rate \(\alpha = 0.1\), discount rate \(\gamma = 0.9\), and temperature \(\tau = 1\). The initial state was the safe state. For each setting, we traced the probability of careful and negligent behaviours in the safe state and the dangerous state, averaged over 100 runs, along with the absolute deviation.
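For the statistics reported below, a run-averaging step of the following kind can be assumed (a minimal sketch; the input format is hypothetical): it computes, per time step, the mean behaviour probability over the runs and the mean absolute deviation.

```python
def average_runs(runs):
    """Average behaviour probabilities over independent runs, per time step,
    and report the mean absolute deviation, as done for Figs. 11-15.

    runs: list of runs, each a list of probabilities (one per time step)
    """
    n_steps = len(runs[0])
    means, abs_devs = [], []
    for t in range(n_steps):
        values = [run[t] for run in runs]
        mean = sum(values) / len(values)
        means.append(mean)
        abs_devs.append(sum(abs(v - mean) for v in values) / len(values))
    return means, abs_devs

# Example with 3 toy runs of 2 time steps each.
print(average_runs([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]))
```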

Basic agent (control)

Let us consider the MDP pictured in Fig. 1, formalised with the environment and agent probabilistic defeasible theories given in Example 6.4. We animated the theories, and the averaged probabilities of each action in the safe and dangerous states are shown in Fig. 11.

We observe that, on average, such a basic agent tends to behave with care. Note that, in comparison to the safe state, the agent learns to behave with care in the dangerous state more slowly: as the agent learns to behave with care in the safe state, it has few accidents and thus hardly visits the dangerous state. As a consequence, the agent has few opportunities to learn to behave with care in the dangerous state (in comparison to the safe state).
Fig. 11

Control—a basic agent tends to behave with care

S–X–Y independent agent, with self-sanctioning desire

On the basis of this control, we conceived a troublesome agent with the desire to behave in a way which appears to be negligent, whatever the state. To model this desire of ‘negligence’ (here the agent does not know the meaning of the atom \(\mathrm{neglect}\)), it is sufficient to add the following rule to the agent’s theory:
$$\begin{aligned} \textsf {d}_{i}, 1: \quad \Rightarrow \textsf {Des}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \end{aligned}$$
(7.5)
If the agent acts with negligence, then there is a ‘self-reward’, expressed by the following rule added to the environment theory:
$$\begin{aligned} \textsf {outsr}_{\textsf {obj}}, 1: \textsf {Des}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t , \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_i\mathrm{util}(1) \,\mathrm {\,at\,} \, t+1 \end{aligned}$$
(7.6)
If the agent does not act with negligence, it is frustrated and goes through an unpleasant consequence, which can be modelled by a negative ‘self-sanction’. So, the following rule is added to the environment theory:
$$\begin{aligned} \begin{aligned} \textsf {outsp}_{\textsf {obj}}, 1: \textsf {Des}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t , \sim \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_i\mathrm{util}(-1) \,\mathrm {\,at\,} \, t+1 \end{aligned} \end{aligned}$$
(7.7)
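Rules 7.6 and 7.7 implement, in effect, the following utility scheme for the self-sanctioning desire; the sketch below is a plain restatement of that scheme with a function name of our own, not part of the declarative specification itself.

```python
def self_sanction_utility(desires_neglect, does_neglect):
    """Self-reward/self-sanction of rules 7.6 and 7.7: +1 if the desired negligent
    action is performed, -1 if the desire is held but not acted upon."""
    if not desires_neglect:
        return 0  # no desire to neglect, hence no self-reward or self-sanction
    return 1 if does_neglect else -1

print(self_sanction_utility(True, True))   # -> 1  (rule 7.6)
print(self_sanction_utility(True, False))  # -> -1 (rule 7.7)
```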
Simulation results are shown in Fig. 12. We observe that, on average, an agent tends to behave with negligence in the safe state, while there is no apparent preference in the dangerous state. Indeed, the probability that an agent is D-practical-motivational compliant in the safe state is close to 1 after 300 time steps, while such an agent appears, on average, equiprobably D-practical-motivational independent and D-practical-motivational compliant in the dangerous state. This tendency was expected since the expected utility of a negligent behaviour is now higher than the expected utility of a careful behaviour.
Fig. 12

S–X–Y independent agent with a self-sanctioning desire to act with negligence. Such an agent tends to behave with negligence in the safe state, with no preferences in the dangerous state

S–X–Y independent agent, with self-sanctioning desire and obligation

Consider an agent as previously described, and let us now add an obligation to act with care. We add the following rules to the environment and agent theories:
$$\begin{aligned}&\textsf {obl}_{\textsf {obj}}, 1: \quad \Rightarrow \textsf {Obl}_i \mathrm{care}\,\mathrm {\,at\,} \, t \end{aligned}$$
(7.8)
$$\begin{aligned}&\textsf {obl}_{i}, 1: \quad \Rightarrow \textsf {Obl}_i \mathrm{care}\,\mathrm {\,at\,} \, t \end{aligned}$$
(7.9)
The obligation is enforced: if the agent does not act with care then there is a sanction.
$$\begin{aligned} \begin{aligned} \textsf {out}_{\textsf {obj}}, 0.5: \textsf {Obl}_i \mathrm{care}\,\mathrm {\,at\,} \, t, \sim \textsf {Do}_i \mathrm{care}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_i\mathrm{util}(-6) \,\mathrm {\,at\,} \, t +1 \end{aligned} \end{aligned}$$
(7.10)
The rule \(\textsf {obl}_{i}\) is ‘internalised’ because it is included in the agent theory. Yet, the agent is S–X-Y independent, thus the desire to act with negligence and the obligation to act with care do not conflict, and therefore this internalised obligation has no direct effects.

Results of the animations are given in Fig. 13. We observe that such an S-X-Y independent agent, with a self-sanctioning desire to behave with negligence, learns to behave with care, because of the enforcement of an obligation to behave with care.

Though such an agent is S-X-Y independent, it appears as D-practical-deontic compliant as well as D-practical-motivational deviant with a probability close to 1 after 500 time steps in the safe state. In the dangerous state, the agent appears as D-practical-deontic compliant as well as D-practical-motivational deviant with a probability rising from around 0.6 at time 100 to 0.75 at time 1000. Hence, compared to the control, the agent with the obligation was quicker to learn to behave with care in the safe state, but it was not quicker to learn to behave with care in the dangerous state because of the limited number of visits to this state.
Fig. 13

S–X–Y independent agent desiring to act with negligence, with an enforced obligation to act with care. Such an agent tends to behave with care

S-motivational-deontic compliant agent, with self-sanctioning desire and obligation

Suppose now that the agent is S-motivational-deontic compliant. This agent therefore cannot derive its desire to behave with negligence, and consequently it is D-motivational-deontic compliant. Since the agent does not desire to behave with negligence, it will not self-sanction with respect to the fulfilment of this desire. As a consequence, the agent has more incentive to behave with care and is quicker to adopt a careful behaviour, as Fig. 14 shows.
Fig. 14

S-motivational-deontic compliant agent, with self-sanctioning desire to act with negligence and enforced obligation to act with care. Such an agent tends to behave with care in a faster manner compared to previous agents

S-practical-deontic compliant agent, with self-sanctioning desire and obligation

Suppose now a practical application for socio-technical systems, where an artificial agent is required to be fully compliant with some regulations, that is, the agent has to be regimented.

To design such a regimented agent, the designer may assume that the agent is S-practical-deontic compliant, so that the possibility to act with negligence is overridden by the obligation to act with care. Hence, the agent will always reject acting with negligence, and the only remaining option is to act with care. Consequently, the agent is now, with probability 1, D-practical-deontic compliant, D-practical-motivational deviant as well as D-motivational-deontic compliant, as Fig. 15 illustrates. In other words, the agent is ‘regimented’.
Fig. 15

S-practical-deontic compliant agent, with self-sanctioning desire to act with negligence and enforced obligation to act with care. Such an agent is regimented, and thus it always behaves with care

By providing the agent with the knowledge that an obligation holds, and by ‘hard-wiring’ the agent as a S-practical-deontic compliant agent, the agent was guided towards the ‘right’ decisions to make for the purpose of the application.

In summary, agents with different types not only learnt different behaviours, but also learnt at different speeds: for example, the behaviours of agents with a self-sanctioning desire and enforced obligations (Figs. 14 and 15) converged significantly faster than those of basic agents (Fig. 11). Since a major challenge faced by RL is how to improve learning speed, especially in large real-world applications (see Sect. 3), this observation suggests that learning speed can be significantly improved by modelling an agent with logic frameworks such as PA. This observation is consistent with those in [24, 25], as the argumentation-accelerated RL framework proposed in those works can be viewed as a special case of our present proposal, in which all ‘applicable’ arguments have probability 1 and are labelled \({\small {\textsf {ON}}}\) (all ‘inapplicable’ arguments are \({\small {\textsf {OFF}}}\)) and the preferred or grounded labellings are used [7].

8 Conclusion

How to model a bounded agent facing the uncertainty pertaining to the use of partial information and conflicting reasons as well as the uncertainty pertaining to the stochastic nature of the environment and agent’s actions? How to build argument-based and executable specifications of such agents? And how to combine models of agency with RL and norms?

To address these questions, we motivated and investigated a combination of PA and RL allowing us (i) to provide a rich declarative argument-based representation of MDPs, and (ii) to model an agent that can learn the utility of argument labellings from rewards, so that labellings that are more likely to lead to high rewards will have higher utility values, and will be selected more often.

This computational framework allowed us to move from a behavioural account of agent modelling to an argument-based mentalistic approach where attitudes are labellings of arguments or mental statements. Interestingly, whilst argument-based frameworks for reasoning agents often propose a combination of inference mechanisms for sceptical epistemic reasoning and credulous practical reasoning, our use of the grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling allows us to have a homogeneous inferential view covering epistemic and practical reasoning.

An advantage of this mentalistic approach is a fine-grained distinction amongst attitudes and associated behaviours. In this regard, we may characterise an agent from a logic-based perspective as well as from an energy-based and probabilistic perspective. In fact, nobody knows exactly what the dynamics of mental attitudes are. The mentalistic approach is thus meant to be a computational tool to account for and investigate hypotheses about agent attitudes in silico, with possible inputs and validation from in vivo or in situ experiments.

A disadvantage of the argument-based mentalistic approach lies in its computational complexity. As this approach is meant to account for the reinforcement of attitudes, and since the number of attitudes can be large, it is computationally more demanding than the traditional reinforcement of state–action pairs. Due to this computational complexity, this mentalistic approach may also have a strong impact on learning dynamics. However, we showed how the computational complexity can be circumscribed by using factors, and by setting an infinite energy for some attitudes. On this basis, we illustrated the framework with simple experiments to highlight the ability of the setting to represent and animate fine-grained models of agency.

As to future developments, the architecture of the framework invites extensions with respect to learning or reasoning abilities, most interestingly both. Regarding learning, more sophisticated mechanisms such as eligibility traces [54] or reward shaping [33] can be straightforwardly investigated. The energy-based argumentation model also paves the way to interesting functionalities: for example, we may learn the factor values from real-life data, so that we can reproduce in silico an environment or the observed behaviour of an agent, and we may induce the types of an agent by observing its behaviour. Regarding reasoning, the logic framework paves the way to the development of an argument-based belief-desire-intention architecture with learning abilities. Finally, as we alluded to, we focused on the setting of MDPs as a necessary step towards more interesting computational settings. In particular, PA will be more compelling in the setting of POMDPs, where an agent partially observes its current state, and makes argument-based decisions by deriving defeasible conclusions and updating these conclusions in light of new information.

9 Key notations

Some key notations used in the paper:

  • \(G\): an argumentation graph

  • \(T\): a defeasible theory

  • \(\textsf {ArgLabels}\): a set of labels for arguments

  • \(l\): a label for arguments

  • \(\mathrm {L}\): an argument labelling function

  • \(\mathscr {L}\): a set of argument labellings

  • \(L_A\): the random labelling of argument \(A\)

  • \(\mathbf {L}\): a set of random argument labellings

  • \(\mathbf {l}\): an assignment of a set of random argument labellings

  • \(\textsf {LitLabels}\): a set of labels for statements

  • \(k\): a label for statements

  • \(\mathrm {K}\): a statement labelling function

  • \(\mathscr {K}\): a set of statement labellings

  • \(K_\varphi \): the random labelling of statement \(\varphi \)

  • \(\mathbf K \): a set of random statement labellings

  • \(\mathbf {k}\): an assignment of a set of random statement labellings

Footnotes

  1. Though there seems to be an emerging consensus in the literature conceiving ‘undercutting’ to mean an attack on a rule and ‘undermining’ to be an attack on premises, we prefer to adopt here a terminology closer to early work on rule-based argumentation, see e.g. [41].

  2. Recall: the set of assumptive arguments supporting a set of assumptions \({Assum}\) is denoted \({\mathrm {AssumArg}}({Assum})\), see Notation 4.4.

  3. Recall: the set of assumptive arguments supporting a set of assumptions \({Assum}\) is denoted \({\mathrm {AssumArg}}({Assum})\), see Notation 4.4.

  4. We use the standard notation, so for \(\mathbf {Y} \subseteq \mathbf {X}\), we use \(\mathbf {x}(\mathbf {Y})\) to refer to the assignment within \(\mathbf {x}\) to the variables in \(\mathbf {Y}\). For example, if \(\mathbf {X}=\{X_1,X_2,X_3\}\), \(\mathbf {Y}=\{X_1,X_2\}\) and \(\mathbf {x}=\{X_1=1,X_2=2,X_3=3\}\), then \(\mathbf {x}(\mathbf {Y})=\{X_1=1,X_2=2\}\).

Notes

Acknowledgements

We would like to thank Pietro Baroni for his insights in argumentation. This work was supported by the Marie Curie Intra-European Fellowship PIEFGA-2012-331472.

References

  1. Alexy, R. (1989). A theory of legal argumentation: The theory of rational discourse as theory of legal justification. Oxford: Clarendon.
  2. Amgoud, L. (2009). Argumentation for decision making. In Argumentation in artificial intelligence (pp. 301–320). Springer.
  3. Artikis, A., Sergot, M., & Pitt, J. (2009). Specifying norm-governed computational societies. ACM Transactions on Computational Logic, 10(1), 1:1–1:42.
  4. Artikis, A., Sergot, M., Pitt, J., Busquets, D., & Riveret, R. (2016). Specifying and executing open multi-agent systems. In Social coordination frameworks for social technical systems (pp. 197–212). Springer.
  5. Atkinson, K., Baroni, P., Giacomin, M., Hunter, A., Prakken, H., Reed, C., et al. (2017). Towards artificial argumentation. AI Magazine, 38(3), 25–36.
  6. Atkinson, K., & Bench-Capon, T. J. M. (2007). Practical reasoning as presumptive argumentation using action based alternating transition systems. Artificial Intelligence, 171(10–15), 855–874.
  7. Baroni, P., Caminada, M., & Giacomin, M. (2011). An introduction to argumentation semantics. The Knowledge Engineering Review, 26(4), 365–410.
  8. Baroni, P., Governatori, G., & Riveret, R. (2016). On labelling statements in multi-labelling argumentation. In Proceedings of the 22nd European conference on artificial intelligence (Vol. 285, pp. 489–497). IOS Press.
  9. Bellman, R. (1956). Dynamic programming and Lagrange multipliers. Proceedings of the National Academy of Sciences of the United States of America, 42(10), 767.
  10. Bench-Capon, T. J. M., & Atkinson, K. (2009). Abstract argumentation and values. In I. Rahwan & G. R. Simari (Eds.), Argumentation in artificial intelligence. Springer.
  11. Bertsekas, D. P. (1995). Dynamic programming and optimal control (Vol. 1). Belmont, MA: Athena Scientific.
  12. Besnard, P., García, A. J., Hunter, A., Modgil, S., Prakken, H., Simari, G. R., et al. (2014). Introduction to structured argumentation. Argument & Computation, 5(1), 1–4.
  13. Broersen, J., Dastani, M., Hulstijn, J., & van der Torre, L. (2002). Goal generation in the BOID architecture. Cognitive Science Quarterly, 2(3–4), 428–447.
  14. Chen, S. H., & Huang, Y. C. (2005). Risk preference and survival dynamics. In Agent-based simulation: From modeling methodologies to real-world applications, Agent-based social systems (Vol. 1, pp. 135–143). Tokyo: Springer.
  15. Conte, R., & Castelfranchi, C. (1995). Cognitive and social action. London: University College of London Press.
  16. Conte, R., & Castelfranchi, C. (2006). The mental path of norms. Ratio Juris, 19, 501–517.
  17. Conte, R., Falcone, R., & Sartor, G. (1999). Introduction: Agents and norms: How to fill the gap? Artificial Intelligence and Law, 7(1), 1–15.
  18. Cormen, T. H., Leiserson, C. E., Rivest, R. L., Stein, C., et al. (2001). Introduction to algorithms (Vol. 2). Cambridge: MIT Press.
  19. Dung, P. M. (1995). On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence, 77(2), 321–358.
  20. Edmonds, B. (2004). How formal logic can fail to be useful for modelling or designing MAS. In Regulated agent-based social systems, Lecture Notes in Computer Science (Vol. 2934, pp. 1–15). Springer.
  21. Fasli, M. (2004). Formal systems and agent-based social simulation equals null? Journal of Artificial Societies and Social Simulation, 7(4), 1–7.
  22. Fornara, N., & Colombetti, M. (2009). Specifying and enforcing norms in artificial institutions. In Declarative agent languages and technologies VI, Lecture Notes in Computer Science (Vol. 5397, pp. 1–17). Springer.
  23. Fox, J., & Parsons, S. (1997). On using arguments for reasoning about actions and values. In Proceedings of the AAAI spring symposium on qualitative preferences in deliberation and practical reasoning.
  24. Gao, Y., & Toni, F. (2014). Argumentation accelerated reinforcement learning for cooperative multi-agent systems. In Proceedings of the 21st European conference on artificial intelligence (pp. 333–338). IOS Press.
  25. Gao, Y., Toni, F., & Craven, R. (2012). Argumentation-based reinforcement learning for RoboCup soccer keepaway. In Proceedings of the 20th European conference on artificial intelligence (pp. 342–347). IOS Press.
  26. Gaudou, B., Lorini, E., & Mayor, E. (2013). Moral guilt: An agent-based model analysis. In Advances in social simulation: Proceedings of the 9th conference of the European Social Simulation Association (pp. 95–106).
  27. Governatori, G., & Rotolo, A. (2008). BIO logical agents: Norms, beliefs, intentions in defeasible logic. Autonomous Agents and Multi-Agent Systems, 17(1), 36–69.
  28. Hunter, A., & Thimm, M. (2017). Probabilistic reasoning with abstract argumentation frameworks. Journal of Artificial Intelligence Research, 59, 565–611.
  29. Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques (Adaptive computation and machine learning). Cambridge: The MIT Press.
  30. Kostrikin, A. I., Manin, Y. I., & Alferieff, M. E. (1997). Linear algebra and geometry. Washington, DC: Gordon and Breach Science Publishers.
  31. Modgil, S., & Caminada, M. (2009). Proof theories and algorithms for abstract argumentation frameworks. In Argumentation in artificial intelligence (pp. 105–129). Springer.
  32. Muller, J., & Hunter, A. (2012). An argumentation-based approach for decision making. In 24th international conference on tools with artificial intelligence (Vol. 1, pp. 564–571). IEEE.
  33. Ng, A., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th international conference on machine learning (pp. 278–287).
  34. Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., & Liang, E. (2006). Autonomous inverted helicopter flight via reinforcement learning. In Experimental robotics IX (pp. 363–372). Springer.
  35. Oren, N. (2014). Argument schemes for normative practical reasoning (pp. 63–78). Berlin: Springer.
  36. Parsons, S., & Fox, J. (1996). Argumentation and decision making: A position paper. In Practical reasoning (pp. 705–709). Springer.
  37. Pattaro, E. (2005). The law and the right. In E. Pattaro (Ed.), Treatise of legal philosophy and general jurisprudence (Vol. 1). Berlin: Springer.
  38. Pollock, J. L. (1995). Cognitive carpentry: A blueprint for how to build a person. Cambridge, MA: MIT Press.
  39. Prakken, H. (2006). Combining sceptical epistemic reasoning with credulous practical reasoning. In Proceedings of the 1st conference on computational models of argument (pp. 311–322). IOS Press.
  40. Prakken, H. (2011). An abstract framework for argumentation with structured arguments. Argument and Computation, 1(2), 93–124.
  41. Prakken, H., & Sartor, G. (1997). Argument-based extended logic programming with defeasible priorities. Journal of Applied Non-Classical Logics, 7(1–2), 25–75.
  42. Prakken, H., & Sartor, G. (2015). Law and logic: A review from an argumentation perspective. Artificial Intelligence, 227, 214–245.
  43. Rahwan, I., & Simari, G. R. (Eds.). (2009). Argumentation in artificial intelligence. Berlin: Springer.
  44. Riveret, R., Baroni, P., Gao, Y., Governatori, G., Rotolo, A., & Sartor, G. (2018). A labelling framework for probabilistic argumentation. Annals of Mathematics and Artificial Intelligence, 83(1), 21–71.
  45. Riveret, R., Korkinof, D., Draief, M., & Pitt, J. V. (2015). Probabilistic abstract argumentation: An investigation with Boltzmann machines. Argument & Computation, 6(2), 178–218.
  46. Riveret, R., Pitt, J. V., Korkinof, D., & Draief, M. (2015). Neuro-symbolic agents: Boltzmann machines and probabilistic abstract argumentation with sub-arguments. In Proceedings of the 14th international conference on autonomous agents and multiagent systems (pp. 1481–1489). ACM.
  47. Riveret, R., Rotolo, A., & Sartor, G. (2012). Probabilistic rule-based argumentation for norm-governed learning agents. Artificial Intelligence and Law, 20(4), 383–420.
  48. Ross, A. (1958). On law and justice. London: Stevens.
  49. Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical report, University of Cambridge.
  50. Sartor, G. (2005). Legal reasoning: A cognitive approach to the law. Berlin: Springer.
  51. Shams, Z., Vos, M. D., Oren, N., Padget, J., & Satoh, K. (2015). Argumentation-based normative practical reasoning. In Proceedings of the 3rd international workshop on theory and applications of formal argumentation, revised selected papers (pp. 226–242). Springer.
  52. Simari, G. I., Shakarian, P., & Falappa, M. A. (2016). A quantitative approach to belief revision in structured probabilistic argumentation. Annals of Mathematics and Artificial Intelligence, 76(3), 375–408.
  53. Stone, P., Sutton, R. S., & Kuhlmann, G. (2005). Reinforcement learning for RoboCup soccer keepaway. Adaptive Behavior, 13, 165–188.
  54. Sutton, R. S., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.
  55. Tadepalli, P., Givan, R., & Driessens, K. (2004). Relational reinforcement learning: An overview. In Proceedings of the ICML'04 workshop on relational reinforcement learning.
  56. van der Hoek, W., Roberts, M., & Wooldridge, M. (2007). Social laws in alternating time: Effectiveness, feasibility, and synthesis. Synthese, 156(1), 1–19.
