A probabilistic argumentation framework for reinforcement learning agents
Abstract
A bounded-reasoning agent may face two dimensions of uncertainty: firstly, the uncertainty arising from partial information and conflicting reasons, and secondly, the uncertainty arising from the stochastic nature of its actions and the environment. This paper attempts to address both dimensions within a single unified framework, by bringing together probabilistic argumentation and reinforcement learning. We show how a probabilistic rule-based argumentation framework can capture Markov decision processes and reinforcement learning agents; and how the framework allows us to characterise agents and their argument-based motivations from both a logic-based perspective and a probabilistic perspective. We advocate and illustrate the use of our approach to capture models of agency and norms, and argue that, in addition to providing a novel method for investigating agent types, the unified framework offers a sound basis for taking a mentalistic approach to agent profiles.
Keywords
Probabilistic argumentation · Markov decision process · Reinforcement learning · Norms
1 Introduction
Probabilistic argumentation (PA) and reinforcement learning (RL) address, from different angles, issues pertaining to bounded rationality.
Formal argumentation addresses bounded rationality by modelling defeasible reasoning. Formal argumentation frameworks represent partial knowledge as arguments and relations (e.g. attack and support) amongst them, resolve conflicts arising from competing arguments by assessing their comparative strengths, derive defeasible conclusions and update such conclusions in the light of new information. Recently, formal argumentation has also been studied in probabilistic settings, leading to PA frameworks, so that the alternative statuses of arguments are events having probability values. Such probabilistic investigations endow formal argumentation with the ability to address bounded rationality from a qualitative and quantitative perspective.
On the other hand, RL addresses bounded rationality by modelling agents interacting with their environment in a trial-and-error style: agents receive ‘rewards’ (i.e. positive reinforcements) if their actions lead to desirable outcomes in the long run, and receive ‘punishments’ (i.e. negative reinforcements) otherwise. The target of agents is to maximise their long-term reward. RL agents learn to achieve this target either by selecting the actions that appear to be preferable according to the agents’ previous experiences (exploitation), or by trying other actions that have the potential to bring higher long-term rewards (exploration). RL is widely used as a technique for sequential decision making in stochastic environments (‘stochastic’ means here that the outcome of an action is not deterministic), so that agents can learn the optimal behaviour without being explicitly taught.
This paper brings together PA and RL: we quantitatively measure argument values by a utility function, and automatically learn utility values by RL. Arguments with higher utilities will have a higher chance of being used to back certain behaviours. We ensure that potentially useful arguments are drawn with some non-zero probability, so as to prevent RL from getting stuck in a local optimum.

On this basis, our approach is meant to support two sorts of models:
argument-based theoretical models of natural agency, that is, models that explain and forecast how diverse kinds of natural agents will behave or would behave under particular contexts or circumstances;

argument-based operational models of agent-based systems, that is, computer systems of adaptive agents using argumentation and RL to make rational choices in highly stochastic environments, with noisy (i.e. imperfect and uncertain) information.
As to argument-based models for agent-based systems, our intention is to investigate a possible combination of PA and RL, as complementary ways to endow bounded rational agents with the ability to cope with uncertain environments, towards smarter agent-based applications. Though a large amount of theoretical work has focused on argumentation, we must acknowledge that argumentation has not yet found many applications in ‘real-life’ computer systems. RL is certainly much more successful in this regard, as it has been widely deployed in real-life applications. Hence, we hope that argumentation will find useful applications through its association with RL.
Our framework supports the fine-grained characterisation of cognitive profiles of RL agents. So far, logic-based frameworks have been used to characterise agent types, i.e. to analyse the reasoning of agents and in particular the various ways to approach conflicts between varied motivations (e.g. self-interest vs. norm compliance). By combining PA and RL, we will characterise agents and their argument-based motivations from a logical as well as a probabilistic perspective, and thus provide a novel method to investigate agent types. In particular, since formal argumentation supports formal accounts of normative and legal reasoning, we hope that this framework will ease the analysis and construction of models where both norms and uncertainty play an important role.
This paper is organised as follows. In Sect. 2, we further motivate the combination of PA and RL for our purposes. The RL framework is formally presented in Sect. 3. Our argumentation setting is introduced in Sect. 4, and its PA development is detailed in Sect. 5. We discuss in Sect. 6 how to build argument-based RL agents on the basis of our combination of PA and RL. In Sect. 7, we illustrate how the framework can be exploited to characterise profiles of argument-based RL agents from a logic-based and probabilistic perspective, before concluding in Sect. 8.
2 Motivations
In this section, we further motivate our approach to bounded agents by combining PA and RL, and we do so from three perspectives. Firstly, we introduce reasons for combining these two computational frameworks to build our agents (Sect. 2.1). Secondly, we motivate our proposal with respect to possible applications (Sect. 2.2). Finally, the approach is motivated by limitations of related work (Sect. 2.3).
2.1 Probabilistic argumentation and reinforcement learning
When modelling, specifying or building an agent reasoning with some knowledge within a formal framework, we face the problems of how to represent the knowledge and how to reason with it. We adopt a declarative approach, rather than a procedural one, as we think that declarative models can be more easily updated and used to test alternative hypotheses, which is essential both for a reasoning agent and for scientists investigating such an agent. Declarative models can also be viewed as executable specifications [3, 4]. As a declarative language, we employ a logic-based formalism to capture the cognitive states of an agent and to enable reasoning on the basis of such states. So, statements of this language represent an agent’s beliefs, internalised obligations, desires and actions. Defeasible inferences are applied to such cognitive statements to generate further statements and to determine the agent’s actions.
The choice of a declarative representation of knowledge coupled with defeasible inferences leads us to adopt a nonmonotonic logic-based framework. As our nonmonotonic logic-based framework, we endorse formal argumentation. This allows us to take advantage of work on argumentation in knowledge representation, nonmonotonic reasoning and decision making [5, 43], and in normative reasoning [42, 50], through the modular integration of different aspects of reasoning, such as the construction of arguments and the acceptance of arguments and statements [7, 12]. By adopting an argument-based approach, declarative models and hypotheses can be more naturally updated and argued upon.
Whilst argumentation is well suited to reasoning with partial information and conflicts, it does not deal with measures of uncertainty as conceived in studies of randomness or stochasticity. To deal with uncertainty as related to the degree of credibility of arguments and statements, we supplement argumentation theory with probability theory. In particular, we adopt an energy-based model [29] for argumentation [44, 45, 46]. Energy-based models attach a scalar quantity called an energy to any assignment of the variables in the model. Then, on the basis of energies, each assignment is given a probability. Making inferences with an energy-based model consists of comparing the energies associated with various assignments of the variables, and choosing the one with the smallest energy. In this view, the energy reflects the utility or the credit one puts into a possible assignment of the argumentation system; an assignment associates every argument with an acceptance status. On this basis we determine the probability that an argument has a certain status, and compute the probability distribution of arguments’ statuses. The energies of these configurations can be fixed by human operators or learnt from data or experiences by means of machine learning.
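To make the energy-to-probability step concrete, here is a minimal sketch assuming a Gibbs/Boltzmann distribution \(P(x) \propto e^{-E(x)}\) over complete assignments, a standard choice for energy-based models (the argument names, statuses and energy values below are made up for illustration):

```python
import math

# Hypothetical energies for three complete assignments of argument statuses
# (lower energy = more credible assignment); the values are made up.
energies = {
    ("A:in", "B:out"): 1.0,
    ("A:out", "B:in"): 2.0,
    ("A:out", "B:out"): 4.0,
}

# Gibbs/Boltzmann distribution: P(x) = exp(-E(x)) / Z
Z = sum(math.exp(-e) for e in energies.values())
probs = {cfg: math.exp(-e) / Z for cfg, e in energies.items()}

# Inference picks the assignment with the smallest energy,
# which is also the most probable one under this distribution.
best = min(energies, key=energies.get)
```

Lower energies thus translate into higher probabilities, and the marginal probability of a single argument's status can be obtained by summing `probs` over the assignments agreeing on that status.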
To use RL to learn the energies of argumentation configurations, we formulate the energy-based model of probabilistic argumentation in terms of Markov decision processes (MDPs). MDPs are widely used to formulate sequential decision-making (SDM) problems: rather than making a ‘one-shot’ decision to maximise the immediate utility, agents in SDM problems need to select sequences of actions covering manifold situations to maximise the long-term accumulated utility. In stochastic environments, where an action a in a state s may trigger transitions to different states along with different utility values, a learning agent does not know a priori the distributions over the possible utilities and states. Through RL, an agent aims at achieving, or at least approximating, a policy that maximises the rewards in the long run, where a policy maps perceived states of the environment to actions to be performed when in those states. In our PA framework for RL argument-based agents, a policy maps perceived states of the environment to possible statuses of its arguments, and each argument in turn is for or against some executable actions. By trial and error, RL learns the best policy and its corresponding distribution of arguments’ statuses, hence the energies.
While RL is used to learn the energies of particular argumentation configurations of an agent, PA can in turn help address problems of common RL settings. In particular, a common problem pertains to documenting decisions [32], i.e. the problem of explaining why some actions are good or bad in a state, and of developing corresponding policies. To address this problem, mental statements, such as beliefs, desires and actions, can be justified or discarded by the interplay of arguments to back decision making. Another common problem pertains to the size of the state space, whose expansion involves an exponential growth of the computational requirements for obtaining optimal policies. This is the famous curse of dimensionality faced by common RL algorithms [54]. In this regard, a combination of PA and RL can be useful to apply arguments to new situations similar to previously encountered ones [55]. So, by representing (mental) states with statements supported by arguments, the decision-making process of our RL agents may become more informative and understandable, and thus possibly more easily reusable.
In many real applications, agents have only partial knowledge of the environment, and SDM problems in such scenarios can be modelled as partially observable MDPs (POMDPs). In POMDPs, an agent partially observes its current state, makes decisions by deriving defeasible conclusions, and updates these conclusions in the light of new observations. If the agent has full knowledge of the current state, then a POMDP boils down to an MDP. In this paper, and as a necessary first step, we work with argument-based MDPs, i.e. an argument-based knowledge representation and reasoning setting for MDPs, paving the way to POMDP counterparts.
We will not provide our agents with the ability to ‘observe’ norms or normative rules (this is an issue left to future research), but we will endow them with some initial normative knowledge. By doing so, we can also explore the significance of providing an RL agent with such normative knowledge to select arguments leading to useful decisions.
2.2 Applications

As set out in the introduction, our framework is meant to support two sorts of models:
argument-based theoretical models about natural agents,

argument-based operational models for computer systems.
Logic-based frameworks are well established for investigating operational models of computer systems. Yet logic-based investigations are sometimes criticised with regard to their applications in the social sciences, and in particular agent-based social simulation (ABSS). Experimental insights are possible from social simulations, and logic-based investigations can be employed to study some relevant concepts and their mathematical properties (axiomatisation, decidability, complexity, etc.). Nevertheless, such investigations are sometimes deemed to add nothing interesting to the understanding of social phenomena, and might lead to “empty formal logic papers without any results” [20].
In contrast to recurrent criticisms of logic-based approaches in ABSS, these approaches are defended by other researchers. For example, [21] argues that logic can be useful in ABSS because a logical analysis based on (a) a philosophical or sociological theory, (b) observations and data about a particular social phenomenon, and (c) intuitions, or a blend of them, can provide the requirements and the specification for an ABSS system and, more generally, a MAS. Moreover, a logic-based system might help to check the validity of the ABSS model and to adjust it by providing a clear understanding of the formal model underpinning it.
As far as we are concerned, and for argument-based investigations of both theoretical and operational models, since formal argumentation is particularly suitable for providing formal accounts of legal reasoning, our framework is meant to support the analysis and construction of models where norms play an important role. In particular, norm-governed computer systems could profit from argument-based analyses and techniques developed for modelling legal reasoning, in order to formalise normative systems tailored to govern computer systems. This view is in line with Conte et al. [17], who enquire how to fill the gap between the frameworks of autonomous agents and legal theory. In this view, if the interaction between models of normative systems and cognitive agents requires integrating work in the legal and MAS domains, then a common formalism, for instance argumentation, could significantly facilitate such an integration.
2.3 Related work
Since desires, goals, plans and actions often conflict or can be justified in alternative ways, computational models of argument have been set up to model decision making and practical reasoning, i.e. reasoning about what it is best for a particular agent to do in a given situation.
An early line of investigation was initiated by J. Fox and S. Parsons, in which argumentation frameworks were proposed for practical reasoning [36], for instance for making decisions about the expected value of actions [23]. Another early line of research was carried out by Pollock [38], at the crossroads of artificial intelligence and philosophy, who proposed a general theory of rationality and its implementation in OSCAR, an architecture for an autonomous rational agent using arguments. More recent works are varied, although they often employ (adaptations of) Dung’s argumentation frameworks to capture arguments for (in)compatible goals or desires, and plans to achieve them. For example, Amgoud [2] proposed an argument-based model for decision making in two phases: in the first, inference phase, arguments in favour of/against each option (action) are built and evaluated in relation to some semantics; in the second, comparison phase, pairs of alternative options are compared using a given criterion. Other work took a scheme-based approach where argument schemes and critical questions are applied to practical reasoning. For instance, Atkinson and Bench-Capon [6] described an approach to practical reasoning based on Action-based Alternating Transition Systems (AATS) [56] to reason rigorously about actions and their effects, and for the presumptive justification of actions through the instantiation of argument schemes, which are then examined through a series of critical questions. Practical reasoning can also be considered together with norms, leading to ‘normative practical reasoning’. For example, Oren [35], inspired by [6], describes a formal normative model based on AATS, along with argument schemes and critical questions, to reason about how goals and obligations lead to preferences over the possible executions of the system, while Shams et al. [51] propose another formal framework for normative practical reasoning that is able to generate consistent plans for a set of conflicting goals and norms. The above-mentioned approaches to argument-based practical reasoning have the advantage of determining decisions with rich, argued explanations meant to be easy to understand, but they incorporate no (reinforcement) learning.
There has been some work devoted to integrating argumentation frameworks and RL, so as to increase learning speed. For example, Gao et al. [24, 25] proposed the argumentation-accelerated reinforcement learning (AARL) framework, which can be applied to both single-agent and cooperative multi-agent problems. They built a variant of the value-based argumentation framework [10] to represent people’s (possibly conflicting) domain-specific beliefs and derive the ‘good’ action for each agent by using some argumentation semantics. The framework uses a potential-based reward-shaping technique to recommend ‘good’ actions to agents, and its authors showed that the approach can improve the learning speed in single-agent and cooperative multi-agent problems. However, the AARL framework does not take into account the degree of uncertainty attached to arguments.
It is often assumed that reasoning about beliefs should be sceptical while reasoning about actions should be credulous. These assumptions lead to logic-based systems that combine different semantic accounts of epistemic and practical reasoning, see e.g. [39]. Beyond computational models of argument, the use of different semantics is also backed by the idea that actions have no truth values, unlike the statements that underlie beliefs. Besides, when degrees of uncertainty are attached to arguments and their conclusions, it is interesting to investigate to what extent a unique semantics could be devised to cover epistemic and practical reasoning patterns, and we will do so.
To address the combination of epistemic and practical reasoning with learning abilities, we propose a PA framework allowing us to integrate seamlessly an argument-based knowledge representation setting, probability theory and reinforcement learning. Computational models of argumentation and probabilistic considerations can be combined in various ways (see e.g. [28, 44, 52]). In this paper, we reappraise the setting of [47] for multi-agent systems (MAS) with the approach of ‘probabilistic labellings’ as developed in [44]. Instead of exposing diverse inference rules for epistemic and practical reasoning [39], our PA framework uses a unique labelling specification for both types of reasoning. Instead of planning [2, 6, 35, 51], we embrace RL. As a consequence, instead of an AATS to reason about actions and their effects (as in [6, 35, 51]), we adopt the standard MDP setting for RL, so that we can capture the uncertainty on actions, their effects and the environment. Instead of using an AATS along with argument schemes and critical questions to generate arguments, we represent MDPs and RL agents with arguments, leading to what we call argument-based MDPs and RL. By doing so, we obtain an argument-based knowledge representation framework combined with a classical decision setting that directly uses a probability distribution and a utility (value) function over alternative decisions. This allows us to fill a gap between argument-based practical reasoning, where arguments are meant to qualitatively explain decisions but the valuation of decisions is elusive, and classical decision theory, where choices are typically assigned values but can be difficult to explain in detail from a qualitative perspective with arguments. The gap between argument-based practical reasoning and classical decision making has of course attracted research before, see e.g. work dating back to [23], and thus our intention is also to provide an RL-based alternative for addressing it.
As to argument-based MDPs and RL, it turns out that Gao et al.’s AARL framework [24, 25] can be viewed as a special case of our probabilistic argumentation framework. In their approach, the ‘applicable’ arguments, i.e. the arguments whose premises are satisfied, are selected in each state, and only these applicable arguments are used to derive the ‘good’ actions for agents in that state. As we will see, this can be implemented in our setting by assigning probability zero to all labellings in which these arguments are inapplicable.
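The reduction just described can be sketched as conditioning a distribution over labellings: assign probability zero to the labellings that mark a premise-satisfied argument as inapplicable, then renormalise. The toy labellings and numbers below are made up for illustration:

```python
# Toy distribution over labellings; each labelling records a status per argument.
labellings = {
    ("A:applicable", "B:applicable"): 0.5,
    ("A:applicable", "B:inapplicable"): 0.3,
    ("A:inapplicable", "B:applicable"): 0.2,
}

def condition_on_applicable(dist, args):
    """Keep only labellings where every argument in `args` is applicable
    (the others implicitly get probability zero), then renormalise."""
    kept = {lab: p for lab, p in dist.items()
            if all(f"{a}:applicable" in lab for a in args)}
    z = sum(kept.values())
    return {lab: p / z for lab, p in kept.items()}

# Suppose both A and B have their premises satisfied in the current state:
posterior = condition_on_applicable(labellings, ["A", "B"])
```

Under this conditioning, an AARL-style selection of applicable arguments is recovered as a degenerate case of the probabilistic labelling distribution.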
Besides the above works, which are rather oriented towards operational models for computer systems, a model which includes argumentation, probability and learning may provide a fresh approach to normative social psychology and cognition, and facilitate the interaction and integration of manifold studies on law and norms. Empiricist approaches to the law view single norms as socio-psychological entities, i.e. as beliefs, accompanied by conative states, entertained by individual agents [37, 48]. Theorists of legal logic and argumentation [1] prefer to focus on ideal patterns for normative reasoning and rational practical interaction. Cognitive scientists focus on norms as specific instances of socio-cognitive content, and on the learning processes of imitation and adaptation through which agents detect social norms and endorse them [15, 16]. Some attempts to integrate psychological and logical-argumentative aspects exist (see e.g. [50]), but they fail to take learning into account. As our normative agents can endorse norms within a probabilistic argument-based setting, reason and act according to arguments, and learn through experience the reliability and utility of arguments and combinations of them, we are able to start merging psychological, logical-argumentative and cognitive insights into a new synthesis.
In the rest of the paper, we thus investigate a rule-based PA framework to study and animate argument-based executable specifications of MDPs, where argument-based RL agents can cope with the uncertainty arising from partial information and conflicting reasons, as well as the uncertainty arising from the stochastic nature of their actions and their environment.
3 Reinforcement learning setting
In this section, we outline Markov decision processes (MDPs) (Sect. 3.1), a mathematical framework widely used to model reinforcement learning problems. Then, we introduce a popular RL algorithm, SARSA, and briefly illustrate its use in an MDP (Sect. 3.2).
3.1 Markov decision processes
An MDP is a mathematical framework for modelling decision making when the outcomes of actions are partly unknown. In this work, we focus on finite-horizon MDPs with discrete states and actions. In addition, we assume the next state is only determined by the current state and action, thus working with the first-order Markovian assumption (higher-order Markovian assumptions can be reduced to the first-order Markovian assumption by including historical information in the state). We study these MDPs because they have been widely used in RL research, and provide a simple yet generic mathematical foundation for our RL framework.
Definition 3.1
An MDP is a tuple \((S, A, P, R)\), where:

S is the set of states,

A is the set of actions,

\(P(s' \mid s,a)\) is the transition probability of moving from state s to \(s'\) by performing action a,

\(R(s' \mid s,a)\) gives the immediate reward received when action a is executed in state s, moving to state \(s'\).
While the above definition is a common characterisation of MDPs, featuring abstract atomic actions, these actions may be replaced by structured concepts such as complex ‘attitudes’ or ‘behaviours’. We will do so later when building up argument-based MDPs. In the remainder of this section, and for the sake of clarity, MDPs remain characterised in terms of such abstract atomic actions.
Example 3.1
Consider the MDP depicted in Fig. 1, whose components are as follows.

States. There are two states in this MDP: safe and danger. Formally, the state set is \(S = \{\mathrm{safe}, \mathrm{danger}\}\).

Actions. There are two actions in this MDP: care and neglect; both are available in each state. Formally, we have \(A = \{\mathrm{care}, \mathrm{neglect}\}\).

Transition probability. From Fig. 1 we can see that when the agent is in state safe, if it performs action care, it remains in state safe with probability 0.99 and transitions to state danger with probability 0.01. Formally, this transition dynamic can be represented by \(P(\mathrm{safe} \mid \mathrm{safe}, \mathrm{care}) = 0.99\) and \(P(\mathrm{danger} \mid \mathrm{safe}, \mathrm{care}) = 0.01\). Other transition probabilities can be represented similarly. Of course, in many real applications (e.g. Robot Soccer games [53] and helicopter control [34]), the transition probabilities are unknown.

Rewards. We give two rewards below, and other reward values can be obtained similarly: from Fig. 1, when the agent is in state safe and performs action care, if the agent remains in state safe, it receives a reward of 1; otherwise, if the agent transitions to state danger, it receives a reward of \(-11\). These rewards can be denoted as \(R(\mathrm{safe} \mid \mathrm{safe},\mathrm{care}) = 1\) and \(R(\mathrm{danger} \mid \mathrm{safe},\mathrm{care}) = -11\).
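To make Definition 3.1 and Example 3.1 concrete, the safe/danger MDP can be encoded as plain lookup tables. This is a sketch: only the transition probabilities and rewards stated above are taken from the example; the remaining entries, and the sign convention on the danger reward, are illustrative assumptions.

```python
import random

S = ["safe", "danger"]        # states
A = ["care", "neglect"]       # actions

# P[(s, a)] maps each successor state s' to the probability P(s' | s, a).
P = {
    ("safe", "care"):      {"safe": 0.99, "danger": 0.01},  # as in Example 3.1
    ("safe", "neglect"):   {"safe": 0.70, "danger": 0.30},  # illustrative assumption
    ("danger", "care"):    {"safe": 0.60, "danger": 0.40},  # illustrative assumption
    ("danger", "neglect"): {"safe": 0.10, "danger": 0.90},  # illustrative assumption
}

# R[(s, a)] maps each successor state s' to the immediate reward R(s' | s, a).
R = {
    ("safe", "care"): {"safe": 1, "danger": -11},  # danger reward read as a punishment
    # remaining (s, a) reward entries would be filled in analogously (omitted here)
}

def step(s, a, rng=random):
    """Sample a successor state and reward, as the environment would."""
    successors = list(P[(s, a)].keys())
    weights = list(P[(s, a)].values())
    s_next = rng.choices(successors, weights=weights)[0]
    return s_next, R[(s, a)][s_next]
```

A learning agent would only call `step`; it never inspects `P` or `R` directly, which mirrors the remark that in many real applications the transition probabilities are unknown.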
3.2 SARSA algorithm
Example 3.2
To illustrate how the SARSA algorithm works, we apply SARSA to the MDP illustrated in Fig. 1. We let the learning rate \(\alpha = 0.1\), the discount rate \(\gamma = 0.9\) and the temperature \(\tau = 1\). Also, we initialise all Q-values to 0 (line 1 in Algorithm 1).
Suppose the agent is currently in state safe; thus, s is safe (line 3). Then we choose an action in this state. Since \(Q(\mathrm{safe},\mathrm{care}) = Q(\mathrm{safe},\mathrm{neglect}) = 0\), according to the softmax policy (Eq. 3.8), the probabilities of choosing care and neglect are both 0.5. Suppose we choose care, i.e. we let \(a = \mathrm{care}\) (line 4). Then we perform this action to observe its outcome (line 6). Suppose that, by performing care, the agent transitions to state \(\mathrm{safe}\) and receives a reward of 1 (note that the agent does not know the transition probability a priori). Thus, \(s' = \mathrm{safe}\) and \(r = 1\). We use softmax to choose action \(a'\) in state \(s'\) (line 7). Again, since all Q-values retain their initial values (i.e. 0), both actions have the same probability of being chosen. Suppose \(a' = \mathrm{neglect}\). Then we can update the Q-value of the state-action pair (s, a) (line 8); the new Q(s, a) value is 0.1, i.e. \(Q(\mathrm{safe}, \mathrm{care}) = 0.1\). After s and a are updated (line 9), the first learning step finishes.
In the second learning step (starting from line 5), recall that \(s= \mathrm{safe}\) and \(a = \mathrm{neglect}\). By performing a in s, suppose that the agent transitions to state safe and receives a reward of 2. Thus, \(s' = \mathrm{safe}\) and \(r = 2\). Then we have to choose an action in \(s'\), using the softmax policy (line 7). Since \(Q(\mathrm{safe}, \mathrm{care}) = 0.1\) and \(Q(\mathrm{safe},\mathrm{neglect}) = 0\), by Eq. (3.8), the probabilities of choosing care and neglect are approximately 0.52 and 0.48, respectively. Thus, we see that after the first learning step, care has a higher probability of being chosen. Suppose \(a' = \mathrm{care}\); then we can update \(Q(\mathrm{safe},\mathrm{neglect})\) (line 8). Recall that \(Q(s',a') = Q(\mathrm{safe}, \mathrm{care}) = 0.1\). We can easily obtain that the new \(Q(\mathrm{safe}, \mathrm{neglect})\) value is 0.209. After we update s and a (line 9), the second learning step finishes.
Since there is no terminal state in this MDP, in theory the learning loop continues infinitely. In practice, we can exit the loop after a certain number of learning episodes. \(\square \)
As Example 3.2 illustrates, to learn a good policy, SARSA only needs to store a table with a Q-value for each state-action pair and performs very simple Q-value updates (line 8) in every learning step; after enough learning time and under certain conditions (roughly speaking, the \(\tau \) and \(\alpha \) values should approach 0 at certain rates; see [49] for the detailed conditions), the Q-value of each state-action pair is guaranteed to converge to its optimal value (defined in Eq. 3.4) with probability 1, and the optimal policies can thus be derived. If the numbers of states and actions are huge, we may use function approximation techniques to approximate the values of Q(s, a), so as to avoid the huge expense of storing the Q(s, a) matrix.
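The two learning steps walked through in Example 3.2 can be replayed in code. This is a minimal sketch of a tabular SARSA update with a softmax (Boltzmann) policy; the function names are ours, and the environment interaction is replaced by the suppositions of the example.

```python
import math
from collections import defaultdict

def softmax_probs(q, s, actions, tau=1.0):
    """Softmax (Boltzmann) action probabilities from Q-values, in the style of Eq. (3.8)."""
    exps = [math.exp(q[(s, a)] / tau) for a in actions]
    z = sum(exps)
    return [e / z for e in exps]

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] - q[(s, a)])

# Replaying the two learning steps of Example 3.2:
Q = defaultdict(float)                                    # all Q-values initialised to 0
sarsa_update(Q, "safe", "care", 1, "safe", "neglect")     # first step: Q(safe, care) -> 0.1
sarsa_update(Q, "safe", "neglect", 2, "safe", "care")     # second step: Q(safe, neglect) -> 0.209
```

With \(Q(\mathrm{safe},\mathrm{care}) = 0.1\) and \(Q(\mathrm{safe},\mathrm{neglect}) = 0\), `softmax_probs` reproduces the approximate probabilities 0.52 and 0.48 mentioned in the example.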
4 Argumentation setting
In this section, we present a minimalist rulebased argumentation framework and its abstract account (Sect. 4.1). Then, we move on to the specification of the acceptance labelling of arguments (Sect. 4.2) and statements (Sect. 4.3).
4.1 Argumentation framework
We first present a rule-based argumentation framework and its abstract account, thus specifying the structures on which we will develop our probabilistic setting. This rule-based argumentation setting is ‘minimalist’, so that we avoid the discussion of features which are unnecessary for our goal of providing a basic probabilistic account of argumentation. The argumentation setting is inspired by [19, 40], and uses some (adaptations of) their definitions.
The language of the framework is built from literals. A literal is either an atomic formula (atom) or its negation, and we usually denote a literal as \(\varphi \). For any literal \(\varphi \), the complementary literal is the literal corresponding to the negation of \(\varphi \).
Notation 4.1
We write \(\overline{\varphi } \) to denote the complementary literal of \(\varphi \): if \(\varphi \) is an atom p then \(\overline{\varphi } \) is \(\lnot p\), and if \(\varphi \) is \(\lnot p\) then \(\overline{\varphi } \) is p.
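Notation 4.1 admits a direct computational reading. As a sketch, we represent literals as strings with a `~` prefix standing for \(\lnot \) (a representation assumed here purely for illustration):

```python
def complement(phi: str) -> str:
    """Return the complementary literal (Notation 4.1):
    an atom p maps to ~p, and ~p maps back to p."""
    return phi[1:] if phi.startswith("~") else "~" + phi
```

Note that complementation is an involution: applying it twice yields the original literal.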
We adopt a discrete temporal setting: the timeline is discretised into a set of ‘time slices’ or ‘time steps’ or ‘instants (of time)’ or ‘timestamps’ etc., so that the studied system is described by temporal modal literals, that we also call statements, taken at intervals that are regularly spaced with a predetermined time granularity \(\varDelta \).
Definition 4.1
Let:

\(Atoms \) be a set of propositional atoms,

\(Lit = Atoms \cup \{ \lnot p \mid p \in Atoms \}\) a set of literals,

\(Mod = \{ \square _1, \ldots , \square _n \}\) a set of modal operators, and

\(Times \) a discrete totally ordered set of timestamps \(\{t_1, t_2, \ldots \}\), such that \(t_{i+1} - t_i = \varDelta \), \(\varDelta \in \mathbb {R}^{+}\).
Notation 4.2
 1.
A set of statements will usually be denoted \(\varPhi \), and the set of statements holding at time t as \(\varPhi ^t\), such that \(\varPhi ^t = \{ (\square \varphi \mathrm {\,at\,} t) \mid (\square \varphi \mathrm {\,at\,} t)\in \varPhi \} \).
 2.
A statement may be denoted \(\varphi \) when the modal and temporal information has no importance.
Given a statement, we may be interested only in its modal literal part without the timestamp, i.e. its ‘atemporalised’ statement.
Definition 4.2

Given a statement \((\square \varphi \mathrm {\,at\,} t)\), its atemporalised statement is \((\square \varphi )\).

Given a set of statements \(\varPhi ^t\), its atemporalised set of statements is \(\varPhi = \{ \square \varphi \mid (\square \varphi \mathrm {\,at\,} t) \in \varPhi ^t \}.\)
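As a sketch of Notation 4.2 and Definition 4.2, statements can be represented as (modality, literal, timestamp) triples; the modality names used below ("Bel", "Obl") are illustrative assumptions, not the paper's fixed set of operators:

```python
from typing import NamedTuple

class Statement(NamedTuple):
    mod: str   # modal operator, e.g. "Bel" or "Obl" (names are illustrative)
    lit: str   # literal
    t: int     # timestamp

def at_time(phi, t):
    """The set of statements of phi holding at time t (Notation 4.2)."""
    return {s for s in phi if s.t == t}

def atemporalise(phi_t):
    """Drop timestamps (Definition 4.2): keep only the modal literal parts."""
    return {(s.mod, s.lit) for s in phi_t}
```

Atemporalisation is a projection: distinct statements differing only in their timestamps collapse to the same atemporalised statement.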
Definition 4.1 can be ‘instantiated’ to specify a particular set of statements (including its constituents, such as its set of instants of time), but we will often omit mentioning such a set to avoid overloading the presentation.
Given a set of statements, we can build defeasible rules, so that some statements (defeasibly) support any particular statement.
Definition 4.3
(Defeasible rule) A defeasible rule is an expression of the form \(r:\ \square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots , \square _n \varphi _n \mathrm {\,at\,} t_n, \sim \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_1, \ldots , \sim \square '_m \varphi '_m \mathrm {\,at\,} t'_m \Rightarrow \square \varphi \mathrm {\,at\,} t\), where:
 1.
\(r \in \textsf {LabRules}\) is the unique identifier of the rule,
 2.
\(\square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots , \square _n \varphi _n \mathrm {\,at\,} t_n, \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_{1}, \ldots , \square '_m \varphi '_{m} \mathrm {\,at\,} t'_{m} \in \varPhi \) (\(0 \le n\) and \(0 \le m\)) are statements. \(\square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots ,\square _n \varphi _n \mathrm {\,at\,} t_n\) is the antecedent of the rule, i.e. the conjunction of its positive statements. The set \(\mathrm {Body}(r) = \{\square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots ,\square _n \varphi _n \mathrm {\,at\,} t_n, \sim \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_{1}, \ldots , \sim \square '_m \varphi '_{m} \mathrm {\,at\,} t'_{m}\}\) is the body of the rule r;
 3.
\((\square \varphi \mathrm {\,at\,} t) \in \varPhi \) is the consequent (or head) of the rule, which is a single statement. The consequent of the rule r is denoted \(\mathrm {Head}(r)\), \(\mathrm {Head}(r) = (\square \varphi \mathrm {\,at\,} t).\)
Note that we put no constraints on the timestamps of statements in the body and the head of a defeasible rule, because that is not necessary at this stage. However, some constraints are assumed later (to discard ‘retroactive’ rules for example) when operating on the Markov setting of classic MDPs.
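Defeasible rules can likewise be sketched; the tuple encoding of statements (modality, literal, timestamp) and the attribute names below are our own assumptions, not the paper's notation:

```python
# A sketch of Definition 4.3: a labelled defeasible rule with an antecedent of
# statements, negation-as-failure premises, and a single statement as head.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    label: str              # unique identifier r ∈ LabRules
    antecedent: tuple = ()  # statements □1φ1 at t1, ..., □nφn at tn
    naf_premises: tuple = ()  # statements under ∼ (negation as failure)
    head: object = None     # the consequent (□φ at t)

    def body(self):
        # Body(r) = antecedent ∪ {∼ψ | ψ a NAF premise}
        return set(self.antecedent) | {("naf", s) for s in self.naf_premises}

# A rule with an empty body: its head is an assumption (cf. Definition 4.8 ff.).
r1 = Rule("r1", head=("Bel", "p", 0))
assert r1.body() == set()
r2 = Rule("r2", antecedent=(("Bel", "p", 0),), head=("Bel", "q", 1))
assert ("Bel", "p", 0) in r2.body()
```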
Notation 4.3
In the remainder, we may simply say ‘rule’ instead of ‘defeasible rule’. For the sake of simplicity, we also discard the use of ‘strict’ rules in our setting, because they are not essential for our purposes.
Rules may lead to conflicting statements (we ensure later that two conflicting statements cannot both be accepted). For this reason, we assume that a conflict relation is defined over statements to express conflicts in addition to those corresponding to negation.
Definition 4.4
(Conflict relation) A conflict relation over a set of statements \(\varPhi \), denoted \(\textit{Conflict} \), is a binary relation over \(\varPhi \), i.e. \(\textit{Conflict} \subseteq \varPhi \times \varPhi \), such that for any \((\square \varphi _1\mathrm {\,at\,} t), (\square \varphi _2 \mathrm {\,at\,} t)\in \varPhi \), if \(\varphi _1\) and \(\varphi _2\) are complementary (i.e. \(\varphi _1 = \overline{\varphi _2}\)), then \((\square \varphi _1\mathrm {\,at\,} t)\) and \((\square \varphi _2 \mathrm {\,at\,} t)\) conflict, i.e. \((\square \varphi _1\mathrm {\,at\,} t, \square \varphi _2\mathrm {\,at\,} t) \in \textit{Conflict} \).
In addition, if a statement \((\square \varphi _1\mathrm {\,at\,} t)\) conflicts with \((\square \varphi _2 \mathrm {\,at\,} t)\) for any modality \(\square \), i.e. \((\square \varphi _1\mathrm {\,at\,} t, \square \varphi _2\mathrm {\,at\,} t) \in \textit{Conflict} \), then we say that literal \(\varphi _1\) conflicts with \(\varphi _2\), denoted \(\textit{Conflict} (\varphi _1, \varphi _2)\) in the remainder.
Two rules may have conflicting heads, and in this case one rule may prevail over the other one. To express this, we use a rule superiority relation \(\succ \) over rules, so that \(r_1\succ r_2\) states that rule \(r_1\) is superior to rule \(r_2\).
Definition 4.5
(Superiority relation over defeasible rules) A superiority relation over a set of rules \(Rul\), denoted \(\succ \), is an anti-reflexive and anti-symmetric binary relation over \(Rul\), i.e. \( \succ \subseteq Rul\times Rul\).
A defeasible theory is built from a set of rules, conflicts over statements and a superiority relation over rules.
Definition 4.6
(Defeasible theory) A defeasible theory is a tuple \(T = \langle Rul, \textit{Conflict}, \succ \rangle \) where \(Rul\) is a set of defeasible rules, \(\textit{Conflict} \) is a conflict relation, and \(\succ \) is a superiority relation over defeasible rules.
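A minimal sketch of Definitions 4.4 to 4.6, under our own tuple encoding of statements; the default clause implements the requirement that complementary literals under the same modality and timestamp always conflict, while further conflicts may be declared explicitly:

```python
# Definition 4.4: a conflict relation containing at least the complementary
# statements, plus any designer-declared pairs. Encodings are ours.
def complementary(lit1, lit2):
    # literals as (atom, positive) pairs; Notation 4.1
    return lit1[0] == lit2[0] and lit1[1] != lit2[1]

def in_conflict(s1, s2, extra_conflicts=frozenset()):
    # Same modality and time with complementary literals always conflict;
    # further pairs may be declared in extra_conflicts.
    (m1, l1, t1), (m2, l2, t2) = s1, s2
    if m1 == m2 and t1 == t2 and complementary(l1, l2):
        return True
    return (s1, s2) in extra_conflicts

s1 = ("Bel", ("p", True), 0)
s2 = ("Bel", ("p", False), 0)
assert in_conflict(s1, s2) and in_conflict(s2, s1)
assert not in_conflict(s1, ("Bel", ("q", True), 0))

# Definitions 4.5/4.6: a defeasible theory bundles rules, Conflict and ≻; here
# ≻ is a set of (stronger, weaker) rule-label pairs, required to be
# anti-reflexive and anti-symmetric.
superiority = {("r1", "r2")}
assert all((b, a) not in superiority for (a, b) in superiority)  # anti-symmetry
assert all(a != b for (a, b) in superiority)                     # anti-reflexivity
```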
One may argue that the conflict relation over statements should be anti-reflexive or symmetric. Similarly, one may argue that the superiority relation should be transitive. However, we aim at building a minimalist rule-based argumentation framework for reinforcement learning agents, where the addition of such properties would overload it, perhaps without bringing any significant advantage. Hence, instead of overloading our approach with such properties, further specification of such relations is delegated to the designer of a defeasible theory.
By combining the defeasible rules in a theory, we can build arguments. We use the definition of arguments as given in [40], adapted to our setting.
Definition 4.7
(Argument) An argument A constructed from a defeasible theory \(T = \langle Rul, \textit{Conflict}, \succ \rangle \) is written \(A:\ A_1, \ldots , A_n, \sim \square _1 \varphi _{1} \mathrm {\,at\,} t_1, \ldots , \sim \square _m \varphi _m \mathrm {\,at\,} t_m \Rightarrow _r \square \varphi \mathrm {\,at\,} t\), where:
\(A \in \textsf {LabArgs}\) is the unique identifier of the argument; and

\(0 \le n\) and \(0 \le m\), and \(A_1, \ldots , A_n\) are arguments constructed from T; and

\(r: \mathrm {Conc} (A_1), \ldots , \mathrm {Conc} (A_n), \sim \square _1 \varphi _{1} \mathrm {\,at\,} t_1, \ldots , \sim \square _m \varphi _m \mathrm {\,at\,} t_m \Rightarrow \square \varphi \mathrm {\,at\,} t\) is a rule in \(Rul\).
An argument without subarguments has the form \(A:\ \sim \square _1 \varphi _{1} \mathrm {\,at\,} t_1, \ldots , \sim \square _m \varphi _m \mathrm {\,at\,} t_m \Rightarrow _r \square \varphi \mathrm {\,at\,} t\), and it is called an assumptive argument. In the remainder, arguments are finite, and any argument bottoms out in assumptive arguments.
Definition 4.8
(Assumptive argument) An argument A is an assumptive argument if, and only if, its set of direct subarguments is empty, i.e. \(\mathrm {DirectSub} (A)= \emptyset \).
In this minimalist rulebased setting, the conclusion of any assumptive argument is an assumption. Any assumption is thus a statement which is the head of a rule with no antecedent.
Notation 4.4
 1.
Given a defeasible theory T, the set of assumptions is denoted \({\mathrm {Assum}}(T)\).
 2.
The set of assumptive arguments whose conclusions are a set of assumptions Assum is denoted \({\mathrm {AssumArg}}(Assum)\).
Arguments may conflict, and thus attacks between arguments may appear. We consider two types of attacks: rebuttals (clashes of incompatible conclusions) and undercuts (attacks on negation-as-failure premises). With regard to rebuttals, we assume that there is a preference relation over arguments determining whether two rebutting arguments mutually attack each other or only one of them (being preferred) attacks the other. The preference relation over arguments can be defined in various ways on the basis of the preference over rules. We adopt a simple last-link ordering according to which an argument A is preferred over another argument B, denoted \(A \succ B\), if, and only if, the rule \(\mathrm {TopRule} (A)\) is superior to the rule \(\mathrm {TopRule} (B)\), i.e. \(\mathrm {TopRule} (A) \succ \mathrm {TopRule} (B)\).
Definition 4.9
(Attacks) Let A and B denote two arguments constructed from a defeasible theory T. Then:
B rebuts A (on \(A'\)) if, and only if, \(\exists A' \in \mathrm {Sub} (A)\) such that \(\mathrm {Conc} (B)\) and \(\mathrm {Conc} (A')\) are in conflict, i.e. \(\textit{Conflict} (\mathrm {Conc} (B),\mathrm {Conc} (A'))\), and \(A' \not \succ B\),

B undercuts A (on \(A'\)) if, and only if, \(\exists A' \in \mathrm {Sub} (A)\) such that \(\sim \mathrm {Conc} (B)\) belongs to the body of \(\mathrm {TopRule} (A')\), i.e. \((\sim \mathrm {Conc} (B)) \in \mathrm {Body}(\mathrm {TopRule} (A'))\).
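The two attack types, together with a last-link preference read off a numeric rule strength (an assumption of this sketch, not the paper's encoding), can be illustrated as follows:

```python
# A sketch of Definition 4.9: B rebuts A on a subargument A' whose conclusion
# conflicts with Conc(B), provided A' is not preferred to B; B undercuts A on
# A' when Conc(B) matches a NAF premise of TopRule(A').
from dataclasses import dataclass

@dataclass(frozen=True)
class Arg:
    name: str
    conc: str       # conclusion (a statement, encoded here as a string)
    naf: tuple = ()  # NAF premises of the top rule
    subs: tuple = ()  # direct subarguments
    rank: int = 0   # strength of TopRule, for the last-link ordering

def sub(a):
    # Sub(A): A together with all (transitive) subarguments.
    out = {a}
    for s in a.subs:
        out |= sub(s)
    return out

def rebuts(b, a, conflict):
    return any(conflict(b.conc, ap.conc) and not (ap.rank > b.rank)
               for ap in sub(a))

def undercuts(b, a):
    return any(b.conc in ap.naf for ap in sub(a))

conflict = lambda x, y: {x, y} == {"p", "not p"}
a1 = Arg("A1", "p")
a = Arg("A", "q", naf=("r",), subs=(a1,))
b = Arg("B", "not p")
c = Arg("C", "r")
assert rebuts(b, a, conflict)   # B attacks A on its subargument A1
assert undercuts(c, a)          # C attacks the NAF premise ∼r of A
```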
On the basis of arguments and attacks between arguments, we use the common definition of an argumentation framework [19].
Definition 4.10
(Argumentation graph) An argumentation graph constructed from a defeasible theory T is a tuple \(\left\langle \mathscr {A}, \leadsto \right\rangle \) where \(\mathscr {A}\) is the set of all arguments constructed from T, and \(\leadsto \subseteq \mathscr {A}\times \mathscr {A}\) is a binary relation of attack.
Notation 4.5
Given a graph \(G = \left\langle \mathscr {A}, \leadsto \right\rangle \), the set of arguments \(\mathscr {A}\) may be denoted \(\mathscr {A}_{G}\). Similarly, the set of statements appearing in the rules of the defeasible theory underlying a graph G may be denoted \(\varPhi _G\).
An example of an argumentation graph is illustrated in Fig. 2. As explained later, the subargument relation is crucial in a probabilistic setting, since the ‘belief’ in an argument necessarily implies the beliefs in its subarguments. For this reason, we now define subtheories and the associated legal subgraphs.
Definition 4.11
(Defeasible subtheory) Let T denote a defeasible theory \(\langle Rul, \textit{Conflict}, \succ \rangle \). A defeasible subtheory U of T is a defeasible theory \(\langle Rul', \textit{Conflict}, \succ \rangle \) such that \(Rul' \subseteq Rul\).
Definition 4.12
(Legal subgraph) A legal subgraph H of an argumentation graph \(G = \left\langle \mathscr {A}, \leadsto \right\rangle \) is an argumentation graph \(\left\langle \mathscr {A}_H, \leadsto _H \right\rangle \) such that:
\(\mathscr {A}_H\) is the set of all arguments constructed from a defeasible subtheory U of T, and

\(\leadsto _H = \leadsto \cap (\mathscr {A}_H \times \mathscr {A}_H)\).
To recap, our argumentation framework is based on defeasible theories featuring defeasible rules, conflicts amongst statements and a superiority relation over these rules to resolve conflicts. Given a defeasible theory, we can build arguments and attacks between arguments, leading to an argumentation graph and its legal subgraphs. Next, we will see how to label any argument with an argumentation status.
4.2 Labelling of arguments
Given an argumentation graph, we can compute sets of acceptable or discarded arguments, i.e. arguments that do or do not survive attacks and counterattacks. To do so, we label arguments as reviewed in [7], slightly adapted to our upcoming probabilistic setting. Accordingly, we distinguish three labellings: \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling, \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}\}\)-labelling and \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}},{\small {\textsf {OFF}}}\}\)-labelling. In a \(\{{\small {\textsf {ON}}},{\small {\textsf {OFF}}}\}\)-labelling, each argument is associated with one label, which is either \({\small {\textsf {ON}}}\) or \({\small {\textsf {OFF}}}\), to indicate whether the argument is expressed or not (i.e. whether the event of the argument occurs or not). In a \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}\}\)-labelling, each argument is associated with one label, which is either \({\small {\textsf {IN}}}\), \({\small {\textsf {OUT}}}\) or \({\small {\textsf {UND}}}\): the label ‘\({\small {\textsf {IN}}}\)’ means the argument is accepted, while the label ‘\({\small {\textsf {OUT}}}\)’ indicates that it is rejected. The label ‘\({\small {\textsf {UND}}}\)’ marks the status of the argument as undecided. The \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}},{\small {\textsf {OFF}}}\}\)-labelling extends a \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}\}\)-labelling with the \({\small {\textsf {OFF}}}\) label to indicate that an argument is omitted, that is, depending on the context, not believed or unexpressed.
Definition 4.13
(Argument labelling) Let G be an argumentation graph, and \(\textsf {ArgLabels}\) a set of labels for arguments. An \(\textsf {ArgLabels}\)labelling of a set \(\mathscr {A} \subseteq \mathscr {A}_G\) is a total function \(\mathrm {L}: \mathscr {A} \rightarrow \textsf {ArgLabels}\).
In particular:
a \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling of G is a total function \(\mathrm {L}: \mathscr {A}_G \rightarrow \{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\);

a \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)labelling of G is a total function \(\mathrm {L}: \mathscr {A}_G \rightarrow \{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\);

a \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling of G is a total function \(\mathrm {L}: \mathscr {A}_G \rightarrow \{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\).
Notation 4.6
 1.
The set of all possible \(\textsf {ArgLabels}\)labelling assignments of a set of arguments \(\mathscr {A}\) is denoted as \(\mathscr {L}_{\textsf {ArgLabels}}(\mathscr {A})\), and, given an argumentation graph G, we may write \(\mathscr {L}_{\textsf {ArgLabels}}(G)\) instead of \(\mathscr {L}_{\textsf {ArgLabels}}(\mathscr {A}_G)\).
 2.
Given an \(\textsf {ArgLabels}\)labelling \(\mathrm {L}\), the set of arguments with a label l may be denoted \(l(\mathrm {L})\), i.e. \(l(\mathrm {L}) = \{A \mid \mathrm {L}(A) = l \}\). For example, \({\small {\textsf {ON}}}(\mathrm {L}) = \{A\mid \mathrm {L}(A) = {\small {\textsf {ON}}}\}\).
 3.
A \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}\}\)labelling \(\mathrm {L}\) may be represented as a tuple \(\langle {\small {\textsf {IN}}}(\mathrm {L}), {\small {\textsf {OUT}}}(\mathrm {L}), {\small {\textsf {UND}}}(\mathrm {L}) \rangle \), and a \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling \(\mathrm {L}\) as a tuple \(\langle {\small {\textsf {IN}}}(\mathrm {L}), {\small {\textsf {OUT}}}(\mathrm {L}),{\small {\textsf {UND}}}(\mathrm {L}), {\small {\textsf {OFF}}}(\mathrm {L}) \rangle \).
Generally, not all labellings in \(\mathscr {L}_{\textsf {ArgLabels}}(G)\) are meaningful or have satisfactory properties. An X\(\textsf {ArgLabels}\)labelling specification identifies for any argument graph G a subset of \(\mathscr {L}_{\textsf {ArgLabels}}(G)\).
Definition 4.14
(Argument labelling specification) Let G denote an argumentation graph and \(\textsf {ArgLabels}\) a set of labels. An X-\(\textsf {ArgLabels}\)-labelling specification of a set of arguments \(\mathscr {A} \subseteq \mathscr {A}_G\) identifies a set of \(\textsf {ArgLabels}\)-labellings of \(\mathscr {A}\), denoted \(\mathscr {L}^X_{\textsf {ArgLabels}}(\mathscr {A})\), such that \(\mathscr {L}^X_{\textsf {ArgLabels}}(\mathscr {A}) \subseteq \mathscr {L}_{\textsf {ArgLabels}}(\mathscr {A})\).
We will focus on specifications of \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)labellings and \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labellings based on complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)labellings [7].
Definition 4.15
(Complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling) A complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling of an argumentation graph G is a \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling such that, for every argument \(A \in \mathscr {A}_G\):
A is labelled \({\small {\textsf {IN}}}\) if, and only if, all attackers of A are \({\small {\textsf {OUT}}}\),

A is labelled \({\small {\textsf {OUT}}}\) if, and only if, A has an attacker \({\small {\textsf {IN}}}\).
Since a labelling is a total function, if an argument is not labelled \({\small {\textsf {IN}}}\) or \({\small {\textsf {OUT}}}\), then it is \({\small {\textsf {UND}}}\).
An argumentation graph may have several complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labellings: we will focus on the unique complete labelling with the smallest set of arguments labelled \({\small {\textsf {IN}}}\) (or, equivalently, the largest set labelled \({\small {\textsf {UND}}}\)), called the grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)-labelling [7, 19].
Definition 4.16
(Grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)labelling) A grounded\(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)labelling\(\mathrm {L}\) of an argumentation graph G is a complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)labelling of G such that \({\small {\textsf {IN}}}(\mathrm {L})\) is minimal (w.r.t. set inclusion) amongst all complete \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)labellings of G.
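The grounded labelling can be computed by the classic fixpoint construction; the following sketch (our own, not an algorithm from the paper) illustrates why the computation is polynomial in the number of arguments:

```python
# Grounded {IN, OUT, UND}-labelling by fixpoint: repeatedly label IN every
# argument whose attackers are all OUT, and OUT every argument with an IN
# attacker; whatever remains unlabelled is UND.
def grounded(args, attacks):
    # attacks: set of pairs (attacker, attacked)
    label = {}
    attackers = {a: {x for (x, y) in attacks if y == a} for a in args}
    changed = True
    while changed:
        changed = False
        for a in args:
            if a in label:
                continue
            if all(label.get(x) == "OUT" for x in attackers[a]):
                label[a] = "IN"; changed = True
            elif any(label.get(x) == "IN" for x in attackers[a]):
                label[a] = "OUT"; changed = True
    for a in args:
        label.setdefault(a, "UND")
    return label

# A attacks B, B attacks C; D and E attack each other (an even cycle).
L = grounded({"A", "B", "C", "D", "E"},
             {("A", "B"), ("B", "C"), ("D", "E"), ("E", "D")})
assert L == {"A": "IN", "B": "OUT", "C": "IN", "D": "UND", "E": "UND"}
```

Each pass over the arguments either fixes at least one new label or terminates, hence at most a quadratic number of label checks overall.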
Moving towards the forthcoming probabilistic setting, we extend these standard labellings by adding a label \({\small {\textsf {OFF}}}\) to indicate arguments excluded from reasoning. The idea is to match any legal subgraph with a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling by ‘switching off’ the arguments outside the considered subgraph, and we perform a similar operation to define grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings.
Definition 4.17
(Legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling) Let H be a legal subgraph of an argumentation graph G. The legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of G w.r.t. H is the \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling such that:
every argument in \(\mathscr {A}_H\) is labelled \({\small {\textsf {ON}}}\),

every argument in \(\mathscr {A}_G \backslash \mathscr {A}_H\) is labelled \({\small {\textsf {OFF}}}\).
Definition 4.18
(Grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling) Let H be a legal subgraph of an argumentation graph G. The grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling of G w.r.t. H is the labelling such that:
every argument in \(\mathscr {A}_H\) is labelled as in the grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)labelling of H,

every argument in \(\mathscr {A}_G \backslash \mathscr {A}_H \) is labelled \({\small {\textsf {OFF}}}\).
We can note that a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling of an argumentation graph G can be computed in a time that is polynomial in the number of arguments of G.
Lemma 4.1
(Time complexity of a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling) The time complexity of computing a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling of an argumentation graph G is \(O(|\mathscr {A}_G|^{c})\) for some constant c.
To recap, given an argumentation graph, we can label arguments following a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling and its grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling counterpart, using a polynomial algorithm. Next, we look at the labelling of statements.
4.3 Labelling of statements
Up to this point, we have worked with the labelling of arguments with no consideration for the labelling of statements, and in particular no consideration for the labelling of the conclusions supported by arguments.
Labellings of statements can be performed in different manners, see e.g. [8]. From an abstract point of view, given a set of statements, a labelling of this set is a total function associating any statement with a label.
Definition 4.19
(Statement labelling) Let \(\varPhi \) be a set of statements, and \(\textsf {LitLabels}\) a set of labels on statements. A \(\textsf {LitLabels}\)labelling of \(\varPhi \) is a total function \(K: \varPhi \rightarrow \textsf {LitLabels}\).
Per se, a labelling of literals is just a function mapping a set of statements to a set of labels, but such a labelling may rely on an acceptance labelling of arguments. For our purposes, we will use acceptance statement labellings [8]: given an argumentation graph built from a defeasible theory, a statement labelling is defined with respect to each argument labelling in a specific set of argument labellings of the graph.
Definition 4.20
(Acceptance statement labelling) Let:
G denote an argumentation graph built from a defeasible theory,

\(\mathscr {L}=\mathscr {L}^{X}_{\textsf {ArgLabels}}(G)\) the set of X\(\textsf {ArgLabels}\)labellings of G,

\(\varPhi \) a set of statements, and

\(\textsf {LitLabels}\) a set of labels on statements. An acceptance statement labelling is a total function \(\mathrm {K}: \mathscr {L} \times \varPhi \rightarrow \textsf {LitLabels}\).
For example, if we write \(\mathrm {K}(\mathrm {L},\varphi ) = \textsf {in}\), then it means that, given the argument labelling \(\mathrm {L}\), the statement \(\varphi \) is labelled \(\textsf {in}\).
Various acceptance statement labellings can be specified [8]. We will focus on the simplest labelling we can think of for our purposes, the bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling, according to which a statement is either accepted or not, without further sophistication. If a statement is accepted, then it is labelled ‘\(\textsf {in}\)’; otherwise it is labelled ‘\(\textsf {no}\)’.
Definition 4.21
(Bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling) Let:
G denote an argumentation graph built from a defeasible theory,

\(\mathscr {L}=\mathscr {L}^{\textsf {grounded}}_{\{\textsf {IN}, \textsf {OUT}, \textsf {UND}, \textsf {OFF} \}}(G)\) the set of grounded \(\{ {\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labellings of G, and

\(\varPhi \) a set of statements. The bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling is the acceptance statement labelling \(\mathrm {K}\) such that, for any labelling \(\mathrm {L} \in \mathscr {L}\) and any statement \(\varphi \in \varPhi \), \(\mathrm {K}(\mathrm {L}, \varphi ) = \textsf {in}\) if, and only if, there exists an argument A labelled \({\small {\textsf {IN}}}\) in \(\mathrm {L}\) such that \(\mathrm {Conc} (A) = \varphi \); otherwise \(\mathrm {K}(\mathrm {L}, \varphi ) = \textsf {no}\).
An algorithm for computing the bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling of a set of statements from a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling is presented in Algorithm 4. We can note that, given a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling of an argumentation graph G, the bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling of a set of statements \(\varPhi \) can be computed in time linear in the number of arguments of G times the number of statements in \(\varPhi \) (more efficient algorithms may exist).
Lemma 4.2
(Time complexity of a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling) Given a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling \(\mathrm {L}\) and a finite set of statements \(\varPhi \), the time complexity of computing the bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling assignment \(\mathrm {K}(\mathrm {L}, \varphi )\) for every \(\varphi \in \varPhi \) is \(O(|\varPhi | \times |{\small {\textsf {IN}}}(\mathrm {L})|)\).
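Algorithm 4 is not reproduced here, but a minimal sketch of the computation it performs, under the assumption that a statement is accepted exactly when some \({\small {\textsf {IN}}}\)-labelled argument concludes it, runs as follows:

```python
# Bivalent {in, no}-labelling from an argument labelling: for each statement,
# scan the IN-labelled arguments for a matching conclusion, which yields the
# O(|Φ| × |IN(L)|) bound of Lemma 4.2.
def bivalent(statements, arg_labelling, conc):
    # conc: maps each argument to its conclusion (a statement)
    in_args = [a for a, l in arg_labelling.items() if l == "IN"]
    return {s: ("in" if any(conc[a] == s for a in in_args) else "no")
            for s in statements}

L = {"A": "IN", "B": "OUT", "C": "OFF"}
conc = {"A": "p", "B": "not p", "C": "q"}
K = bivalent({"p", "not p", "q"}, L, conc)
assert K == {"p": "in", "not p": "no", "q": "no"}
```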
We can accommodate Definition 4.21 so that \(\mathscr {L}=\mathscr {L}^{\textsf {legal}}_{ {\{\textsf {ON}, \textsf {OFF} \}} }(G)\), i.e. \(\mathscr {L}\) is the set of legal \(\{ {\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labellings of the argumentation graph G. In this case, we can straightforwardly consider the corresponding set of grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labellings, so that with a slight notational shortcut, we have the bivalent \(\{\textsf {in}, \textsf {no}\}\)labelling \(\mathrm {K}(\mathrm {L}, \varPhi )\) where \(\mathrm {L}\) is a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling.
Finally, as we have considered atemporalised statements, we also consider atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labellings.
Definition 4.22
(Atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling) Let \(\varPhi ^t\) denote a set of statements holding at time t, and \(\varPhi \) its set of atemporalised statements. Given a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling \(\mathrm {K}^t\) of \(\varPhi ^t\), its atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling is a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling \(\mathrm {K}\) of \(\varPhi \) such that, for any modal literal \(\square \varphi \in \varPhi \), \(\square \varphi \) is labelled \(\textsf {in}\), i.e. \(\mathrm {K}(\square \varphi ) = \textsf {in}\), if and only if \(\mathrm {K}^t(\square \varphi \mathrm {\,at\,} t) = \textsf {in}\).
In summary, given a labelling of arguments (in particular a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling), a set of statements can be labelled according to a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling with an efficient algorithm. And from a bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling, we can obtain its atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling.
5 Probabilistic argumentation setting
In this section, we endorse probabilistic labellings of arguments as developed in [44] (Sect. 5.1), compact representations (Sect. 5.2) and ‘memoryless’ properties to deal with the complexity of the temporal setting (Sect. 5.3).
5.1 Probabilistic labellings
Having laid out an approach encompassing legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labellings as well as grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}\}\)labellings and grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labellings, we are now ready to introduce the treatment of probabilistic uncertainty as proposed in the approach of probabilistic labellings [44]. Given an X\(\textsf {ArgLabels}\)labelling specification for an argumentation graph G, a probability value is assigned to each element of the set \(\mathscr {L}^{X}_{\textsf {ArgLabels}}(G)\) of labellings, which represents the sample space. Intuitively, each labelling is viewed as a possible outcome, having a certain probability to occur.
Given an X-\(\textsf {ArgLabels}\)-labelling specification for an argumentation graph G, there is a choice between defining the sample space as either (i) the set \(\mathscr {L}^{X}_{\textsf {ArgLabels}}(G)\) of labellings of G, or (ii) the set \(\smash {{\mathscr {L}_{\textsf {ArgLabels}}(G)}}\) of all labellings (where the probability of any labelling in \(\smash {{\mathscr {L}_{\textsf {ArgLabels}}(G)} \backslash \mathscr {L}^{X}_{\textsf {ArgLabels}}(G)}\) is set to 0). In practice, this distinction matters little, but the second definition better fits the use of factors (as we will conceive them shortly), which is why it is favoured here.
Notation 5.1
Whilst our notational nomenclature refers to X\(\textsf {ArgLabels}\)labellings (e.g. one may refer to \(\textsf {legal}\)\(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labellings, or \(\textsf {grounded}\)\(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labellings and so on) where X and \(\textsf {ArgLabels}\) refer to an X\(\textsf {ArgLabels}\)labelling specification, for the sake of notational conciseness we will sometimes speak of \(\mathscr {S}\)labellings, using a single symbol \(\mathscr {S}\) to synthesise the pair of symbols X\(\textsf {ArgLabels}\), i.e. \(\mathscr {S} = X\)\(\textsf {ArgLabels}\).
Definition 5.1
(Probabilistic labelling frame) A probabilistic labelling frame is a tuple \(\langle G, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \), where G is an argumentation graph, \(\mathscr {S}\) an X-\(\textsf {ArgLabels}\)-labelling specification, and \(\langle \varOmega , F, P \rangle \) a probability space such that:
the sample space \(\varOmega \) is the set of labellings of G, \(\varOmega = {\mathscr {L}_{\textsf {ArgLabels}}(G)}\),

the \(\sigma \)-algebra F is the power set of \(\varOmega \),

the function P from F to [0, 1] is a probability distribution satisfying the Kolmogorov axioms, such that the probability of any labelling \(\mathrm {L}\) not in the set \(\mathscr {L}^{X}_{\textsf {ArgLabels}}(G)\) of labellings of G is 0, i.e. \(\smash {\forall \mathrm {L}\in \varOmega \backslash \mathscr {L}^{X}_{\textsf {ArgLabels}}(G),~P(\{\mathrm {L}\})=0 }\).
Since any legal \(\{{\small {\textsf {ON}}},{\small {\textsf {OFF}}}\}\)labelling can be trivially mapped to one and only one grounded \(\{{\small {\textsf {IN}}},{\small {\textsf {OUT}}},{\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling (as we can visualise in Fig. 5), we can straightforwardly map a probabilistic labelling frame \(\langle G, \mathrm {legal}\text{ }\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}, \langle \varOmega ,F,P \rangle \rangle \) into a probabilistic labelling frame \(\langle G, \mathrm {grounded}\text{ }\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}, \langle \varOmega ', F', P' \rangle \rangle \): let \(\mathrm {L}\) denote a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling in \(\varOmega \), and let \(\mathrm {L}'\) denote its grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling counterpart in \(\varOmega '\), such that \(\mathrm {L}(A) = {\small {\textsf {ON}}}\) if, and only if, \(\mathrm {L}'(A) = {\small {\textsf {IN}}}\) or \(\mathrm {L}'(A) = {\small {\textsf {OUT}}}\) or \(\mathrm {L}'(A) = {\small {\textsf {UND}}}\); we may state \(P(\mathrm {L}) = P'(\mathrm {L}')\) so that the probability of a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling equals the probability of its legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling counterpart. Therefore, using this mapping, we may compute the probability of a grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling by computing the probability of a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling, and vice versa. 
In the rest of the paper, we will focus on legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labellings and grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labellings for our probabilistic setting, and we will often use the abovementioned mapping, leaving the labelling specification \(\mathscr {S}\) of a probabilistic labelling frame \(\langle G, \mathscr {S}, \langle \varOmega ,F,P \rangle \rangle \) possibly holding for \(\textsf {legal}\)\(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\) or \(\textsf {grounded}\)\(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\).
Now that we have defined our probability space, we can work with random variables, i.e. functions (traditionally denoted by upper-case letters such as X, Y or Z) from the sample space \(\varOmega \) into another set of elements. Accordingly, for every argument A, we introduce a categorical random variable called a random labelling, denoted \(L_A\), from \(\varOmega \) into a set of labels, presumably \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\) or \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\). So the event \(L_A = {\small {\textsf {ON}}}\) is shorthand for the set of outcomes \(\{ \mathrm {L}\in \varOmega \mid \mathrm {L}(A) = {\small {\textsf {ON}}}\}\), or \(\{ \mathrm {L}\in \varOmega \mid \mathrm {L}(A) = {\small {\textsf {IN}}} \text{ or } \mathrm {L}(A) = {\small {\textsf {OUT}}} \text{ or } \mathrm {L}(A) = {\small {\textsf {UND}}}\}\) in the case of \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labellings.
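As an illustration of such events (with made-up probability values over a two-argument graph), the probability of the event \(L_A = {\small {\textsf {ON}}}\) is obtained by summing the probabilities of the matching outcomes:

```python
# A toy sample space of {ON, OFF}-labellings with probabilities; the event
# L_A = ON collects every labelling mapping argument A to ON.
omega = {  # labelling (as a frozenset of (argument, label) pairs) -> probability
    frozenset({("A", "ON"), ("B", "ON")}): 0.25,
    frozenset({("A", "ON"), ("B", "OFF")}): 0.25,
    frozenset({("A", "OFF"), ("B", "ON")}): 0.30,
    frozenset({("A", "OFF"), ("B", "OFF")}): 0.20,
}

def prob(event):
    # event: a predicate over labellings; P(event) sums matching outcomes
    return sum(p for lab, p in omega.items() if event(dict(lab)))

assert abs(sum(omega.values()) - 1.0) < 1e-9            # Kolmogorov: P(Ω) = 1
assert abs(prob(lambda L: L["A"] == "ON") - 0.5) < 1e-9  # P(L_A = ON)
```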
We also introduce random variables for the labelling of statements. Accordingly, for any statement \(\varphi \), we introduce a categorical random variable which is denoted \(K_\varphi \) and which can take value in the set \(\textsf {LitLabels}\) of labels of a specified \(\textsf {LitLabels}\)labelling of statements. These random variables for statements are also called random labellings.
Notation 5.2
 1.
When the context does not give rise to any ambiguity, we will abbreviate the random labelling \(L_A\) by simply A (in italics).
 2.
We denote by Val(X) the set of values that a random labelling X can take. For example, Val\((L_A) = \{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\).
 3.
We use upper boldface type to denote sets of random labellings. So \(\mathbf L \) denotes a set of random labellings \(\{L_{A_1}, \ldots , L_{A_n}\}\), and \(\mathbf {K}\) denotes a set of random labellings \(\{K_{\varphi _1}, \ldots , K_{\varphi _n}\}\).
 4.
We use lower boldface type to denote assignments to a set of random labellings, i.e. assignments of values to the variables in this set. So given \(\mathbf {L} = \{L_{A_1}, \ldots , L_{A_n}\}\), a possible assignment is \(\mathbf {l} = \{L_{A_1} = {\small {\textsf {ON}}}, \ldots , L_{A_n}= {\small {\textsf {OFF}}}\}\).
 5.
A labelling assignment \(\mathbf {l} = \{L_{A_1} = l_1, \ldots , L_{A_n} = l_n\}\) may be used to denote the assignment corresponding to a labelling \(\mathrm {L}\) (and viceversa), such that \(L_{A_i} = l_i\) if, and only if, \(\mathrm {L}(A_i) = l_i\). The same shortcut applies for the labelling of statements.
 6.
The joint distribution over a set \(\mathbf L = \{L_{A_1}, L_{A_2}, \ldots , L_{A_n} \}\) of random labellings is formally denoted \(P(\{L_{A_1}, L_{A_2}, \ldots , L_{A_n} \})\), but we will often write it \(P(L_{A_1}, L_{A_2}, \ldots , L_{A_n})\).
Example 5.1
Referring to the probabilistic labelling frame illustrated in Fig. 5, we can assert \(P(\{ L_{\textsf {B1}} = {\small {\textsf {IN}}}, L_{\textsf {B2}} = {\small {\textsf {IN}}}, L_{\textsf {B}} = {\small {\textsf {IN}}}\} ) = 1/4 \). As an alternative notation, given the assignment \(\mathbf {l} = \{ L_{\textsf {B1}} = {\small {\textsf {IN}}}, L_{\textsf {B2}} = {\small {\textsf {IN}}}, L_{\textsf {B}} = {\small {\textsf {IN}}}\}\), we have \(P(\mathbf {l}) = 1/4 \). \(\square \)
5.2 Compact representation
In our probabilistic setting for argumentation, instead of specifying the probability of every possible labelling, we can resort to compact representations of the sample space; we do so by considering factors, which allow us to break down a joint probability into a product of manageable parts.
So, given a set \(\mathbf L \) of random labellings, we employ (positive) factors (see [29]): a factor is a function, denoted \(\phi \), from the set of possible assignments Val\((\mathbf L )\) to positive real numbers \(\mathbb {R}^+\). The set \(\mathbf L \) is called the scope of the factor \(\phi \). On this basis, we can write the joint distribution of random labellings as a Gibbs distribution parametrised by a set of factors.
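A minimal sketch of such a Gibbs distribution, assuming two hypothetical random labellings and two hand-picked positive factors (the factor values are illustrative, not taken from the paper):

```python
from itertools import product

# Two hypothetical random labellings with values ON/OFF.
ARGS = ["A", "B"]
LABELS = ["ON", "OFF"]
assignments = [dict(zip(ARGS, labs)) for labs in product(LABELS, repeat=len(ARGS))]

# Positive factors: phi1 has scope {A}, phi2 has scope {A, B}.
def phi1(a):
    return 2.0 if a["A"] == "ON" else 1.0

def phi2(a):
    return 3.0 if a["A"] == a["B"] else 1.0

factors = [phi1, phi2]

def unnormalised(a):
    """Product of the factors applied to assignment a."""
    w = 1.0
    for phi in factors:
        w *= phi(a)
    return w

Z = sum(unnormalised(a) for a in assignments)   # partition function
P = {tuple(a.items()): unnormalised(a) / Z for a in assignments}

print(Z)                                # 12.0
print(P[(("A", "ON"), ("B", "ON"))])    # 0.5
```
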
Definition 5.2
Example 5.2

\(R_r(\mathrm {L}) =1\) if, and only if, the rule r is included in the set of rules of some argument A labelled \({\small {\textsf {ON}}}\) (or \({\small {\textsf {IN}}}\), \({\small {\textsf {OUT}}}\) or \({\small {\textsf {UND}}}\)) in labelling \(\mathrm {L}\),

\(R_r(\mathrm {L}) = 0\) otherwise.
Definition 5.3

p is a real number in [0, 1] (the marginal probability of the rule);

\( r:\ \square _1 \varphi _1 \mathrm {\,at\,} t_1,\ldots , \square _n \varphi _n \mathrm {\,at\,} t_n, \sim \square '_1 \varphi '_{1} \mathrm {\,at\,} t'_1, \ldots , \sim \square '_m \varphi '_m \mathrm {\,at\,} t'_m \Rightarrow \square \varphi \mathrm {\,at\,} \, t \) is a defeasible rule over \(\varPhi \).
Definition 5.4
(Probabilistic defeasible theory) A probabilistic defeasible theory is a tuple \(\langle Rul, \textit{Conflict}, \succ \rangle \) where \(Rul\) is a set of probabilistic defeasible rules, \(\textit{Conflict} \) is a conflict relation, and \(\succ \) is a superiority relation over defeasible rules.
A probabilistic defeasible theory is just a slight development of a defeasible theory (Definition 4.6) where every rule is given with its marginal probability. From a probabilistic defeasible theory, we can thus straightforwardly obtain the defeasible theory where the probabilities on rules are omitted, and from which we can build an argumentation graph and thus a probabilistic argumentation frame. The probabilities on rules are then simply constraints on the (Gibbs) probability distribution P of the frame, see [44].
Concerning computational complexity, a naive approach to drawing a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling is to make a table recording all the possible legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings, then compute the probability of every \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling (through the Gibbs–Boltzmann distribution), and finally draw a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling from this distribution. Unfortunately, this approach is not efficient because the number of legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings of an argumentation graph G with \(n = |\mathscr {A}_G|\) arguments is \(2^n\) in the worst case. Probabilistic rules and probabilistic defeasible theories provide a more efficient means to draw legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labellings.
Lemma 5.1
(Time complexity of drawing a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling) Let \(T =\langle Rul, \textit{Conflict}, \succ \rangle \) denote a probabilistic defeasible theory. The time complexity of drawing a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of the argumentation graph G constructed from T is \(O( |Rul| \cdot |\mathscr {A}_G|^2 )\).
Proof
Firstly, the complexity of generating a random number is O(1), so the complexity of drawing \(m \le |Rul|\) defeasible rules is \(O(|Rul|)\). Secondly, suppose that the argumentation graph G built from defeasible theory T is built ex ante, along with a mapping from rules to arguments associating every rule r with the set of arguments \(\mathscr {A}_r\) necessarily built with r. For every rule r, if rule r was not drawn then every argument in \(\mathscr {A}_r\) is labelled \({\small {\textsf {OFF}}}\). Deleting the arguments labelled \({\small {\textsf {OFF}}}\) in \(\mathscr {A}_G\) is \(O(|Rul| \cdot |\mathscr {A}_G|^2)\). Every argument which is not labelled \({\small {\textsf {OFF}}}\) is labelled \({\small {\textsf {ON}}}\). Therefore, the time complexity of drawing a legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of the argumentation graph G constructed from T is \(O(|Rul| \cdot |\mathscr {A}_G|^2)\). \(\square \)
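The drawing procedure of the proof can be sketched as follows, under an assumed toy theory of two rules and two arguments (all names and probabilities are illustrative):

```python
import random

# A hypothetical probabilistic defeasible theory: rule -> marginal probability.
RULES = {"r1": 0.9, "r2": 0.5}
# Precomputed (ex ante) mapping from each argument to the rules it is built with.
ARG_RULES = {"A1": {"r1"}, "A2": {"r1", "r2"}}

def draw_labelling(rng=random):
    # Step 1: draw the rules independently -- O(|Rul|).
    drawn = {r for r, p in RULES.items() if rng.random() < p}
    # Step 2: an argument is OFF as soon as one of its rules was not drawn,
    # and ON otherwise -- O(|Rul| * |A_G|^2) overall in the worst case.
    return {A: ("ON" if rules <= drawn else "OFF")
            for A, rules in ARG_RULES.items()}

random.seed(0)
L = draw_labelling()
# A2 is built with r1 and r2, so it can only be ON when A1 (built with r1) is.
assert not (L["A2"] == "ON" and L["A1"] == "OFF")
```
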
So, factors (along with energy-based models) and probabilistic rules allow us to have compact representations of probabilistic frames. In practice, we can specify a probabilistic argumentation frame from a probabilistic defeasible theory, though such a theory will often be left implicit in the remainder. We later use these constructs to account for MDPs and learning agents. We will restrict ourselves to a memoryless account of MDPs, and in this regard, we endorse next a ‘memoryless’ account of argumentation.
5.3 Memoryless argumentation
Let us recap. Given a defeasible theory (Definition 4.6), we first construct an argumentation graph by building arguments (Definition 4.7) and the attacks between them (Definition 4.9). This argumentation graph is used in a probabilistic labelling frame (Definition 5.1), and to deal with joint probabilities in this probabilistic labelling frame, we factorise arguments into several ‘groups’ scoped by factors so that we can write the joint distribution of random labellings as a Gibbs distribution (Definition 5.2). Such a distribution can have an exponential form (so that, as we will see soon, energies can be shaped by RL), but it can also account for probabilistic defeasible rules and probabilistic defeasible theories (Definition 5.4).
However, a major issue with the framework so far is that argumentation graphs may become very large, especially in a temporal setting. Let us illustrate this point with the following example.
Example 5.3
To address the size of argumentation graphs in a temporal setting, we take advantage of the memoryless property in MDPs (see Sect. 3.1). In that regard, we differentiate a logic-based memoryless setting and a probabilistic-based memoryless setting.
Both the logic-based and probabilistic-based memoryless settings introduced above establish a Markovian model of transitions between labellings. We will use labellings of statements to represent states and, furthermore, use our PA framework to represent the MDPs and build RL algorithms. The details of our argumentation-based MDP and RL framework are presented in the next section. In the rest of this section, in preparation for our argument-based MDP setting, we first deal with the representation of states with labellings, and then present how we can model the (Markovian style) transition between states by using the memoryless setting introduced above.
A state is a labelling of statements, depending on the labelling specification.
Definition 5.5
(\(\textsf {LitLabels}\)-state) Let \(\varPhi ^t\) be a set of statements holding at time t. A \(\textsf {LitLabels}\)-state at time t with respect to \(\varPhi ^t\) is a \(\textsf {LitLabels}\)-labelling of \(\varPhi ^t\).
For our purposes, only \(\{\textsf {in}, \textsf {no}\}\)-states will be considered, and a \(\{\textsf {in}, \textsf {no}\}\)-state at time t with respect to \(\varPhi ^t\) may be simply called a state, leaving the set \(\varPhi ^t\) implicit.
Notation 5.3
 1.
A \(\{\textsf {in}, \textsf {no}\}\)-state at time t with respect to \(\varPhi ^t\) may be denoted \(\mathrm {K}^t\) or \(s_t\).
 2.
The set of all \(\{\textsf {in}, \textsf {no}\}\)-states with respect to \(\varPhi ^t\) is denoted \(\mathscr {K}(\varPhi ^t)\).
In Definition 5.5, a state at time t includes timestamps because any statement representation includes a timestamp. However, one may prefer to work with atemporalised states (as in classic MDP settings). For this reason, and similarly to what we did for atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labellings (Definition 4.22), we define atemporalised states.
Definition 5.6
(Atemporalised \(\{\textsf {in}, \textsf {no}\}\)-state) Let \(\mathrm {K}^t\) denote a \(\{\textsf {in}, \textsf {no}\}\)-state at time t; its atemporalised \(\{\textsf {in}, \textsf {no}\}\)-state is the atemporalised bivalent \(\{\textsf {in}, \textsf {no}\}\)-labelling of \(\mathrm {K}^t\).
We now present how to specify the transition between states in line with the memoryless setting of labellings, as specified in Eq. (5.8). As Example 5.3 illustrates, the number of arguments heading at some time t may be unmanageable from a computational perspective. As a workaround, the transition from a state \(s_{t-\varDelta }\) to the state \(s_t\) will be specified by a probabilistic labelling frame \(\langle G^t, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \) such that the argumentation graph \(G^t\) is built from a transition defeasible theory \(T^t\).
Of course, given a time sequence \(0, \varDelta , 2\cdot \varDelta , \ldots , n\cdot \varDelta \), we will not write a transition defeasible theory by hand for every instant \(0, \varDelta , 2\cdot \varDelta , \ldots , n\cdot \varDelta \). Instead, we will have a template theory, denoted \(T^*\), where most variables (in particular temporal variables) are free to take any value of their domain. From a template theory \(T^*\), we can consider the ground theory, denoted \(T^{\sigma }\), of all the possible ground rules, conflicts, and superiority relations.
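Grounding a template theory over a time sequence can be sketched as follows; the rule is rendered as a plain string with a free temporal variable, which is only a schematic stand-in for the formal rule language:

```python
# Toy grounding of a template rule over the sequence 0, Δ, ..., (n-1)·Δ.
DELTA = 1
N_STEPS = 3

# Hypothetical template rule; {t} and {t2} are the free temporal variables.
TEMPLATE_RULES = [
    "Do_i care at {t} => Hold_i util(1) at {t2}",
]

def ground(template_rules, n_steps, delta):
    """Instantiate the temporal variables with every instant of the sequence."""
    return [r.format(t=t, t2=t + delta)
            for r in template_rules
            for t in range(0, n_steps * delta, delta)]

ground_theory = ground(TEMPLATE_RULES, N_STEPS, DELTA)
print(ground_theory[0])   # Do_i care at 0 => Hold_i util(1) at 1
```
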
Consequently, any factor is constrained such that its scope only includes random labellings of arguments whose top conclusion holds at time t or \(t'\) with \(|t - t'| \le \varDelta \). As for notation, a factor with such a scope is denoted \(\phi ^t\).
Then, the transition from a state \(s_t\) to another state \(s_{t+1}\) will be performed through the ground theory relating the statements holding at time t and those statements presumably holding at time \(t+1\).
Accordingly, we take advantage of the memoryless setting so as to compute the state labellings ‘step by step’ in order to avoid computing the labellings in all time steps at once: given the state labelling at t, we use the memoryless setting to obtain the labelling at time \(t+1\). To do this ‘step by step’ computation, we formally define the transition theory heading at time t as follows.
Definition 5.7

rules heading at time t, i.e. \(Rul^t\), and

rules leading to assumptions holding at time \(t-\varDelta \) appearing in the body of the rules in \(Rul^t\).

arguments whose conclusions hold at time t, and

assumptive arguments whose conclusions hold at time \(t-\varDelta \).
Notation 5.4
A labelling of a graph \(G^{t+\varDelta }\) is denoted \((\mathrm {L}^{t+\varDelta }, \mathrm {L}^{t})\), where \(\mathrm {L}^{t+\varDelta }\) is the labelling of arguments whose conclusions hold at \(t+\varDelta \), and \(\mathrm {L}^{t}\) is the labelling of assumptive arguments whose conclusions hold at t.
Definition 5.8

\(\langle G^{t+\varDelta }, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \) denote a probabilistic labelling frame, where \(G^{t+\varDelta }\) is built from the transition theory \(T^{t+\varDelta }\) heading at time \(t+\varDelta \),

\({Assum}^{t}\) the assumptions at time t, \({Assum}^{t} = \{(\square \varphi \mathrm {\,at\,} t)\mid (\square \varphi \mathrm {\,at\,} t) \in {\mathrm {Assum}}(T^{t+\varDelta }) \}\),

\(\varPhi \supseteq \varPhi _G\) a set of statements,

\(\mathscr {K}(\varPhi ^t)\) the set of states at time t,

\(\mathscr {K}(\varPhi ^{t+\varDelta })\) the set of states at time \(t+\varDelta \),

\(s_t = \mathrm {K}^t\) a state at time t, \(\mathrm {K}^t\in \mathscr {K}(\varPhi ^t) \),

\(s_{t+\varDelta } = \mathrm {K}^{t+\varDelta }\) a state at time \(t+\varDelta \), \(\mathrm {K}^{t+\varDelta }\in \mathscr {K}(\varPhi ^{t+\varDelta }) \).
\(P(s_{t+\varDelta } \mid s_{t}) = P\Big( \mathbf {k}^{t+\varDelta } \,\Big|\, \bigcup _{A \in {\small {\textsf {ON}}}^t} \{ L_A = {\small {\textsf {ON}}}\} \cup \bigcup _{A \in {\small {\textsf {OFF}}}^t} \{ L_A = {\small {\textsf {OFF}}}\} \Big)\)
where\(^{2}\) \({\small {\textsf {ON}}}^t = assumpArg( {Assum}^{t}\cap \textsf {in}(\mathrm {K}^t))\), and \({\small {\textsf {OFF}}}^t = assumpArg( {Assum}^{t}\backslash \textsf {in}(\mathrm {K}^t))\).
From the state \(\smash {s_{t+\varDelta }}\) and the transition theory \(\smash {T^{t+2\cdot \varDelta }}\) we can make a transition to the state \(\smash {s_{t+2\cdot \varDelta }}\), and so on.
Theorem 5.1
Proof
Our step-by-step computational setting thus defines a discrete-time Markov chain in terms of a graph of states over which any transition is ‘argued’ through an argumentation graph. Given a probabilistic labelling frame, we may consider all the possible states and the transition matrix, so that we can reuse common mathematical techniques to study such systems. It is also possible to run Monte-Carlo simulations of an argument-based Markov chain. In this regard, the argument-based transition to a state \(s_{t+\varDelta }\) can be computed in a time that is polynomial in the number of arguments of the argumentation graph \(G^{t+\varDelta }\) heading at \(t+\varDelta \).
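A Monte-Carlo simulation of such a Markov chain can be sketched as follows; the two states and the transition matrix are illustrative, and the argued transition (drawing a labelling of \(G^{t+\varDelta }\)) is abstracted as a draw from \(P(s' \mid s)\):

```python
import random

# Illustrative two-state chain; in the framework each transition would be
# 'argued' through an argumentation graph, abstracted here as P(s' | s).
STATES = ["safe", "danger"]
TRANS = {"safe":   {"safe": 0.8, "danger": 0.2},
         "danger": {"safe": 0.4, "danger": 0.6}}

def step(state, rng):
    """Draw the next state from the transition distribution."""
    r, acc = rng.random(), 0.0
    for nxt, p in TRANS[state].items():
        acc += p
        if r < acc:
            return nxt
    return nxt  # guard against floating-point rounding

def simulate(start, n_steps, seed=0):
    rng = random.Random(seed)
    trajectory = [start]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1], rng))
    return trajectory

traj = simulate("safe", 10)
print(traj[0], len(traj))   # safe 11
```
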
Theorem 5.2
(Time complexity of an argument-based transition) Let \(\smash {\mathrm {K}^{t},} \langle G^{t+\varDelta }, \mathscr {S}, \langle \varOmega , F, P \rangle \rangle \smash {\rightarrow \mathrm {K}^{t+\varDelta }}\) denote an argument-based transition, where \(G^{t+\varDelta }\) is constructed from a probabilistic defeasible theory. Given the state \(\smash {\mathrm {K}^{t}}\), the time complexity of the argument-based transition is \(\smash {O(|Rul| \cdot |\mathscr {A}_G|^2 + |\mathscr {A}_G|^c)}\).
Proof
To draw a state, we have three steps. In the first step, we draw a \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of \(\smash {G^{t+\varDelta }}\) built from the probabilistic theory \(\langle Rul, \textit{Conflict}, \succ \rangle \); this step is \(O(|Rul| \cdot |\mathscr {A}_G|^2)\), see Lemma 5.1. In the second step, we compute the grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling \(\mathrm {L}^{t+\varDelta }\) of the \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)-labelling of \(\smash {G^{t+\varDelta }}\); this step is \(O(|\mathscr {A}_G|^c)\), see Lemma 4.1. In the third step, we compute the labelling \(\smash {\mathrm {K}^{t+\varDelta }}\) of a set \(\varPhi \) of statements from the labelling \(\mathrm {L}^{t+\varDelta }\), i.e. \(\mathrm {K}(\mathrm {L}^{t+\varDelta }, \varPhi ) = \mathrm {K}^{t+\varDelta } \); this step is \(\smash {O(|\varPhi | \times |{\small {\textsf {IN}}}(\mathrm {L}^{t+\varDelta })|)}\), see Lemma 4.2. Therefore, the time complexity of an argument-based transition is \(\smash {O(|Rul| \cdot |\mathscr {A}_G|^2 + |\mathscr {A}_G|^c)}\). \(\square \)
Given a time sequence \(0, \varDelta , \ldots , n\cdot \varDelta \), we have a sequence of transition theories \(\smash {T^0, T^\varDelta ,}\smash {\ldots , T^{n\cdot \varDelta }}\) and thus a sequence of probabilistic labelling frames \(\smash {\langle G^0, \mathscr {S}, \langle \varOmega , F, P^0 \rangle \rangle ,} \langle G^\varDelta , \mathscr {S}, \langle \varOmega , F, P^\varDelta \rangle \rangle , \ldots , \langle G^{n\cdot \varDelta }, \mathscr {S}, \langle \varOmega , F, P^{n\cdot \varDelta } \rangle \rangle \). Every probability distribution \(\smash {P^t}\) is parametrised by a set of factors \(\smash {\varPhi ^{t} }\), and for the sake of simplicity we posit that the probabilistic dependences amongst template arguments remain unchanged over time. So for any template argumentation graph \(\smash {G^t}\), the scope of factors remains unchanged. However, since we aim at capturing learning agents, we will see in the next section that the values of factors may change over time as they are updated by RL.
In summary, we set up a logic-based and probabilistic-based memoryless argument-based transition framework to compute states ‘step by step’, so as to avoid computing all time steps’ labellings at once. By doing so, we have established an argument-based representation of Markov chains. This representation is our basis to build argument-based MDPs and reinforcement learning agents in the next section.
6 Argument-based Markov decision processes and learning agents
In this section, we show how an MDP setting for RL can be captured in our PA framework. We first specify argument-based constructs towards argument-based MDPs (Sect. 6.1), and possible factorisations of probability distributions (Sect. 6.2). Then we articulate these constructs to build argument-based MDPs and RL agents (Sect. 6.3).
6.1 Environment and agent representation
\(\textsf {Hold}_{\textsf {obj}} \varphi \mathrm {\,at\,} t\): It holds, from an objective point of view, that \(\varphi \) at time t.
\(\textsf {Hold}_i \varphi \mathrm {\,at\,} t\): It holds, from the point of view of agent i, that \(\varphi \) at time t.
\(\textsf {Des}_i \varphi \mathrm {\,at\,} t\): The agent i desires \(\varphi \) at time t.
\(\textsf {Obl}_{i} \varphi \mathrm {\,at\,} t\): It is obligatory for agent i to do \(\varphi \) at time t.
\(\textsf {Do}_i \varphi \mathrm {\,at\,} t\): The agent i attempts to do \(\varphi \) at time t.
Epistemic information is indicated by the modalities \(\textsf {Hold}_{\textsf {obj}}\) and \(\textsf {Hold}_i\). The subscript \(\textsf {obj}\) indicates that an expression objectively holds, i.e. that it is the case (rather than being merely believed by an agent). We may say that \(\textsf {obj}\) embodies the objective point of view, in that it only accepts what is ‘true’.
Obligations and desires, expressed by the modalities \(\textsf {Obl}_i\) and \(\textsf {Des}_i\) respectively, are here to illustrate the ability of our argument-based framework, as we will see later, to explicitly investigate qualitative features, and their interactions, that are often left implicit in common MDP settings. Desires (and the reasoning leading to them) determine what is desirable or undesirable, and to what extent an agent wants to reach or avoid it (assuming that the environment does not interfere), while compliance with or infringement of an obligation determines environmental interferences such as punitive sanctions. In this paper, obligations and desires are not meant to be learnt through reinforcement. In that regard, we do not exploit here the contention that an agent desires to perform an action in light of its superior utility, but we remark that the proposed framework may be extended to host such considerations.
Actions, expressed by the modality \(\textsf {Do}_i\), are not necessarily successful, because agents operate in a probabilistic non-monotonic framework and their behaviour is governed by defeasible arguments. In this framework, we do not cater for intentions. Thus, \((\textsf {Do}_i \varphi \mathrm {\,at\,} t)\) stands for i’s attempt to do \(\varphi \) at time t regardless of any intention. The content \(\varphi \) of the action \((\textsf {Do}_i \varphi \mathrm {\,at\,} t)\) may characterise a feature of a state, e.g. we may write \((\textsf {Do}_i \mathrm{safe}\mathrm {\,at\,} t)\), or it may characterise an action, e.g. we may write \((\textsf {Do}_i \mathrm{care}\mathrm {\,at\,} t)\) where ‘\(\mathrm{care}\)’ denotes a careful action eventually leading to a safe state. Actions \((\textsf {Do}_i \varphi \mathrm {\,at\,} t)\) are distinct from abstract atomic actions as conceived in MDPs in Sect. 3. MDP actions are reappraised later as ‘attitudes’, which will allow us to provide more fine-grained characterisations of agency, for example to characterise the absence or omission of any actions.
Since we wish to model RL agents, for which reinforcement signals are essential, we also assume sanction statements of the form \(\textsf {Hold}_i \mathrm{util}(u, \alpha )\) to indicate a scalar utility u received by agent i, where \(\alpha \) is a unique identifier distinguishing the utility amongst others (we may omit the argument \(\alpha \) to avoid overloading the presentation). So if the statement (\(\textsf {Hold}_i \mathrm{util}(10) \mathrm {\,at\,} t\)) happens to be labelled \(\textsf {in}\), then the agent receives the reward 10 at time t. Such an expression of sanctions within the argumentation setting will allow us to build complex utility functions shaping the mental attitudes of an agent in interaction with a (normative) environment. For example, we may experiment with the case where the overall utility of something is the sum of its intrinsic utilities as well as extrinsic utilities resulting from positive and negative environmental interferences, such as sanctions from the infringement of an obligation.
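Such an aggregation of utilities can be sketched as follows; the pair-based rendering of in-labelled sanction statements is an assumption for illustration, not the paper's formal syntax:

```python
# Sketch: aggregating the in-labelled sanction statements of the form
# Hold_i util(u, alpha); each one is modelled as a (u, alpha) pair.
def total_utility(in_utils):
    """Overall utility = sum of the utilities of statements labelled in."""
    return sum(u for u, alpha in in_utils)

# e.g. a hypothetical intrinsic reward plus a sanction for an infringement:
in_utils = [(2, "intrinsic"), (-12, "sanction")]
print(total_utility(in_utils))   # -10
```
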

we build two different theories and each theory contains the rules associated with an agent i or the environment, or

we build only one theory and we prefix every rule to indicate ‘where’ the rule holds (i.e. in the mind of the agent or in the wild).

one defeasible theory, denoted \(T_{\textsf {obj}},\) representing the environment; and

one defeasible theory, denoted \(T_{i},\) representing the agent i.
Example 6.1

an environment theory \(\textsf {T}_{\textsf {obj}} = \langle Rul_{\textsf {obj}}, \textit{Conflict} _{\textsf {obj}}, \emptyset \rangle \), and

an agent theory \(\textsf {T}_{i} = \langle Rul_{i}, \textit{Conflict} _{i}, \emptyset \rangle \).
 The set \(Rul_{\textsf {obj}}\) exactly contains the environment rules below, with their informal meaning. When no accident occurs at time t then the state is safe at time t, otherwise it is dangerous:$$\begin{aligned}&\textsf {s}^t_{\textsf {obj}}:\, \sim \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{safe}\,\mathrm {\,at\,} \, t \end{aligned}$$(6.1)$$\begin{aligned}&\textsf {d}^t_{\textsf {obj}}: \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{danger}\,\mathrm {\,at\,} \, t \end{aligned}$$(6.2)Each action leads to a reward specified as follows:$$\begin{aligned} \textsf {outc}^{t+1}_{\textsf {obj}}: \textsf {Do}_i \mathrm{care}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_i \mathrm{util}(1)\,\mathrm {\,at\,} \, t+1 \end{aligned}$$(6.3)$$\begin{aligned} \textsf {outn}^{t+1}_{\textsf {obj}}: \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_i \mathrm{util}(2)\,\mathrm {\,at\,} \, t+1 \end{aligned}$$(6.4)These two rules have a consequent of the form \((\textsf {Hold}_{i} \mathrm{util}(u)\,\mathrm {\,at\,} \, t)\), thus the rewards hold from the point of view of agent i, but the rules defining them hold in the environment theory: it holds, from an objective point of view, that some utility holds from the point of view of agent i at time t. 
Unfortunately, an accident may occur:$$\begin{aligned}&\textsf {accc}^{t+1}_{\textsf {obj}}: \textsf {Do}_{i} \mathrm{care}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t +1 \end{aligned}$$(6.5)$$\begin{aligned}&\textsf {accsn}^{t+1}_{\textsf {obj}} : \textsf {Hold}_{\textsf {obj}} \mathrm{safe}\,\mathrm {\,at\,} \, t , \textsf {Do}_{i} \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t +1 \end{aligned}$$(6.6)$$\begin{aligned}&\textsf {accdn}^{t+1}_{\textsf {obj}} : \textsf {Hold}_{\textsf {obj}} \mathrm{danger}\,\mathrm {\,at\,} \, t , \textsf {Do}_{i} \mathrm{neglect}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t+1 \end{aligned}$$(6.7)and when an accident occurs then the agent is eventually harmed:$$\begin{aligned} \textsf {outacc}^{t}_{\textsf {obj}}: \textsf {Hold}_{\textsf {obj}} \mathrm{accident}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{i} \mathrm{util}(-12)\,\mathrm {\,at\,} \, t \end{aligned}$$(6.8)
 Concerning the set of agent rules \(Rul_i\), we assume that the agent has complete information, so when the state is safe (dangerous resp.) then the agent holds that it is safe (dangerous resp.):$$\begin{aligned}&\textsf {s}^t_{i}: \textsf {Hold}_{\textsf {obj}} \mathrm{safe}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{i} \mathrm{safe}\,\mathrm {\,at\,} \, t \end{aligned}$$(6.9)$$\begin{aligned}&\textsf {d}^t_{i}: \textsf {Hold}_{\textsf {obj}} \mathrm{danger}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Hold}_{i} \mathrm{danger}\,\mathrm {\,at\,} \, t \end{aligned}$$(6.10)Whatever the state, the agent can act with care or with negligence:$$\begin{aligned}&\textsf {sc}^t_{i}: \textsf {Hold}_i \mathrm{safe}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{care}\,\mathrm {\,at\,} \, t \end{aligned}$$(6.11)$$\begin{aligned}&\textsf {sn}^t_{i}: \textsf {Hold}_i \mathrm{safe}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \end{aligned}$$(6.12)$$\begin{aligned}&\textsf {dc}^t_{i}: \textsf {Hold}_i \mathrm{danger}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{care}\,\mathrm {\,at\,} \, t \end{aligned}$$(6.13)$$\begin{aligned}&\textsf {dn}^t_{i}: \textsf {Hold}_i \mathrm{danger}\,\mathrm {\,at\,} \, t \Rightarrow \textsf {Do}_i \mathrm{neglect}\,\mathrm {\,at\,} \, t \end{aligned}$$(6.14)

every rule in an environment theory heading at time t has a consequent holding at time t and each antecedent holds at time t or \(t-\varDelta \).

every rule in an agent theory at time t has a consequent holding at time t, because an agent behaves at time t (in the MDP setting, an agent performs an action \(a_t\)) on the basis of a state at the same time t (denoted \(s_t\) in the MDP setting).
Definition 6.1
We assume that rules leading to a reinforcement signal (\(\textsf {Hold}_i \mathrm{util}(u) \mathrm {\,at\,} t\)) are part of the environment theory. Arguably, the utilities could be computed from a dedicated theory specifying both intrinsic and extrinsic utilities, but for the sake of simplicity, these reinforcement signals will be computed within the scope of the environment theory. By doing so, the utility of an agent’s attitude will be computed when making a transition from one state to another.
Besides the environment theory heading at time t, an agent behaves at time t, in line with the agent theory at time t.
Definition 6.2

an environment probabilistic labelling frame \(\smash {\langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle }\) where the argumentation graph \(G^{t}_\textsf {obj}\) is built from the environment theory \(\smash {T^{t}_\textsf {obj}}\) heading at time t;

an agent probabilistic labelling frame \(\smash {\langle G^{t}_i, \mathscr {S}, (\varOmega _i, F_i, P_i) \rangle }\) where the argumentation graph \(G^{t}_i\) is built from the agent theory \(T^{t}_i\) at time t.

a labelling from the environment theory at time t: this labelling is the state of the environment at time t, and we may just call it the state at time t,

a labelling from the agent theory at time t: this labelling is the state of the agent at time t, and we call it the attitude of the agent at time t.
Definition 6.3

An argument environment state at time t (often denoted \(\mathrm {L}^t_\textsf {obj}\)) is a labelling of the set of arguments \(\mathscr {A}^t \subseteq \mathscr {A}_G\) whose conclusions hold at time t, and such that every argument is labelled as within a labelling \(\mathrm {L}_\textsf {obj}\) in the sample space \(\varOmega _\textsf {obj}\), i.e. \(\forall A \in \mathscr {A}^t, \mathrm {L}^t_\textsf {obj}(A) = \mathrm {L}_\textsf {obj}(A)\).

A statement environment state at time t (often denoted \(\mathrm {K}^t_\textsf {obj}\)) is a labelling of statements at time t from an argument labelling of the sample space \(\varOmega _\textsf {obj}\).
Definition 6.4

An argument attitude of agent i at time t (often denoted \(\mathrm {L}^t_i\)) is an argument labelling in the sample space \(\varOmega _i\).

A statement attitude of agent i at time t (often denoted \(\mathrm {K}^t_i\)) is a labelling of statements from an argument attitude of agent i at time t.
Definition 6.5
(Agent observable statement) Let \(\langle G^{t}_i, \mathscr {S}, (\varOmega _i, F_i, P_i) \rangle \) be an agent probabilistic labelling frame, where the argumentation graph \(G^{t}_i\) is built from an agent theory \(T^{t}_i\) at time t. A statement is an agent observable statement if, and only if, it is a statement of the form \((\textsf {Hold}_{\textsf {obj}}\varphi \mathrm {\,at\,} t)\) and it is the conclusion of an argument in the set of arguments of \(G^{t}_i\).
Given any grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling \(\mathrm {L}_{i}\) of graph \(G^{t}_i\), an observable state is the set of agent observable statements labelled \(\textsf {in}\) within \(\mathrm {L}_{i}\). Since we focus on MDPs instead of POMDPs, an agent will always fully observe its environment state. Therefore the number of observable states equals the number of states in the underlying MDP.
Following folk psychology, the expression of an attitude is called a behaviour, which is the labelling of every action of an attitude.
Definition 6.6
Thus a behaviour may be such that no actions are performed, and in this case we may say that the agent is inhibited. Given an agent behaviour, actions performed through this behaviour are defined next.
Definition 6.7
(Agent action) Let \(\mathrm {B}^t_i\) be a behaviour of agent i at time t, the set of actions of agent i at time t, denoted \(\mathrm {A}^t_i\), is the set of actions labelled \(\textsf {in}\) within the behaviour \(\mathrm {B}^t_i\), i.e. \(\mathrm {A}^t_i = \textsf {in}(\mathrm {B}^t_i)\).
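The extraction of performed actions from a behaviour can be sketched as follows, with a behaviour rendered as a dictionary from (illustrative) action statements to \(\textsf {in}\)/\(\textsf {no}\) labels:

```python
# Sketch: a behaviour as a bivalent {in, no}-labelling of the action
# statements of an attitude; the statement strings are illustrative
# renderings of Do_i ... at t, not the paper's formal syntax.
behaviour = {"Do_i care at 0": "in",
             "Do_i neglect at 0": "no"}

def actions(behaviour):
    """A^t_i = in(B^t_i): the set of actions labelled in."""
    return {stmt for stmt, label in behaviour.items() if label == "in"}

print(actions(behaviour))   # {'Do_i care at 0'}
```
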
Example 6.2
Suppose the template agent theory \(\textsf {T}_i = \langle Rul_{i}, \textit{Conflict} _{i}, \emptyset \rangle \) given in Example 6.1, and let us build from it the template argumentation graph describing the agent i, as shown in Fig. 7.

Arguments \(\textsf {S}_{\textsf {obj}}^0\) and \(\textsf {D}_{\textsf {obj}}^0\) are assumptive arguments supporting practical arguments leading to careful or negligent actions.
 The set \(\mathrm {O}^0_i\) of observable statements at time 0 (that can be labelled \(\textsf {on}\) or \(\textsf {off}\) with respect to the agent’s state) is the following:$$\begin{aligned} \begin{aligned} \mathrm {O}^0_i = \{ ( \textsf {Hold}_\textsf {obj}\mathrm{safe}\mathrm {\,at\,} 0), (\textsf {Hold}_\textsf {obj}\mathrm{danger}\mathrm {\,at\,} 0)\}. \end{aligned} \end{aligned}$$(6.18)
 An argument attitude \(\mathrm {L}^0_i\) of the agent i at time 0 may be the following (as \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling):$$\begin{aligned} \mathrm {L}^0_i = \langle \{ \textsf {S}^0_{\textsf {obj}}, \textsf {S}^0, \textsf {SC}^0 \}, \{ \textsf {SN}^0, \textsf {D}^0_{\textsf {obj}}, \textsf {D}^0, \textsf {DC}^0, \textsf {DN}^0 \} \rangle . \end{aligned}$$(6.19)
 The corresponding statement attitude \(\mathrm {K}^0_i\) is as follows:$$\begin{aligned} \begin{aligned} \mathrm {K}^0_i = \langle&\{ (\textsf {Hold}_{\textsf {obj}} \mathrm{safe}\mathrm {\,at\,} 0), (\textsf {Hold}_{i} \mathrm{safe}\mathrm {\,at\,} 0), (\textsf {Do}_{i} {\mathrm{care}} \mathrm {\,at\,} 0) \}, \\&\{ (\textsf {Do}_i\mathrm{neglect}\mathrm {\,at\,} 0), (\textsf {Hold}_{\textsf {obj}}\mathrm{danger}\mathrm {\,at\,} 0), (\textsf {Hold}_i \mathrm{danger}\mathrm {\,at\,} 0) \} \rangle . \end{aligned} \end{aligned}$$(6.20)
 The behaviour \(\mathrm {B}^0_i\) is as follows:$$\begin{aligned} \begin{aligned} \mathrm {B}^0_i = \langle \{ (\textsf {Do}_{i} \mathrm{care}\mathrm {\,at\,} 0) \}, \{ (\textsf {Do}_i\mathrm{neglect}\mathrm {\,at\,} 0) \} \rangle . \end{aligned} \end{aligned}$$(6.21)

The set of actions is \(\mathrm {A}^0_i = \{ (\textsf {Do}_{i} \mathrm{care}\mathrm {\,at\,} 0) \}\). \(\square \)
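The chain from statement attitude to behaviour to actions in this example (Definitions 6.6 and 6.7, Eqs. 6.20 and 6.21) can be sketched as follows. The plain-string statement names are an illustrative transliteration of the paper's notation, not part of the formalism.

```python
def behaviour(attitude, action_statements):
    """Restrict a statement attitude <in-set, out-set> to action statements."""
    in_set, out_set = attitude
    return (in_set & action_statements, out_set & action_statements)

def actions(b):
    """Agent actions: the actions labelled `in` within the behaviour (Def. 6.7)."""
    in_set, _ = b
    return in_set

# Statement attitude K^0_i of Eq. (6.20), as a pair of sets of statement names.
K0 = ({"Hold_obj safe at 0", "Hold_i safe at 0", "Do_i care at 0"},
      {"Do_i neglect at 0", "Hold_obj danger at 0", "Hold_i danger at 0"})

# The action statements of agent i at time 0.
action_stmts = {"Do_i care at 0", "Do_i neglect at 0"}

B0 = behaviour(K0, action_stmts)  # behaviour of Eq. (6.21)
A0 = actions(B0)                  # actions performed at time 0
```

Running the sketch reproduces the example: the behaviour keeps only the two action statements, and the single performed action is the careful one.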
To recap, we have defined argumentbased constructs towards argumentbased MDPs. From an environment theory, we build an environment probabilistic labelling frame, where any labelling of the sample space is defined as an environment state. Similarly, from an agent theory, we have an agent probabilistic labelling frame, where any labelling of the sample space is defined as an agent attitude. We have shown how behaviours and actions can be characterised in such a setting. As the probability distributions have been left unspecified, possible factorisations of environment and agent probabilistic labelling frames are discussed next.
6.2 Environment and agent factorisation
A major difficulty in using PA regards its computational complexity. To address this difficulty, we have proposed factors and probabilistic rules in Sect. 4. We now show how factorisations can be specified by means of probabilistic defeasible rules to describe an agent and its environment.

the set of unreinforceable factors whose energy values cannot be changed;

the set of reinforceable factors whose energy values can be changed by the agent.
Notation 6.1
A reinforceable factor is represented by the symbol \(\otimes \) (as a controllable steering wheel) instead of the usual notation \(\phi \), whereas an unreinforceable factor is specified by the symbol \(\odot \) (as an uncontrollable steering wheel).
Example 6.3
View on the assignments of the reinforceable factor \(\otimes \) conditioned on the safe state
\(L_{\textsf {S}^0}: \textsf {Hold}_i \mathrm{safe}\mathrm {\,at\,} 0\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\) 
\(L_{\textsf {SC}^0}: \textsf {Do}_i \mathrm{care}\mathrm {\,at\,} 0\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {UND}}}\)  \({\small {\textsf {OFF}}}\) 
\(L_{\textsf {SN}^0}: \textsf {Do}_i \mathrm{neglect}\mathrm {\,at\,} 0\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {UND}}}\)  \({\small {\textsf {OFF}}}\) 
To obtain compact representations of an agent and its environment, the factorisation with reinforceable and unreinforceable factors may be achieved in various ways. In the following, we propose a factorisation along with probabilistic rules.
Definition 6.8
(Environment probabilistic defeasible theory) An environment probabilistic defeasible theory is a probabilistic defeasible theory \(T = \langle Rul, \textit{Conflict}, \succ \rangle \) where every probabilistic defeasible rule in \(Rul\) is unreinforceable.
By contrast, an agent probabilistic defeasible theory is a probabilistic defeasible theory where a probabilistic rule may be reinforceable. If all the rules of an agent theory are unreinforceable, then the agent is a nonlearning agent.
Example 6.4
Let us reappraise the rules given in Example 6.1. They can be transformed into probabilistic defeasible rules as follows.
In practice, given a probabilistic theory where any rule is either unreinforceable or reinforceable, we can first draw the reinforceable rules, yielding a theory in which every rule is either drawn or unreinforceable. In any case, arguments can be unreinforceable or reinforceable.
Definition 6.9
(Reinforceable argument) An argument is reinforceable if, and only if, at least one of its rules is reinforceable.
Lemma 6.1
(Unreinforceable argument) An argument is unreinforceable if, and only if, all its rules are unreinforceable.
Corollary 6.1
An argument is reinforceable if, and only if, at least one of its subarguments is reinforceable.
Corollary 6.2
An argument is unreinforceable if, and only if, all its subarguments are unreinforceable.
Definition 6.10
(Reinforceable factor) A factor is a reinforceable factor if, and only if, its scope includes a random labelling of a reinforceable argument.
Lemma 6.2
(Unreinforceable factor) A factor is an unreinforceable factor if, and only if, all the random labellings in its scope are random labellings of unreinforceable arguments.
Definition 6.11
(Reinforceable assignment) An assignment of random labellings is reinforceable if, and only if, at least one random labelling is the random labelling of a reinforceable argument.
Lemma 6.3
(Unreinforceable assignment) An assignment of random labellings is unreinforceable if, and only if, every random labelling is the random labelling of an unreinforceable argument.
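The propagation of the reinforceable character from rules to arguments and factors (Definitions 6.9 and 6.10, Lemmas 6.1 and 6.2) can be sketched as follows, over a hypothetical toy theory in which only the rule `r_care` is reinforceable; the rule and argument names are illustrative, not from the paper.

```python
def argument_reinforceable(arg_rules, reinforceable_rules):
    """Def. 6.9 / Lemma 6.1: an argument is reinforceable iff at least one
    of its rules is reinforceable."""
    return any(r in reinforceable_rules for r in arg_rules)

def factor_reinforceable(scope, arguments, reinforceable_rules):
    """Def. 6.10 / Lemma 6.2: a factor is reinforceable iff its scope includes
    the random labelling of a reinforceable argument."""
    return any(argument_reinforceable(arguments[a], reinforceable_rules)
               for a in scope)

# Hypothetical agent theory: only the rule 'r_care' is reinforceable.
reinforceable_rules = {"r_care"}
arguments = {
    "SC0": {"r_safe", "r_care"},  # practical argument built with r_care
    "S0":  {"r_safe"},            # argument built from unreinforceable rules only
}
```

For instance, a factor scoping the labelling of `SC0` is reinforceable, while one scoping only `S0` is not.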
On the basis of an environment probabilistic defeasible theory (Definition 6.8), where all rules are unreinforceable, we can consider an environment probabilistic labelling frame.
Definition 6.12
(Environment probabilistic labelling frame) Given an environment probabilistic defeasible theory \(\smash {T^{t}_\textsf {obj}}\) heading at time t, the environment probabilistic labelling frame heading at time t is a probabilistic labelling frame \(\smash {\langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle }\) where the argumentation graph \(G^{t}_\textsf {obj}\) is built from the environment theory \(\smash {T^{t}_\textsf {obj}}\) heading at time t.
Note that the factorisation of the distribution of an environment probabilistic labelling frame is left unspecified. It can be any factorisation, as long as it reflects the marginal probability of unreinforceable rules. In practice, we can and will work with a factorisation such that unreinforceable rules are independent. Hence, we will first draw unreinforceable rules independently, and from these rules, we will build environment arguments.
Lemma 6.4
(Environment argument) Let \(\langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle \) denote an environment probabilistic labelling frame. Every argument in \(\mathscr {A}_G\) is unreinforceable.
Proof
By definition, the argumentation graph \(G^{t}_\textsf {obj}\) is built from an environment probabilistic defeasible theory, let us denote it \(\smash {T^{t}_\textsf {obj}}\). By definition, every rule of \(\smash {T^{t}_\textsf {obj}}\) is unreinforceable (Definition 6.8). Consequently, all the rules of every argument in \(\mathscr {A}_G\) are unreinforceable. Therefore, every argument in \(\mathscr {A}_G\) is unreinforceable (Lemma 6.1). \(\square \)
Proposition 6.1
(Environment factor) Given an environment probabilistic labelling frame \(\langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle \), the distribution \(P_\textsf {obj}\) is parametrised by a set of factors \(\varPhi \) where every factor in \(\varPhi \) is unreinforceable.
Proof
Every argument in \(\mathscr {A}_G\) is unreinforceable (Lemma 6.4). Consequently, every factor in \(\varPhi \) only scopes random labellings of unreinforceable arguments. Therefore, every factor in \(\varPhi \) is unreinforceable (Lemma 6.2). \(\square \)
Proposition 6.2
(Environment assignment) Given an environment probabilistic labelling frame \(\langle G^{t}_\textsf {obj}, \mathscr {S}, \langle \varOmega _\textsf {obj}, F_\textsf {obj}, P_\textsf {obj} \rangle \rangle \), every assignment of random labellings is unreinforceable.
Proof
Every argument in \(\mathscr {A}_G\) is unreinforceable (Lemma 6.4). Consequently, every assignment of random labellings of arguments in \(\mathscr {A}_G\) is unreinforceable. Therefore, every assignment is unreinforceable (Lemma 6.3). \(\square \)
Lemma 6.4 and Propositions 6.1 and 6.2 show the unreinforceable character of environment probabilistic labelling frames. By contrast, an agent may be reinforceable. Manifold factorisations may be proposed. A simple factorisation holds in practical probabilistic labelling frames.
Definition 6.13
(Practical probabilistic labelling frame) A practical probabilistic labelling frame of agent i at time t is a probabilistic labelling frame \(\langle G^{t}_i, \mathscr {S}, \langle \varOmega _i, F_i, P_i \rangle \rangle \) where:

the argumentation graph \(G^{t}_i\) is built from the agent theory \(T^{t}_i\) at time t;

the distribution \(P_i\) is a Gibbs distribution such that all random labellings of reinforceable arguments and all their direct subarguments in \(\mathscr {A}_G\) are the scope of one and only one reinforceable factor.
The reinforceable factor of a practical distribution may be further broken down. For example, we can have one reinforceable factor per observable state. We leave further factorisation for future developments, since it is not essential for our present purposes.
To recap, we have shown how an environment can be described by an environment probabilistic labelling frame built from a probabilistic defeasible theory, while an agent is described by an agent probabilistic labelling frame built from another probabilistic defeasible theory. All the arguments of an environment probabilistic labelling frame are unreinforceable, while some arguments of an agent probabilistic labelling frame may be reinforceable. If a factor scopes a reinforceable argument, then this factor is a reinforceable factor; otherwise it is unreinforceable. Values of unreinforceable factors remain unchanged, while a RL agent may change the values of (reinforceable) assignments of reinforceable factors to adapt to its environment, as we will see next.
6.3 Animating reinforcement learning agents
Having assumed that an agent and its environment can be characterised by an agent probabilistic labelling frame and an environment probabilistic labelling frame respectively, we are now prepared to reformulate the MDP setting of (SARSA) RL agents into an argumentbased MDP setting for argumentbased (SARSA) RL agents on the basis of our PA framework. Whilst a traditional RL agent performs an action drawn from a set of possible actions, we propose that an agent performs an action justified in an attitude drawn from a set of possible attitudes, i.e. mental states. In this view, we move from a behavioural approach of agent modelling to a mentalistic approach. In the remainder, we first propose argumentbased MDPs formalised in our PA setting, and then we show how to animate an argumentbased RL agent in such MDPs.
In an argumentbased MDP, once an agent has observed the state \(s_t\) and has performed an argumentbased deliberation leading to an attitude \(\mathrm {K}^t_{i}\), and thus to actions \(\mathrm {A}^{t}_{i}\), we can draw the next state \(s_{t+\varDelta }\). By doing so, we have an argumentbased MDP transition, which is a development of an argumentbased transition, see Definition 5.8.
Definition 6.14

\(\langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \) denote the environment probabilistic labelling frame built from the environment theory \(T^{t+\varDelta }_{\textsf {obj}}\) heading at time \(t+\varDelta \),

\({Assum}^{t}_\textsf {obj}\) the assumptions at time t, \({\mathrm {Assum}}_\textsf {obj}^{t} = \{(\square \varphi \mathrm {\,at\,} t)\mid (\square \varphi \mathrm {\,at\,} t) \in {\mathrm {Assum}}(T^{t+\varDelta }_\textsf {obj}) \}\),

\(\varPhi _{\textsf {obj}}\) and \(\varPhi _{i}\) two sets of statements,

\(\mathscr {K}(\varPhi ^t_{\textsf {obj}})\) the set of environment states at time t,

\(\mathscr {K}(\varPhi ^{t+\varDelta }_{\textsf {obj}})\) the set of environment states at time \(t+\varDelta \),

\(\mathscr {K}(\varPhi ^{t}_{i})\) the set of statement attitudes of agent i at time t,

\(s_t = \mathrm {K}^t_{\textsf {obj}}\) an environment state at time t, \(\mathrm {K}^t_{\textsf {obj}}\in \mathscr {K}(\varPhi ^t_{\textsf {obj}}) \),

\({s_{t+\varDelta } = \mathrm {K}^{t+\varDelta }_{\textsf {obj}}}\) an environment state at time \(t+\varDelta \), \({\mathrm {K}^{t+\varDelta }_{\textsf {obj}}\in \mathscr {K}(\varPhi ^{t+\varDelta }_{\textsf {obj}})}\),

\(\mathrm {K}^t_{i}\) an attitude of agent i at time t, \(\mathrm {K}^t_{i}\in \mathscr {K}(\varPhi _i^t) \).
Then an argumentbased MDP transition is a transition \(Tr = \mathrm {K}^{t}_{\textsf {obj}}, \mathrm {K}^t_{i}, \langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \rightarrow \mathrm {K}^{t+\varDelta }_{\textsf {obj}}\).
In words, the transition probability of moving from a state \(s_t\) to \(s_{t+\varDelta }\) on the basis of the attitude \(\mathrm {K}^t_{i}\) (i.e. \(P(s_{t+\varDelta } \mid s_t, \mathrm {K}^t_{i})\)) is the probability distribution \(P_\textsf {obj}\) over the set \(\smash {\mathbf {L}_\textsf {obj}^{t+\varDelta }}\) of random labellings of arguments in \(\smash {G^{t+\varDelta }_\textsf {obj}}\), conditioned on the labellings \(\mathrm {K}^t_{\textsf {obj}}\) and \(\mathrm {K}^t_{i}\), i.e. the random labelling of any assumptive argument of any statement in \(\textsf {in}(\mathrm {K}^t_{\textsf {obj}})\) and \(\textsf {in}(\mathrm {K}^t_{ i})\) is assigned the value \({\small {\textsf {ON}}}\), otherwise it is assigned the value \({\small {\textsf {OFF}}}\).
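The conditioning just described can be illustrated with a toy sketch: assumptive labellings are clamped to ON or OFF according to the current state and attitude, and the next state is drawn from the resulting conditional Gibbs distribution. The energy function, statement names and values below are hypothetical, not taken from the paper.

```python
import math

def transition_probability(next_states, clamp, energy):
    """P(s' | clamped assumptions): a Gibbs distribution, restricted to the
    labellings consistent with the clamped {ON, OFF} assignment, normalised
    over the possible next states."""
    weights = {s: math.exp(-energy(clamp, s)) for s in next_states}
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}

# Hypothetical energy: staying safe is cheap when 'care' is ON, costly otherwise.
def energy(clamp, next_state):
    care_on = clamp["Do_i care at 0"]
    if next_state == "safe":
        return 0.0 if care_on else 2.0
    return 2.0 if care_on else 0.0

# Clamp the assumptive labellings: True stands for ON, False for OFF.
clamp = {"Hold_obj safe at 0": True, "Do_i care at 0": True}
p = transition_probability(["safe", "danger"], clamp, energy)
```

With the careful action ON, the sketch makes staying safe the more probable transition, as one would expect from the running example.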
An argumentbased MDP transition is timedependent, in the sense that timestamps are included in the representation of the states (since a temporal modal literal includes a timestamp). However, in the definition of a basic MDP, the representation of a state is not meant to include any timestamp (though a state can be associated with a timestamp). To address this incongruity, we can conceive an atemporalised argumentbased MDP transition where states are atemporalised.
Definition 6.15

\(Tr = \mathrm {K}^{t}_{\textsf {obj}}, \mathrm {K}^t_{i}, \langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \rightarrow \mathrm {K}^{t+\varDelta }_{\textsf {obj}}\) denote an argumentbased MDP transition,

\(\mathrm {K}_{\textsf {obj}}\) the atemporalised state of \(\mathrm {K}^{t}_{\textsf {obj}}\),

\(\mathrm {K}'_{\textsf {obj}}\) the atemporalised state of \(\mathrm {K}^{t+\varDelta }_{\textsf {obj}}\),

\(\mathrm {K}_{i}\) the atemporalised state of \(\mathrm {K}^{t}_{i}\).
In the definition of an argumentbased MDP transition, the probability distribution \(P(s_{t+\varDelta } \mid s_{t}, \mathrm {K}^t_{i})\) is conditioned on the attitudes of the agent instead of its actions. By doing so, we can build utility functions taking into account mental statements such as particular beliefs and desires. For example, and as we illustrate later, an agent may get an extra ‘selfreward’ if its desires are satisfied.
Definition 6.16
Now that we have defined argumentbased MDP transitions and argumentbased rewards, we are prepared to propose a definition of argumentbased MDPs, which echoes standard MDPs (Definition 3.1), so that an MDP can be specified with our PA framework in terms of a sequence \(\mathfrak {F}_t\) of probabilistic labelling frames (which is possibly specified by a template probabilistic argumentation frame) and thus associated states \(S = \mathscr {K}(\mathfrak {F}_t)\) (see Eq. 6.17).
Definition 6.17
(Argumentbased MDP) An argumentbased MDP is a tuple \(\langle S, A, P, R \rangle \) where:

S is the set of atemporalised states of the sequence \(\mathfrak {F}_t\), i.e. \(S = \mathscr {K}(\mathfrak {F}_t)\);

A is the set of atemporalised attitudes;

\(P(s_{t+\varDelta } \mid s_{t}, \mathrm {K}_{i})\) is the transition probability of the atemporalised argumentbased MDP transition from the atemporalised state \(s_t\) to \(s_{t+\varDelta }\) by adopting attitude \(\mathrm {K}_{i} \in A\);

\(R(s_{t+\varDelta } \mid s_{t}, \mathrm {K}_{i})\) is the immediate argumentbased reward \(r_t\) received when attitude \(\mathrm {K}_{i}\) is adopted in state \(s_t\), moving to state \(s_{t + \varDelta }\) .
When running an argumentbased MDP, e.g. in simulations, it is interesting to note that, given an environment state \(\mathrm {K}^{t}_{\textsf {obj}}\) and an agent attitude \(\mathrm {K}^t_{i}\), an argumentbased MDP transition \(\smash {\mathrm {K}^{t}_{\textsf {obj}}, \mathrm {K}^t_{i}, \langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \rightarrow \mathrm {K}^{t+\varDelta }_{\textsf {obj}}}\) can be computed in a time that is polynomial in the number of arguments of G.
Theorem 6.1
(Time complexity of an argumentbased MDP transition) Let \(Tr = \mathrm {K}^{t}_{\textsf {obj}}, \mathrm {K}^t_{i}, \langle G^{t+\varDelta }_{\textsf {obj}}, \mathscr {S}, \langle \varOmega _{\textsf {obj}}, F_{\textsf {obj}}, P_{\textsf {obj}} \rangle \rangle \rightarrow \mathrm {K}^{t+\varDelta }_{\textsf {obj}}\) denote an argumentbased transition, where \(G^{t+\varDelta }_{\textsf {obj}}\) is constructed from an environment probabilistic defeasible theory. Given the environment state \(\mathrm {K}^{t}_{\textsf {obj}}\) and the agent attitude \(\mathrm {K}^t_{i}\), the time complexity of the argumentbased transition is \(\smash {O(|\mathscr {A}_G|^c)}\).
Proof
The proof is similar to the proof of Theorem 5.2. To draw a state, we proceed in three steps. In the first step, we draw a \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling of \(\smash {G^{t+\varDelta }_{\textsf {obj}}}\) built from the environment probabilistic theory \(\langle Rul, \textit{Conflict}, \succ \rangle \); this step is \(O(|Rul|\cdot |\mathscr {A}_G|^2)\), see Lemma 5.1. In the second step, we compute the grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling \(\mathrm {L}^{t+\varDelta }\) of the \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling of \(\smash {G^{t+\varDelta }_{\textsf {obj}}}\); this step is \(O(|\mathscr {A}_G|^c)\), see Lemma 4.1. In the third step, we compute the labelling \(\smash {\mathrm {K}^{t+\varDelta }_{\textsf {obj}}}\) of a set \(\varPhi \) of statements from the labelling \(\mathrm {L}^{t+\varDelta }\), i.e. \(\mathrm {K}(\mathrm {L}^{t+\varDelta }, \varPhi ) = \mathrm {K}^{t+\varDelta }_{\textsf {obj}} \); this step is \(\smash {O(|\varPhi | \times |{\small {\textsf {IN}}}(\mathrm {L}^{t+\varDelta })|)}\), see Lemma 4.2. Therefore, the time complexity of an argumentbased transition is \(\smash {O(|Rul|\cdot |\mathscr {A}_G|^2 + |\mathscr {A}_G|^c)}\). \(\square \)
To navigate in an argumentbased MDP, we consider a simple argumentbased RL agent: first, the agent observes the state \(s_t\) at time t, then, using an argumentbased policy, it draws an attitude eventually leading to an MDP action \(a_t\).
Definition 6.18

\(s_t = \mathrm {K}^t_{\textsf {obj}}\) denote the environment state at time t,

\(\langle G^{t}_i, \mathscr {S}, \langle \varOmega _i, F_i, P_i \rangle \rangle \) the agent i’s probabilistic labelling frame built from the agent theory \(T_i^{t}\) at time t,

\(O^t_i\) the agent i’s set of observable statements at time t.
An argumentbased policy of agent i at time t is a probability distribution \(\pi ( \mathbf {L}_i^{t} \mid s_t )\) over the random labellings of the graph \(G^{t}_i\), conditioned on the state \(s_t\).
If an agent at time t observes a state \(s_t\), and draws an attitude from an argumentbased policy eventually leading to an action \(a_t\), then this agent makes an argumentbased deliberation.
Definition 6.19
(Argumentbased deliberation) Let \(\smash {s_t = \mathrm {K}^t_{\textsf {obj}}}\) denote an environment state at time t. An argumentbased deliberation is the draw of an attitude \(\mathrm {L}^{t}_i\) from the argumentbased policy \(\pi ( \mathbf {L}_i^{t} \mid s_t )\) such that \(\mathbf {l}^{t}_i \sim \pi ( \mathbf {L}_i^{t} \mid \mathbf {k}^t_{\textsf {obj}}).\)
Concerning the computational complexity of an argumentbased deliberation, since such a deliberation is a draw amongst the legal \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labellings of an argumentation graph \(G^{t}_i\), and since we have to keep in memory the energy associated with every labelling, the space complexity may not be polynomial in the number of arguments. To address such complexity, compact representations may be considered, but we leave this for future work. We may also discard some labellings of the sample space by associating them with an infinite energy. In particular, the \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling \(\mathrm {L}^{t}_i\) of an attitude may be set with an infinite energy (so that it has probability 0 of being drawn) when it corresponds to an ‘inconvenient’ grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling counterpart (denoted \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)\(\mathrm {L}^{t}_i\)). Accordingly, we may compute the statuses of some arguments of the \(\{{\small {\textsf {ON}}}, {\small {\textsf {OFF}}}\}\)labelling of an attitude \(\mathrm {L}^{t}_i\) with respect to its grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)labelling counterpart to check whether such an attitude is reinforceable. This option is further investigated in Sect. 7.
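The infinite-energy device can be sketched as follows: a Gibbs policy \(\pi (L \mid s) \propto \exp (-E(L))\) assigns probability 0 to any attitude whose energy is infinite. The attitude names and energy values below are hypothetical.

```python
import math
import random

def deliberate(attitudes, energy, rng):
    """Sample one attitude from pi(L | s), proportional to exp(-E(L));
    an infinite energy yields a zero weight, hence probability 0."""
    weights = [math.exp(-energy[a]) for a in attitudes]
    return rng.choices(attitudes, weights=weights)[0]

# Hypothetical energies: 'inhibited' is discarded via an infinite energy.
energy = {"attend-care": 0.0, "attend-neglect": 1.0, "inhibited": math.inf}
rng = random.Random(0)
draws = [deliberate(list(energy), energy, rng) for _ in range(1000)]
```

Over the 1000 draws, the discarded attitude never appears, and the lower-energy attitude dominates.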
From the attitude as a labelling \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)\(\mathrm {L}^{t}_i\) of arguments, we derive the grounded \(\{\textsf {in}, \textsf {no}\}\)labelling \(\mathrm {K}^{t}_{i}\) of statements, which is the statement attitude of the agent at t, and from this attitude, we have a behaviour \(\mathrm {B}^{t}_{i}\), and thus the actions \(\mathrm {A}^{t}_{i}\) performed at t (see Definition 6.7).
Once the argumentbased reward is computed, the agent reinforces the attitude which led to this reward. In traditional Qlearning algorithms, such as SARSA, the Qvalue of the pair \((s_t, a_t)\) is updated. In our setting, the traditional Qvalue of the pair \((s_t, a_t)\), i.e. \(Q(s_t, a_t)\), is replaced by the energy of the reinforceable labelling assignment \(\mathbf {l}^t_{\otimes }\) in the assignment \(\mathbf {l}_i^t\) corresponding to the attitude \(\mathrm {L}_i^t\), i.e. \(\mathbf {l}^t_{\otimes }=\mathbf {l}_i^t(\mathbf {L}^t)\), where \(\mathbf {L}^t\) is the set of random labellings scoped by the reinforceable factor \(\otimes (\mathbf {L}^t)\) (which is unique, see Definition 6.13).
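A minimal sketch of this replacement, assuming the sign convention that probabilities are proportional to exp(-E), so that the energy plays the role of the negated Qvalue and a rewarded attitude has its energy decreased. The state and attitude names, and the learning parameters, are hypothetical.

```python
def sarsa_energy_update(E, s, l, r, s2, l2, alpha=0.1, gamma=0.9):
    """SARSA-style update where the energy of the reinforceable assignment
    (s, l) stands in for the tabular Q-value Q(s, a)."""
    q, q2 = -E[(s, l)], -E[(s2, l2)]   # recover Q-values from energies
    q += alpha * (r + gamma * q2 - q)  # standard SARSA target
    E[(s, l)] = -q                     # store back as an energy

# Hypothetical energy table over (state, reinforceable assignment) pairs.
E = {("safe", "care"): 0.0, ("safe", "neglect"): 0.0}
sarsa_energy_update(E, "safe", "care", r=1.0, s2="safe", l2="care")
```

After a positive reward, the careful attitude's energy has decreased, making it more probable under the Gibbs policy, while the untouched entry is unchanged.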
Definition 6.20
As an agent’s attitude is a labelling of arguments or ‘mental’ statements, we call this approach the argumentbased mentalistic approach to RL. Since the approach is based on a probabilistic energybased argumentation framework, we may characterise an argumentbased RL agent from a logicbased perspective or an energybased perspective, as we will see in the next section.
7 Agent characterisation
In the previous section, we proposed to model and animate RL agents based on a PA framework. In this section, we take advantage of the PA framework to characterise agent profiles from a logicbased perspective (Sect. 7.1), and from a probabilistic perspective (Sect. 7.2), before illustrating such characterisations (Sect. 7.3).
7.1 Logicbased characterisation
When investigating human and social agency, it is common to study how certain cognitive or dispositional profiles influence agents’ behaviour. In experimental economics, for example, this is almost standard whenever scholars analyse cognitive profiles such as riskaversion (a type of disposition) using agentbased simulations (cf., e.g. [14]), or study how sundry cognitive emotions, states, or conceptions of fairness affect agents’ choices (cf., e.g. [26]).
As to the investigations of models for computer systems, a cognitive characterisation of agents’ profiles may be used to model artificial agents with a specific character affecting humanagent interactions, with applications for example in sociotechnical systems, gaming industry or for serious games. In sociotechnical systems for instance, a designer may prefer to focus on regimented agents compliant with the governing norms, while a game designer may seek to confront a player with an agent whose desires always override obligations.
Logicbased characterisations of agents’ profiles are related in the literature to the idea of agent type. This idea, from a qualitative perspective, has been introduced and extensively studied in [13, 27]. In those works, agent types were proposed first of all to resolve conflict between mental statements, such as desires and obligations. In other words, agent types were characterised by stating conflict resolution types in terms of orders of overruling between rules. For example, an agent is realistic when rules for beliefs override all other components; she is social when obligations are stronger than the other motivational components with the exception of beliefs, etc.
If we focus on agent types as conflictresolution methods with respect to normative matters, we may state at design time that any agents’ deliberation is governed in such a way that obligations always prevail over, e.g., conflicting desires or actions, this being done with the purpose of exploring how the other components of agents evolve over time. Since agents’ behaviour dynamically depends on what mental states and conclusions are obtained from defeasible theories, enforcing obligations means imposing normative compliance by design, as for regimented agents.
However, one may rather be interested in exploring other strategies for implementing compliance. The regimentation of agents with complete information makes RL mechanisms useless for enforcing norms, but if the agents are not regimented then RL can be used to obtain compliance. Indeed, RL can be implemented as a dynamic model for enforcing norms, because there are good reasons to consider primarily the enforcement strategy where normative violations are possible but mechanisms are implemented “for enforcing obligations [...] by means of sanctions, that is, a treatment of the actions to be performed when a violation occurs, in order to deter agents from misbehaving” [22, sec. 1]. In this perspective, the idea of agent type (if applied to the relation between obligations and motivational attitudes) is no longer useful for imposing compliance by design, but it can be employed for interpreting and classifying the agents’ behaviour resulting from social interactions. Accordingly, we distinguish two notions of agent type.

Static agent types: As proposed in [13, 27], the notion of agent type is applied by design to agents and it is used to solve conflicts using preferences between arguments. For example, if obligations prevail over desires, then the agent at stake is social by design, i.e. it has an innate inclination to comply with obligations.

Dynamic agent types: This notion of agent type is applied to the dynamics of attitudes. It is based on the idea that an agent can acquire new cognitive profiles. For example, if obligations are accepted while desires are discarded for some time and the agent is not social by design, then the agent is nevertheless social at that time.
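The dynamic notion can be sketched as follows: inspecting a statement attitude at time t, we may call the agent social at t when the accepted obligations override the conflicting desires, even if the agent is not social by design. The statement encoding is hypothetical.

```python
def socially_typed_at(attitude, conflicts):
    """Return 'social' if, for the conflicting (obligation, desire) pairs,
    obligations are accepted while desires are discarded; 'deviant' for the
    converse; None when the attitude is mixed or silent on the conflicts."""
    in_set, out_set = attitude
    verdicts = set()
    for obl, des in conflicts:
        if obl in in_set and des in out_set:
            verdicts.add("social")
        elif des in in_set and obl in out_set:
            verdicts.add("deviant")
    return verdicts.pop() if len(verdicts) == 1 else None

# Hypothetical conflicting obligation/desire pair and attitude at time 0.
conflicts = [("Obl_i drive-slowly at 0", "Des_i drive-fast at 0")]
K = ({"Obl_i drive-slowly at 0"}, {"Des_i drive-fast at 0"})
```

Here the agent counts as social at time 0; reversing the labelling would make it deviant at that time.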
Firstly, we identify conflicts amongst modalities from conflicts amongst statements.
Definition 7.1
(Conflict relation over modalities) Let \(Mod \) denote a set of modalities. \(\square \text{ }Conflict \) is a binary relation over \(Mod \), i.e. \( \square \text{ }Conflict \subseteq Mod \times Mod \), such that modality \(\square _1\) conflicts with \(\square _2\), i.e. \( \square \text{ }Conflict (\square _1, \square _2)\), if, and only if, for any literals \(\varphi _1\) and \(\varphi _2\) in conflict, the statements \((\square _1 \varphi _1 \mathrm {\,at\,} t)\) and \((\square _2 \varphi _2 \mathrm {\,at\,} t)\) conflict, i.e. if, and only if, \(\forall \varphi _1, \varphi _2\) such that \(\textit{Conflict} ((\square \varphi _1 \mathrm {\,at\,} t), (\square \varphi _2 \mathrm {\,at\,} t))\) and \(\textit{Conflict} ((\square \varphi _2 \mathrm {\,at\,} t), (\square \varphi _1 \mathrm {\,at\,} t))\), it holds that \(\textit{Conflict} ((\square _1 \varphi _1 \mathrm {\,at\,} t), (\square _2 \varphi _2 \mathrm {\,at\,} t))\) and \(\textit{Conflict} ((\square _2 \varphi _2 \mathrm {\,at\,} t), (\square _1 \varphi _1 \mathrm {\,at\,} t))\).
Secondly, from conflicts between modalities, we define conditions according to which a modality overrides another modality.
Definition 7.2
(Override relation) Modality \(\Box _1\) overrides modality \(\Box _2\) if, and only if, for any rules r and s with conclusions \((\Box _1 \varphi _1 \mathrm {\,at\,} t)\) and \((\Box _2 \varphi _2 \mathrm {\,at\,} t)\) respectively:

the modalities \(\Box _1\) and \(\Box _2\) conflict, i.e. \(\square \text{ }Conflict (\Box _1, \Box _2)\), and

the rule r is superior to rule s, i.e. \(r\succ s\).
Thirdly, from the conflict and override relations amongst modalities, we define static agent types. Static agent types ensure by design that agents solve conflicts between mental statements in a specific way. The tables in Definition 7.3 show the possible types, and should be read as follows. Each table arrays the potential conflicts between two types of modality: the first table concerns the pair \((\textsf {Obl}_i, \textsf {Des}_i)\), the second the pair \((\textsf {Obl}_i, \textsf {Do}_i)\), and so on. The first column from the left in each table indicates the conflict between the modalities, the second column concerns how the conflict is solved by design, and the third column provides some possible names for the corresponding agent type.
Definition 7.3
(Static agent type) Given an agent theory \(\langle Rul, \textit{Conflict}, \succ \rangle \) describing an agent, this agent is of (static) type X if, and only if, the conditions for agent type X hold as established in the table shown below, where ‘S’ means ‘static’.
Conflict  \({\square _1}\succ _S{\square _2}\)  Agent type X 

\(\textsf {Obl}_i, \textsf {Des}_i\)  
–  –  Smotivationaldeontic independent 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Des}_i)\)  \({\textsf {Obl}_i}\succ _S{\textsf {Des}_i}\)  Smotivationaldeontic compliant 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Des}_i)\)  –  Smotivationaldeontic aporetic 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Des}_i)\)  \({\textsf {Des}_i}\succ _S{\textsf {Obl}_i}\)  Smotivationaldeontic deviant 
\(\textsf {Obl}_i, \textsf {Do}_i\)  
–  –  Spracticaldeontic independent 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Do}_i)\)  \({\textsf {Obl}_i}\succ _S{\textsf {Do}_i}\)  Spracticaldeontic compliant 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Do}_i)\)  –  Spracticaldeontic aporetic 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Do}_i)\)  \({\textsf {Do}_i}\succ _S{\textsf {Obl}_i}\)  Spracticaldeontic deviant 
\(\textsf {Obl}_i, \textsf {Hold}_i\)  
–  –  Sepistemicdeontic independent 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Hold}_i)\)  \({\textsf {Obl}_i}\succ _S{\textsf {Hold}_i}\)  Sepistemicdeontic compliant 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Hold}_i)\)  –  Sepistemicdeontic aporetic 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Hold}_i)\)  \({\textsf {Hold}_i}\succ _S{\textsf {Obl}_i}\)  Sepistemicdeontic deviant 
\(\textsf {Des}_i, \textsf {Do}_i\)  
–  –  Spracticalmotivational independent 
\(\square \text{ }Conflict (\textsf {Des}_i, \textsf {Do}_i)\)  \({\textsf {Des}_i}\succ _S{\textsf {Do}_i}\)  Spracticalmotivational compliant 
\(\square \text{ }Conflict (\textsf {Des}_i, \textsf {Do}_i)\)  –  Spracticalmotivational aporetic 
\(\square \text{ }Conflict (\textsf {Des}_i, \textsf {Do}_i)\)  \({\textsf {Do}_i}\succ _S{\textsf {Des}_i}\)  Spracticalmotivational deviant 
\(\textsf {Des}_i, \textsf {Hold}_i\)  
–  –  Sepistemicmotivational independent 
\(\square \text{ }Conflict (\textsf {Des}_i, \textsf {Hold}_i)\)  \({\textsf {Des}_i}\succ _S{\textsf {Hold}_i}\)  Sepistemicmotivational compliant 
\(\square \text{ }Conflict (\textsf {Des}_i, \textsf {Hold}_i)\)  –  Sepistemicmotivational aporetic 
\(\square \text{ }Conflict (\textsf {Des}_i, \textsf {Hold}_i)\)  \({\textsf {Hold}_i}\succ _S{\textsf {Des}_i}\)  Sepistemicmotivational deviant 
\(\textsf {Do}_i, \textsf {Hold}_i\)  
–  –  Sepistemicpractical independent 
\(\square \text{ }Conflict (\textsf {Do}_i, \textsf {Hold}_i)\)  \({\textsf {Do}_i}\succ _S{\textsf {Hold}_i}\)  Sepistemicpractical compliant 
\(\square \text{ }Conflict (\textsf {Do}_i, \textsf {Hold}_i)\)  –  Sepistemicpractical aporetic 
\(\square \text{ }Conflict (\textsf {Do}_i, \textsf {Hold}_i)\)  \({\textsf {Hold}_i}\succ _S{\textsf {Do}_i}\)  Sepistemicpractical deviant 
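The classification in the tables above can be sketched for one modality pair; the other tables are analogous. The abbreviated modality names are for illustration only.

```python
def static_type(m1, m2, conflicts, superior):
    """Classify the pair (m1, m2) following the table of Definition 7.3:
    independent / compliant / aporetic / deviant."""
    if (m1, m2) not in conflicts:
        return "independent"      # no conflict between the modalities
    if superior.get((m1, m2)):
        return "compliant"        # m1 statically overrides m2
    if superior.get((m2, m1)):
        return "deviant"          # m2 statically overrides m1
    return "aporetic"             # conflict declared but left unresolved

# Hypothetical conflict relation over the pair (Obl, Des).
conflicts = {("Obl", "Des")}
```

For instance, with obligations superior to desires the agent is compliant, and with no declared conflict it is independent.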
Let us provide some brief comments. S-X-Y independent agents are free to adopt any mental statements, since there are no conflicts. S-X-deontic compliant agents are social: here we obtain normative compliance by design, meaning that obligations always override conflicting desires, actions and even beliefs. Conversely, we have S-X-deontic deviant agents, which must be distinguished from the S-X-deontic aporetic ones, for which conflicts between obligations and other modalities are simply left unresolved. In the case of S-epistemic-Y compliant agents, beliefs are defeated by other mental modalities, and thus we have classical examples of wishful (deontic or motivational) thinking.
The value of agent types depends on how they are employed, but also on the computational setting in which they are used. In argument-based MDPs, it is important to note that an agent observes a state at time t, has desires and performs an action at the same time t. We have constrained the logic framework in this regard. For example, an agent may behave with respect to practical arguments using rules of the form “\(\textsf {Hold}_i \varphi _1 \mathrm {\,at\,} t \Rightarrow \textsf {Des}_i \varphi _2\mathrm {\,at\,} t\)” or “\(\textsf {Hold}_i \varphi _1 \mathrm {\,at\,} t \Rightarrow \textsf {Do}_i \varphi _2 \mathrm {\,at\,} t\)”, where \(\varphi _1\) and \(\varphi _2\) conflict. Such arguments would typically be considered by an agent to change its world, for example to go from a dangerous state to a safe state. If there is a conflict between \(\textsf {Hold}_i\) and \(\textsf {Des}_i\), or between \(\textsf {Hold}_i\) and \(\textsf {Do}_i\), then arguments built from the above rules self-attack. Consequently, the agent may not be able to have the desire \((\textsf {Des}_i \varphi _2 \mathrm {\,at\,} t)\) or attempt the action \((\textsf {Do}_i \varphi _2 \mathrm {\,at\,} t)\), and hence the agent may be impeded in moving to some states. Conflicts between the modalities \(\textsf {Hold}_i\) and \(\textsf {Des}_i\), or between \(\textsf {Hold}_i\) and \(\textsf {Do}_i\), should thus be avoided to obtain meaningful argument-based MDPs. This discussion emphasises that the operational interpretation of the notion of conflict between modalities depends greatly on the computational setting.
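To make this caveat concrete, here is a small sketch, in hypothetical Python with our own rule and conflict encoding (not the paper's syntax), that flags rules of the above form whose antecedent and consequent conflict, and whose arguments would therefore self-attack:

```python
# Hypothetical sketch: flag rules "Hold_i phi1 at t => Des_i/Do_i phi2 at t"
# whose arguments would self-attack because phi1 and phi2 conflict.
# The tuple encoding of rules and conflicts is our own illustration.

def self_attacking_rules(rules, conflicts):
    """rules: list of (antecedent_literal, consequent_modality, consequent_literal);
    conflicts: set of frozensets of conflicting literals."""
    flagged = []
    for (phi1, modality, phi2) in rules:
        # A declared conflict between Hold_i phi1 and Des_i/Do_i phi2
        # makes the argument built from this rule attack itself.
        if frozenset((phi1, phi2)) in conflicts and modality in ("Des", "Do"):
            flagged.append((phi1, modality, phi2))
    return flagged

rules = [("dangerous", "Do", "safe"),   # Hold_i dangerous => Do_i safe
         ("safe", "Des", "safe")]       # antecedent and consequent agree
conflicts = {frozenset(("dangerous", "safe"))}
print(self_attacking_rules(rules, conflicts))  # [('dangerous', 'Do', 'safe')]
```

A designer could run such a check over a ground agent theory before animating it, to ensure the argument-based MDP remains meaningful.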
Definition 7.4
(Dynamic agent type) Given a ground agent theory \(\langle Rul, \textit{Conflict}, \succ \rangle \) describing an agent, the agent is of (dynamic) type X at time t if, and only if, the relation \(\textit{Conflict} \) and the agent’s attitude at time t satisfy the conditions established in the table below, where \(\varphi _1\) and \(\varphi _2\) conflict with each other and ‘D’ means ‘dynamic’.
Conflict  Attitude  Agent type X  

\(\textsf {Obl}_i, \textsf {Des}_i\)  
\(K_{\textsf {Obl}_i\varphi _1 }\)  \(K_{\textsf {Des}_i \varphi _2 }\)  
–  \(\textsf {in}\)  \(\textsf {in}\)  D-motivational-deontic independent 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Des}_i)\)  \(\textsf {in}\)  \(\textsf {no}\)  D-motivational-deontic compliant 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Des}_i)\)  \(\textsf {no}\)  \(\textsf {in}\)  D-motivational-deontic deviant 
\(\textsf {Obl}_i, \textsf {Do}_i\)  
\(K_{\textsf {Obl}_i\varphi _1 }\)  \(K_{\textsf {Do}_i\varphi _2 }\)  
–  \(\textsf {in}\)  \(\textsf {in}\)  D-practical-deontic independent 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Do}_i)\)  \(\textsf {in}\)  \(\textsf {no}\)  D-practical-deontic compliant 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Do}_i)\)  \(\textsf {no}\)  \(\textsf {in}\)  D-practical-deontic deviant 
\(\textsf {Obl}_i, \textsf {Hold}_i\)  
\(K_{\textsf {Obl}_i\varphi _1 }\)  \(K_{\textsf {Hold}_i\varphi _2 }\)  
–  \(\textsf {in}\)  \(\textsf {in}\)  D-epistemic-deontic independent 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Hold}_i)\)  \(\textsf {in}\)  \(\textsf {no}\)  D-epistemic-deontic compliant 
\(\square \text{ }Conflict (\textsf {Obl}_i, \textsf {Hold}_i)\)  \(\textsf {no}\)  \(\textsf {in}\)  D-epistemic-deontic deviant 
\(\textsf {Des}_i, \textsf {Do}_i\)  
\(K_{\textsf {Des}_i\varphi _1 }\)  \(K_{\textsf {Do}_i\varphi _2 }\)  
–  \(\textsf {in}\)  \(\textsf {in}\)  D-practical-motivational independent 
\(\square \text{ }Conflict (\textsf {Des}_i, \textsf {Do}_i)\)  \(\textsf {in}\)  \(\textsf {no}\)  D-practical-motivational compliant 
\(\square \text{ }Conflict (\textsf {Des}_i, \textsf {Do}_i)\)  \(\textsf {no}\)  \(\textsf {in}\)  D-practical-motivational deviant 
\(\textsf {Des}_i, \textsf {Hold}_i\)  
\(K_{\textsf {Des}_i\varphi _1 }\)  \(K_{\textsf {Hold}_i\varphi _2 }\)  
–  \(\textsf {in}\)  \(\textsf {in}\)  D-epistemic-motivational independent 
\(\square \text{ }Conflict (\textsf {Des}_i, \textsf {Hold}_i)\)  \(\textsf {in}\)  \(\textsf {no}\)  D-epistemic-motivational compliant 
\(\square \text{ }Conflict (\textsf {Des}_i, \textsf {Hold}_i)\)  \(\textsf {no}\)  \(\textsf {in}\)  D-epistemic-motivational deviant 
\(\textsf {Do}_i, \textsf {Hold}_i\)  
\(K_{\textsf {Do}_i\varphi _1 }\)  \(K_{\textsf {Hold}_i\varphi _2 }\)  
–  \(\textsf {in}\)  \(\textsf {in}\)  D-epistemic-practical independent 
\(\square \text{ }Conflict (\textsf {Do}_i, \textsf {Hold}_i)\)  \(\textsf {in}\)  \(\textsf {no}\)  D-epistemic-practical compliant 
\(\square \text{ }Conflict (\textsf {Do}_i, \textsf {Hold}_i)\)  \(\textsf {no}\)  \(\textsf {in}\)  D-epistemic-practical deviant 
We remark that the table in Definition 7.4 does not cater for cases where both conflicting statements are labelled \(\textsf {no}\), as it is not clear to us how to interpret such cases meaningfully in the absence of any mental element labelled \(\textsf {in}\).
The dynamic type of an agent cannot influence any static type of the agent, whereas a static type can influence the dynamic type of the agent. For example, a S-motivational-deontic deviant agent may tend to become D-motivational-deontic deviant, but not necessarily, since arguments for desires can be undercut. Formal investigation of the influence of static types on dynamic types is left to future research.
In summary, agents can have a logic-based characterisation in terms of static types and dynamic types. Static types are fixed at design time and can influence dynamic types at run time, but not vice versa. We exemplify such a logic-based characterisation later, in our illustration of the overall framework in Sect. 7.3.
7.2 Probabilistic characterisation
Since any argument-based RL agent may adapt its attitudes, any agent can be characterised from a probabilistic perspective by a distribution of dynamic types. For example, in a particular environment state, an agent may be D-practical-deontic compliant with probability 0.1 and D-practical-deontic deviant with probability 0.9 (and thus D-practical-deontic independent with probability 0).
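A minimal sketch of such a probabilistic profile, under a hypothetical encoding of attitudes as pairs of statement labels of our own; the classification mirrors the D-practical-deontic rows of Definition 7.4, and the sample counts are illustrative only:

```python
from collections import Counter

# Hypothetical sketch: profile an agent by the empirical distribution of its
# dynamic types across sampled attitudes. classify() encodes the table of
# Definition 7.4 for the pair (Obl_i, Do_i); the encoding is our own.

def classify(obl_label, do_label, conflict):
    """Return the D-practical-deontic type of one sampled attitude."""
    if not conflict:
        return "independent"
    if obl_label == "in" and do_label == "no":
        return "compliant"
    if obl_label == "no" and do_label == "in":
        return "deviant"
    return "unclassified"   # e.g. both labelled 'no' (not catered for)

# Ten sampled attitudes in a given state, with a declared conflict:
samples = [("in", "no")] + [("no", "in")] * 9
counts = Counter(classify(o, d, conflict=True) for o, d in samples)
dist = {t: c / len(samples) for t, c in counts.items()}
print(dist)  # {'compliant': 0.1, 'deviant': 0.9}
```

Such a distribution could be estimated by sampling the agent's attitudes during a run, state by state.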
The distribution of attitudes, and thus of dynamic types, may be ‘shaped’ by specific static types. For example, an agent may not be able to adopt attitudes which are incompatible with its static type. From a probabilistic perspective, such attitudes will be taken with probability 0. Instead of using static types to profile an agent in terms of dynamic types, we may assign infinite energies to some attitudes to shape an agent in a similar manner. Nevertheless, a logic-based characterisation in terms of static types has the advantage of being a concise means to shape the distribution of attitudes.
Besides attitudes which are incompatible with some static types, other attitudes may be discarded by associating them with an infinite energy. By doing so, the number of reinforceable attitudes can be reduced, so that we can aggressively reduce the computational complexity of an argument-based deliberation (Definition 6.19).
In particular, the computational complexity may be drastically reduced by assuming that an agent can perform at most one action at a time. Accordingly, we may discard all the attitudes in which more than one reinforceable argument leading to an action \((\textsf {Do}_i \varphi \mathrm {\,at\,} t)\) is labelled \({\small {\textsf {ON}}}\) and \({\small {\textsf {IN}}}\), i.e. we may discard all the attitudes with more than one reinforceable action. If an attitude leads to no action, then it corresponds to a behaviour of inhibition.
To further reduce the complexity of a deliberation, we may also assume that only arguments supporting actions are reinforceable, i.e. arguments leading to statements of beliefs \((\textsf {Hold}_i \varphi \mathrm {\,at\,} t)\), desires \((\textsf {Des}_i \varphi \mathrm {\,at\,} t)\) and obligations \((\textsf {Obl}_i \varphi \mathrm {\,at\,} t)\) are unreinforceable, while arguments leading to actions \((\textsf {Do}_i \varphi \mathrm {\,at\,} t)\) are possibly reinforceable.
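The infinite-energy reduction can be sketched as follows, under an illustrative encoding of attitudes and energy values of our own; the Boltzmann form with a temperature is one standard way to turn energies into probabilities, and attitudes with infinite energy then receive probability 0:

```python
import math

# Hypothetical sketch: assign infinite energy to any attitude in which more
# than one action argument is labelled IN, then derive a Boltzmann
# distribution over attitudes; infinite-energy attitudes get probability 0.
# The attitude encoding and the finite energy values are illustrative.

def boltzmann(energies, tau=1.0):
    weights = [math.exp(-e / tau) for e in energies]  # exp(-inf) == 0.0
    z = sum(weights)
    return [w / z for w in weights]

attitudes = [
    {"actions_in": ["care"],            "energy": 1.0},
    {"actions_in": ["neglect"],         "energy": 2.0},
    {"actions_in": ["care", "neglect"], "energy": 0.5},  # two actions
]
# Discard multi-action attitudes by giving them infinite energy:
for a in attitudes:
    if len(a["actions_in"]) > 1:
        a["energy"] = math.inf

probs = boltzmann([a["energy"] for a in attitudes])
print(probs[2])  # 0.0: the multi-action attitude is never sampled
```

In effect, the discarded attitudes need never be enumerated during deliberation, which is where the complexity saving comes from.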
Example 7.1
However, such a factorisation leads to a computational burden due to the number of attitudes, see Table 2. For this reason, we may attach an infinite energy to every attitude in which more than one argument supporting an action is labelled \({\small {\textsf {ON}}}\) and \({\small {\textsf {IN}}}\). In this case, the table of the reinforceable factor \(\otimes \) is reduced as in Table 3. \(\square \)
View on the controlled factor \(\otimes \) conditioned on the safe state
\(S^t:\,\textsf {Hold}_{i}\mathrm{safe}\mathrm {\,at\,} t\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\) 
\(O^t:\,\textsf {Obl}_{i} \mathrm{care}\mathrm {\,at\,} t\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\) 
\(SC^t:\,\textsf {Do}_i \mathrm{care}\mathrm {\,at\,} t\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {UND}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {OFF}}}\) 
\(SN^t: \textsf {Do}_i \mathrm{neglect}\mathrm {\,at\,} t\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {UND}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {OUT}}}\)  \({\small {\textsf {OUT}}}\)  \({\small {\textsf {OFF}}}\) 
\(OC^t:\, \textsf {Do}_i\mathrm{care}\mathrm {\,at\,} t\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\) 
View on the reinforceable factor \(\otimes \) conditioned on the safe state, where assignments with infinite energy are not displayed
\(S^t:\,\textsf {Hold}_{i}\mathrm{safe}\mathrm {\,at\,} t\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\) 
\(O^t:\,\textsf {Obl}_{i} \mathrm{care}\mathrm {\,at\,} t\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {IN}}}\) 
\(SC^t:\,\textsf {Do}_i \mathrm{care}\mathrm {\,at\,} t\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {OFF}}}\) 
\(SN^t:\, \textsf {Do}_i \mathrm{neglect}\mathrm {\,at\,} t\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {IN}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {OFF}}}\) 
\(OC^t:\, \textsf {Do}_i\mathrm{care}\mathrm {\,at\,} t\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {OFF}}}\)  \({\small {\textsf {IN}}}\) 
As illustrated in Example 7.1, if the overall setting is such that every action can be supported by one and only one accepted argument, then a variant of an argument-based RL agent may boil down to a fine-grained RL agent in which state-action pairs are replaced by practical arguments alongside the events of unreinforceable arguments. We will use this type of agent for our illustrations.
7.3 Illustrations
We now illustrate the framework with a few simple experiments on the basis of the MDP pictured in Fig. 1.
We will attempt to show how the declarative and defeasible features of the framework can ease the specification of well-argued models (amongst others) of an argument-based RL agent. Each model of the agent and its environment was formally specified by a probabilistic defeasible theory. Then, each specification was executed, i.e. animated, using our argument-based SARSA algorithm (see Algorithm 5). As the framework is fully declarative, we only had to write the defeasible rules in a file and fix some parameters, such as the learning parameters and the length of a run, to obtain a batch of simulations from which we eventually computed some statistics.
For every experiment, the learning parameters used by the argument-based SARSA agent were as follows: learning rate \(\alpha = 0.1\), discount rate \(\gamma = 0.9\), and temperature \(\tau = 1\). The initial state was the safe state. For each setting, we traced the probability of careful and negligent behaviours in the safe state and the dangerous state, averaged over 100 runs, along with the absolute deviation.
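For readers unfamiliar with the learning rule, the following is a minimal sketch of a SARSA update with Boltzmann (softmax) exploration under these parameters; indexing the Q-table by (state, argument) pairs is our simplification, not the paper's Algorithm 5:

```python
import math, random

# Minimal sketch of a SARSA update with Boltzmann exploration, using the
# learning parameters of the experiments (alpha=0.1, gamma=0.9, tau=1).
# Q is indexed here by (state, argument) pairs; this simplified indexing
# is our own illustration of the argument-based variant.

ALPHA, GAMMA, TAU = 0.1, 0.9, 1.0

def boltzmann_choice(q, state, args, rng=random):
    """Sample an argument with probability proportional to exp(Q / tau)."""
    weights = [math.exp(q.get((state, a), 0.0) / TAU) for a in args]
    return rng.choices(args, weights=weights)[0]

def sarsa_update(q, s, a, reward, s_next, a_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    target = reward + GAMMA * q.get((s_next, a_next), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + ALPHA * (target - q.get((s, a), 0.0))

q = {}
sarsa_update(q, "safe", "care", 1.0, "safe", "care")
print(q[("safe", "care")])  # 0.1
choice = boltzmann_choice(q, "safe", ["care", "neglect"])
```

With \(\tau = 1\), exploration is moderate: a labelling with a higher learned utility is selected more often, but alternatives retain non-zero probability.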
Basic agent (control)
Let us consider the MDP pictured in Fig. 1 and formalised by the environment and agent probabilistic defeasible theories given in Example 6.4. We animated the theories, and the averaged probability of each action in the safe and dangerous states is shown in Fig. 11.
S-X-Y independent agent, with self-sanctioning desire 
S-X-Y independent agent, with self-sanctioning desire and obligation 
Results of the animations are given in Fig. 13. We observe that such a S-X-Y independent agent, with a self-sanctioning desire to behave with negligence, learns to behave with care, because of the enforcement of an obligation to behave with care.
S-motivational-deontic compliant agent, with self-sanctioning desire and obligation 
S-practical-deontic compliant agent, with self-sanctioning desire and obligation 
Consider now a practical application for socio-technical systems, where an artificial agent is required to be fully compliant with some regulations, that is, the agent has to be regimented.
By providing the agent with the knowledge that an obligation holds, and by ‘hardwiring’ the agent as a S-practical-deontic compliant agent, the agent was guided towards the ‘right’ decisions for the purpose of the application.
In summary, agents with different types not only learnt different behaviours, but also learnt at different speeds: for example, the behaviours of agents with a self-sanctioning desire and enforced obligations (Figs. 14 and 15) converged significantly faster than those of basic agents (Fig. 11). Since a major challenge faced by RL is how to improve learning speed, especially in large real-world applications (see Sect. 3), this observation suggests that learning speed can be significantly improved by modelling an agent with logic frameworks such as PA. This observation is consistent with those in [24, 25], as the argumentation-accelerated RL framework proposed in those works can be viewed as a special case of our present proposal, in which all ‘applicable’ arguments have probability 1 and are labelled \({\small {\textsf {ON}}}\) (all ‘inapplicable’ arguments are \({\small {\textsf {OFF}}}\)) and the preferred or grounded labellings are used [7].
8 Conclusion
How can we model a bounded agent facing the uncertainty pertaining to the use of partial information and conflicting reasons, as well as the uncertainty pertaining to the stochastic nature of the environment and the agent’s actions? How can we build argument-based and executable specifications of such agents? And how can we combine models of agency with RL and norms?
To address these questions, we motivated and investigated a combination of PA and RL allowing us (i) to provide a rich declarative argument-based representation of MDPs, and (ii) to model an agent that can learn the utility of argument labellings from rewards, so that labellings that are more likely to lead to high rewards will have higher utility values and will be selected more often.
This computational framework allowed us to move from a behavioural account of agent modelling to an argument-based mentalistic approach where attitudes are labellings of arguments or mental statements. Interestingly, whilst argument-based frameworks for reasoning agents often propose a combination of inference mechanisms for sceptical epistemic reasoning and credulous practical reasoning, our use of the grounded \(\{{\small {\textsf {IN}}}, {\small {\textsf {OUT}}}, {\small {\textsf {UND}}}, {\small {\textsf {OFF}}}\}\)-labelling allows us to have a homogeneous inferential view covering epistemic and practical reasoning.
An advantage of this mentalistic approach is a fine-grained distinction amongst attitudes and associated behaviours. In this regard, we may characterise an agent from a logic-based perspective as well as from an energy-based and probabilistic perspective. In fact, nobody knows exactly what the dynamics of mental attitudes are. The mentalistic approach is thus meant to be a computational tool to account for and investigate hypotheses about agent attitudes in silico, with eventual inputs and validation from in vivo or in situ experiments.
A disadvantage of the argument-based mentalistic approach lies in its computational complexity. As this approach is meant to account for the reinforcement of attitudes, and since the number of attitudes can be large, it is computationally more demanding than the traditional reinforcement of state-action pairs. Due to this computational complexity, this mentalistic approach may also have a strong impact on learning dynamics. However, we showed how the computational complexity can be circumscribed by using factors and by setting an infinite energy for some attitudes. On this basis, we illustrated the framework with simple experiments to highlight the ability of the setting to represent and animate fine-grained models of agency.
As to future developments, the architecture of the framework invites extensions with respect to learning or reasoning abilities, most interestingly both. In regard to the learning features, more sophisticated learning mechanisms can be straightforwardly investigated, such as eligibility traces [54] or reward shaping [33]. The energy-based argumentation model also paves the way to interesting functionalities: for example, we may learn the factor values from real-life data, so that we can reproduce in silico an environment or the observed behaviour of an agent, and we may induce the types of an agent by observing its behaviour. In regard to reasoning, the logic framework paves the way to the development of an argument-based belief-desire-intention architecture with learning abilities. Finally, as we alluded to, we focused on the setting of MDPs as a necessary step towards more interesting computational settings. In particular, PA will be more compelling in the setting of POMDPs, where an agent partially observes its current state and makes argument-based decisions by deriving defeasible conclusions and by updating these conclusions in light of new information.
9 Key notations
Some key notations used in the paper
G  An argumentation graph 
T  A defeasible theory 
\(\textsf {ArgLabels}\)  A set of labels for arguments 
\(l\)  A label for arguments 
\(\mathrm {L}\)  An argument labelling function 
\(\mathscr {L}\)  A set of argument labellings 
\(L_A\)  The random labelling of argument A 
L  A set of random argument labellings 
\(\mathbf {l}\)  An assignment of a set of random argument labellings 
\(\textsf {LitLabels}\)  A set of labels for statements 
\(k\)  A label for statements 
\(\mathrm {K}\)  A statement labelling function 
\(\mathscr {K}\)  A set of statement labellings 
\(K_\varphi \)  The random labelling of statement \(\varphi \) 
\(\mathbf K \)  A set of random statement labellings 
\(\mathbf {k}\)  An assignment of a set of random statement labellings 
Footnotes
 1.
Though there seems to be an emerging consensus in the literature conceiving ‘undercutting’ to mean an attack on a rule and ‘undermining’ to be an attack on premises, we prefer to adopt here a terminology closer to early work on rule-based argumentation, see e.g. [41].
 2.
Recall: the set of assumptive arguments supporting a set of assumptions \({Assum}\) is denoted \({\mathrm {AssumArg}}({Assum})\), see Notation 4.4.
 3.
Recall: the set of assumptive arguments supporting a set of assumptions \({Assum}\) is denoted \({\mathrm {AssumArg}}({Assum})\), see Notation 4.4.
 4.
We use the standard notation, so for \(\mathbf {Y} \subseteq \mathbf {X}\), we use \(\mathbf {x}(\mathbf {Y})\) to refer to the assignment within \(\mathbf {x}\) to the variables in \(\mathbf {Y}\). For example, if \(\mathbf {X}=\{X1,X2,X3\}\), \(\mathbf {Y}=\{X1,X2\}\) and \(\mathbf {x}=\{X1=1,X2=2,X3=3\}\), then \(\mathbf {x}(\mathbf {Y})=\{X1=1,X2=2\}\).
Notes
Acknowledgements
We would like to thank Pietro Baroni for his insights in argumentation. This work was supported by the Marie Curie Intra-European Fellowship PIEF-GA-2012-331472.
References
1. Alexy, R. (1989). A theory of legal argumentation: The theory of rational discourse as theory of legal justification. Oxford: Clarendon.
2. Amgoud, L. (2009). Argumentation for decision making. In Argumentation in artificial intelligence (pp. 301–320). Springer.
3. Artikis, A., Sergot, M., & Pitt, J. (2009). Specifying norm-governed computational societies. ACM Transactions on Computational Logic, 10(1), 1:1–1:42.
4. Artikis, A., Sergot, M., Pitt, J., Busquets, D., & Riveret, R. (2016). Specifying and executing open multi-agent systems. In Social coordination frameworks for social technical systems (pp. 197–212). Springer.
5. Atkinson, K., Baroni, P., Giacomin, M., Hunter, A., Prakken, H., Reed, C., et al. (2017). Towards artificial argumentation. AI Magazine, 38(3), 25–36.
6. Atkinson, K., & Bench-Capon, T. J. M. (2007). Practical reasoning as presumptive argumentation using action based alternating transition systems. Artificial Intelligence, 171(10–15), 855–874.
7. Baroni, P., Caminada, M., & Giacomin, M. (2011). An introduction to argumentation semantics. The Knowledge Engineering Review, 26(4), 365–410.
8. Baroni, P., Governatori, G., & Riveret, R. (2016). On labelling statements in multi-labelling argumentation. In Proceedings of the 22nd European conference on artificial intelligence (Vol. 285, pp. 489–497). IOS Press.
9. Bellman, R. (1956). Dynamic programming and Lagrange multipliers. Proceedings of the National Academy of Sciences of the United States of America, 42(10), 767.
10. Bench-Capon, T. J. M., & Atkinson, K. (2009). Abstract argumentation and values. In I. Rahwan & G. Simari (Eds.), Argumentation in artificial intelligence. Springer.
11. Bertsekas, D. P. (1995). Dynamic programming and optimal control (Vol. 1). Belmont, MA: Athena Scientific.
12. Besnard, P., García, A. J., Hunter, A., Modgil, S., Prakken, H., Simari, G. R., et al. (2014). Introduction to structured argumentation. Argument & Computation, 5(1), 1–4.
13. Broersen, J., Dastani, M., Hulstijn, J., & van der Torre, L. (2002). Goal generation in the BOID architecture. Cognitive Science Quarterly, 2(3–4), 428–447.
14. Chen, S. H., & Huang, Y. C. (2005). Risk preference and survival dynamics. In Agent-based simulation: From modeling methodologies to real-world applications, Agent-based social systems (Vol. 1, pp. 135–143). Tokyo: Springer.
15. Conte, R., & Castelfranchi, C. (1995). Cognitive and social action. London: University College of London Press.
16. Conte, R., & Castelfranchi, C. (2006). The mental path of norms. Ratio Juris, 19, 501–517.
17. Conte, R., Falcone, R., & Sartor, G. (1999). Introduction: Agents and norms: How to fill the gap? Artificial Intelligence and Law, 7(1), 1–15.
18. Cormen, T. H., Leiserson, C. E., Rivest, R. L., Stein, C., et al. (2001). Introduction to algorithms (Vol. 2). Cambridge: MIT Press.
19. Dung, P. M. (1995). On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence, 77(2), 321–358.
20. Edmonds, B. (2004). How formal logic can fail to be useful for modelling or designing MAS. In Regulated agent-based social systems, Lecture Notes in Computer Science (Vol. 2934, pp. 1–15). Springer.
21. Fasli, M. (2004). Formal systems and agent-based social simulation equals null? Journal of Artificial Societies and Social Simulation, 7(4), 1–7.
22. Fornara, N., & Colombetti, M. (2009). Specifying and enforcing norms in artificial institutions. In Declarative agent languages and technologies VI, Lecture Notes in Computer Science (Vol. 5397, pp. 1–17). Springer.
23. Fox, J., & Parsons, S. (1997). On using arguments for reasoning about actions and values. In Proceedings of the AAAI spring symposium on qualitative preferences in deliberation and practical reasoning.
24. Gao, Y., & Toni, F. (2014). Argumentation accelerated reinforcement learning for cooperative multi-agent systems. In Proceedings of the 21st European conference on artificial intelligence (pp. 333–338). IOS Press.
25. Gao, Y., Toni, F., & Craven, R. (2012). Argumentation-based reinforcement learning for RoboCup soccer keepaway. In Proceedings of the 20th European conference on artificial intelligence (pp. 342–347). IOS Press.
26. Gaudou, B., Lorini, E., & Mayor, E. (2013). Moral guilt: An agent-based model analysis. In Advances in social simulation—Proceedings of the 9th conference of the European social simulation association (pp. 95–106).
27. Governatori, G., & Rotolo, A. (2008). BIO logical agents: Norms, beliefs, intentions in defeasible logic. Autonomous Agents and Multi-Agent Systems, 17(1), 36–69.
28. Hunter, A., & Thimm, M. (2017). Probabilistic reasoning with abstract argumentation frameworks. Journal of Artificial Intelligence Research, 59, 565–611.
29. Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques—Adaptive computation and machine learning. Cambridge: The MIT Press.
30. Kostrikin, A. I., Manin, Y. I., & Alferieff, M. E. (1997). Linear algebra and geometry. Washington, DC: Gordon and Breach Science Publishers.
31. Modgil, S., & Caminada, M. (2009). Proof theories and algorithms for abstract argumentation frameworks. In Argumentation in artificial intelligence (pp. 105–129). Springer.
32. Muller, J., & Hunter, A. (2012). An argumentation-based approach for decision making. In 24th international conference on tools with artificial intelligence (Vol. 1, pp. 564–571). IEEE.
33. Ng, A., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th international conference on machine learning (pp. 278–287).
34. Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., & Liang, E. (2006). Autonomous inverted helicopter flight via reinforcement learning. In Experimental robotics IX (pp. 363–372). Springer.
35. Oren, N. (2014). Argument schemes for normative practical reasoning (pp. 63–78). Berlin: Springer.
36. Parsons, S., & Fox, J. (1996). Argumentation and decision making: A position paper. In Practical reasoning (pp. 705–709). Springer.
37. Pattaro, E. (2005). The law and the right. In E. Pattaro (Ed.), Treatise of legal philosophy and general jurisprudence (Vol. 1). Berlin: Springer.
38. Pollock, J. L. (1995). Cognitive carpentry: A blueprint for how to build a person. Cambridge, MA: MIT Press.
39. Prakken, H. (2006). Combining sceptical epistemic reasoning with credulous practical reasoning. In Proceedings of the 1st conference on computational models of argument (pp. 311–322). IOS Press.
40. Prakken, H. (2011). An abstract framework for argumentation with structured arguments. Argument and Computation, 1(2), 93–124.
41. Prakken, H., & Sartor, G. (1997). Argument-based extended logic programming with defeasible priorities. Journal of Applied Non-Classical Logics, 7(1–2), 25–75.
42. Prakken, H., & Sartor, G. (2015). Law and logic: A review from an argumentation perspective. Artificial Intelligence, 227, 214–245.
43. Rahwan, I., & Simari, G. R. (Eds.). (2009). Argumentation in artificial intelligence. Berlin: Springer.
44. Riveret, R., Baroni, P., Gao, Y., Governatori, G., Rotolo, A., & Sartor, G. (2018). A labelling framework for probabilistic argumentation. Annals of Mathematics and Artificial Intelligence, 83(1), 21–71.
45. Riveret, R., Korkinof, D., Draief, M., & Pitt, J. V. (2015). Probabilistic abstract argumentation: An investigation with Boltzmann machines. Argument & Computation, 6(2), 178–218.
46. Riveret, R., Pitt, J. V., Korkinof, D., & Draief, M. (2015). Neuro-symbolic agents: Boltzmann machines and probabilistic abstract argumentation with sub-arguments. In Proceedings of the 14th international conference on autonomous agents and multi-agent systems (pp. 1481–1489). ACM.
47. Riveret, R., Rotolo, A., & Sartor, G. (2012). Probabilistic rule-based argumentation for norm-governed learning agents. Artificial Intelligence and Law, 20(4), 383–420.
48. Ross, A. (1958). On law and justice. London: Stevens.
49. Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical report, University of Cambridge.
50. Sartor, G. (2005). Legal reasoning: A cognitive approach to the law. Berlin: Springer.
51. Shams, Z., Vos, M. D., Oren, N., Padget, J., & Satoh, K. (2015). Argumentation-based normative practical reasoning. In Proceedings of the 3rd international workshop on theory and applications of formal argumentation, revised selected papers (pp. 226–242). Springer.
52. Simari, G. I., Shakarian, P., & Falappa, M. A. (2016). A quantitative approach to belief revision in structured probabilistic argumentation. Annals of Mathematics and Artificial Intelligence, 76(3), 375–408.
53. Stone, P., Sutton, R. S., & Kuhlmann, G. (2005). Reinforcement learning for RoboCup soccer keepaway. Adaptive Behavior, 13, 165–188.
54. Sutton, R. S., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.
55. Tadepalli, P., Givan, R., & Driessens, K. (2004). Relational reinforcement learning: An overview. In Proceedings of the ICML-04 workshop on relational reinforcement learning.
56. van der Hoek, W., Roberts, M., & Wooldridge, M. (2007). Social laws in alternating time: Effectiveness, feasibility, and synthesis. Synthese, 156(1), 1–19.