Severity-sensitive norm-governed multi-agent planning
Abstract
In making practical decisions, agents are expected to comply with ideals of behaviour, or norms. In reality, it may not be possible for an individual, or a team of agents, to be fully compliant—actual behaviour often differs from the ideal. The question we address in this paper is how we can design agents that select collective strategies to avoid more critical failures (norm violations), and mitigate the effects of violations that do occur. We model the normative requirements of a system through contrary-to-duty obligations and violation severity levels, and propose a novel multi-agent planning mechanism, based on Decentralised POMDPs, that uses a qualitative reward function to capture levels of compliance: N-Dec-POMDPs. We develop mechanisms for solving this type of multi-agent planning problem and show, through empirical analysis, that the joint policies generated are as good as those produced by existing methods, but with significant reductions in execution time.
Keywords
Norms · Multi-agent planning · Dec-POMDPs
1 Introduction
With increased automation, the need for systems to act in such a way that they are cognizant of normative expectations is critical. Norms declare ideals of behaviour, but they are inherently violable: the actual behaviour of agents may differ from the ideal. Sub-ideal behaviour may, however, be inevitable. It may not be possible for an agent (or a group of agents) to be fully compliant, given resource limitations. An agent may decide to violate a norm now in order to avoid a more serious violation in the future. A norm may be violated due to an unexpected outcome of a sequence of actions. In fact, inaction may not be sufficient to avoid the violation of a norm: the world may change into a sub-ideal state unless an agent acts. The challenge we address in this paper is how to develop effective reasoning mechanisms for agents such that they operate robustly, both individually and as a collective, under normative expectations. By robust, we mean that the agents act so that they are as compliant with the ideal as possible, given prevailing circumstances.
Any solution to this general problem must take into account uncertainties due to non-deterministic action outcomes, dependencies between and among agents with respect to actions and resources, and environmental changes that are not under their control. There are two other important considerations we take into account: there may be expectations on agents regarding how they should repair, or recover from, non-ideal states of affairs (so-called contrary-to-duty obligations); and the violation of norms may vary in severity. The former of these is widely considered to be an important characteristic of real-world domains. Prakken and Sergot [21] use an example derived from regulations about the appearance of holiday cottages to illustrate this: there must be no fence (the primary obligation); and if there is a fence, it must be white (the contrary-to-duty rule). In the case where there is a fence (the primary obligation is violated), it is the duty of the owner to ensure that it is painted white.
The idea that norms (or, strictly, the violation of norms) vary in severity is also widely recognised, but, we argue, often poorly modelled for the purposes of practical reasoning. In computational models, severity is often modelled through predefined sanctions [1]. Further, the vast majority of examples used are fines, or some other loss of utility, implying an underlying additive assumption. The argument we present against this rather simplistic approach is grounded, again, on how violations are classified in real-world domains. Distinguishing different qualitative levels of violation is an important principle in law, often referred to as “fair labelling”. According to Ashworth [3], for example, fair labelling is (in part) where “offences are subdivided and labelled so as to represent fairly the nature and magnitude of the law-breaking”. This is a principle reflected in various legal systems; e.g. misdemeanour versus felony in the US. Similarly, in security contexts, information is often classified in terms of levels (restricted, secret, etc.), representing the idea that the revelation of any document at a higher security classification is always more severe than revealing any amount of information at a lower classification. Of course, revealing any classified information is undesirable, but severity levels give tipping points of compliance. There is an important pragmatic reason that qualitative levels of violation are specified in this way: sanctions are imposed after the fact and given an assessment of the context in which the norm was violated. All we know in advance (i.e. at the point where we need to make decisions about how to act) is that violations of some norms are more/less severe than others.
Further, specifying sanctions for all norms over a single interval scale, equating to some loss of utility, can lead to additive fallacies [18], where some number of violations at a lower level of severity are taken to be as bad as, or worse than, one at a qualitatively higher level. Such fallacies would lead to poor practical decisions.
Contrary-to-duty obligations and violation severity provide complementary means to specify requirements for system robustness. The use of contrary-to-duty obligations enables us to reason about behaviour that goes some way to repair a failure. The use of severity levels enables us to reason about behaviour that avoids critical levels of failure and that minimises accumulated failures at some level.
Our starting point is a deontic logic for the specification of normative systems that may contain contrary-to-duty structures [29], along with a strict partial order over obligations that declares the relative severity of their violation. From this, we compute a preference relation over possible worlds that captures levels of system robustness (Sect. 3), which we prove to be both transitive and acyclic. A transitive and acyclic preference relation is necessary for reliable practical reasoning: with this input, an agent can compare worlds, and hence possible courses of action, for compliance with the normative specification. Next, we propose a novel model of multi-agent planning under uncertainty that is suitable for reasoning about domains with qualitative reward functions, such as those representing levels of system robustness. This multi-agent planning mechanism is grounded on Dec-POMDPs [2]: Normative Decentralised Partially Observable Markov Decision Processes, or N-Dec-POMDPs (Sect. 4). We provide an algorithm for computing joint policies that uses a sequence of linear programs that optimise against levels of robustness, iteratively introducing additional constraints at less critical levels until no additional improvement can be found (Sect. 4.2). The analysis of more/less preferred possible worlds is also exploited in the planner through a Most-Critical-States (MCS) heuristic that is used to identify belief states towards which to optimise an N-Dec-POMDP policy for more compliant behaviour in a team of agents (Sect. 4.3). We demonstrate through empirical analysis (Sect. 5) that this approach offers significant reductions in execution time (by 50% in the most challenging problem considered) for the N-Dec-POMDP solver with no loss in solution quality. Before moving on to present the two key contributions of this research (in Sects. 3 and 4), we outline a scenario that both illustrates the normative concepts that are core to the model and gives an intuition of the practical reasoning problem we address. We defer our review of related research in norm-governed and preference-based planning, and discussion of the model and possible avenues for future research, to Sect. 6.2.
The core contributions we claim of this research are twofold. First, we propose a mechanism to efficiently compute a preference relation over possible worlds from a normative specification that correctly reflects both contrary-to-duty structures and violation severity. Second, we present a novel multi-agent planning model, N-Dec-POMDPs, and an associated heuristic, MCS, that can compute effective joint plans given a qualitative reward function, such as one that represents levels of compliance derived from a normative system specification. We, therefore, contribute both to modelling and practical reasoning in normative multi-agent systems, and to algorithms for decentralised planning under uncertainty.
2 Motivating scenario
\(O_1\): A UAV must always monitor the restricted area.
\(O_2\): If no UAV is monitoring the area, a helicopter must monitor the area.
\(O_3\): If an unauthorised boat is detected, at least one agent must intercept it.
\(O_4\): If an unauthorised boat is detected and no agent intercepts it, the incursion must be reported to headquarters.
\(O_5\): UAVs should not reveal their location.
Now, suppose that \(\alpha \) is the act of intercepting the unauthorised boat, and \(\beta \) is to monitor the restricted area. If, in the initial state, the UAV intercepts the unauthorised boat (does \(\alpha \)), then, regardless of what the helicopter does, we reach state B (‘bad’ state) in which the UAV’s location is revealed. Subsequent transitions may also mean that the UAV’s location is known, or we may return to a fully compliant state (the terminal state in Fig. 2), which we summarise using the transition probability p. What if the UAV chooses to continue to conduct surveillance (action \(\beta \))? The outcome depends on the actions of the other agent. If the helicopter intercepts the unauthorised boat (joint action \(\langle \beta , \alpha \rangle \)), then all is well. If not, and the helicopter conducts surveillance, then there is some chance that the system will transition to the state W (‘worse’ state) in which the unauthorised boat is neither intercepted nor reported. Depending on the probabilities p and q in this summarised situation, the likelihood of the system entering state W and the probability of multiple violations of \(O_5\) (entering state B) may vary. What we want is a mechanism that produces plans for multiple agents such that the qualitative differences between the possible execution paths the system as a whole may take are taken into account to drive more compliant behaviour. The resulting plans need to provide guidance to agents for situations in which fully compliant behaviour is not possible. In this small example, this could occur if, for example, the helicopter is low on fuel, leaving only a choice between a path with one or more violations of \(O_5\) and a violation of \(O_3\).
3 Levels of robustness in normative system specifications
Given a normative specification, such as the one described in the previous section, we need to identify how compliant each state of affairs is so that we can guide the planning process. In essence, our aim is to compute a preference relation over possible worlds that reflects the level of compliance of those worlds with a set of norms. For example, we want worlds in which \(O_5\) is violated (B in Fig. 2) to be preferred to those in which \(O_3\) is violated (due to the severity specification), and worlds in which \(O_3\) is violated to be preferred to those in which \(O_4\) is violated (due to the contrary-to-duty structure linking \(O_3\) and \(O_4\)). We then use this preference relation to build a ranking of possible worlds, which allows the space of possible worlds to be partitioned into different severity ranges. We first present the semantics of our model and define the notions of compliance of a world with a norm and coherence between an obligation and a pair of worlds. With these definitions, we specify a transitive and acyclic preference relation, \(P_{W}\), and present a method to efficiently compute a ranking of possible worlds from this relation.
3.1 Normative system semantics

\(W = \{w_1,\ldots , w_i,\ldots , w_n \}\) is a set of n possible worlds.

\(\textit{PV}\) is a set of propositional variables, and \(\phi \), \(\phi _1\), \(\phi _2\) denote individual propositions. The set of well-formed formulae of propositional logic, F, is such that \(\textit{PV} \subset F\), and if \(p, q \in F\) then \(\lnot p \in F\), \(p \wedge q \in F\), etc.

\( VA : W \rightarrow 2^{PV}\) is a valuation function that assigns, to each world \(w \in W\), the set of propositional variables that hold true in w.

\(OS= \{ O_1 = \mathbf O (p_1 \; \vert \; q_1) ,\ldots , O_m = \mathbf O (p_m \; \vert \; q_m) \}\) is a normative specification, where \(p_i\) and \(q_i\) are two formulae in F. Intuitively, \(\mathbf O (p_i \; \vert \; q_i)\) represents a dyadic obligation to achieve (or maintain) \(p_i\) that applies to worlds in which \(q_i\) holds: an obligation to achieve \(p_i\) that is conditional on \(q_i\).

\(P_{o}\subseteq OS\times OS\) is a strict partial order over obligations that reflects the relative severity of their violation. Given two obligations \(O_i\) and \(O_j\), \(( O_i, O_j ) \in P_{o}\) (or alternatively \(O_i \succ _{o}O_j\)) means that a violation of \(O_i\) is considered more severe than one of \(O_j\). \(P_{o}\) is a transitive relation, thus, if we consider a graph G, where each node represents an obligation, and each edge a member of \(P_{o}\), we say that violating \(O_a\) is more severe than violating \(O_b\) if and only if the node representing \(O_b\) is reachable from \(O_a\) through the edges of G.

\(M \models _{w_i} \phi \text { iff } \phi \in VA (w_i)\)

\(M \models _{w_i} \lnot \phi \text { iff } \phi \not \in VA (w_i)\)

\(M \models _{w_i} \phi _1 \wedge \phi _2 \text { iff } M \models _{w_i} \phi _1 \text { and } M \models _{w_i}\phi _2\)
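The reachability reading of the severity relation \(P_{o}\) described above can be sketched as follows. This is an illustrative check only: it assumes \(P_{o}\) is stored as an adjacency mapping from each obligation to the obligations it directly dominates, and the obligation names in the example are hypothetical.

```python
def more_severe(p_o, o_a, o_b):
    """True iff the node for o_b is reachable from o_a through edges of P_o (DFS)."""
    stack, visited = [o_a], set()
    while stack:
        node = stack.pop()
        for succ in p_o.get(node, ()):
            if succ == o_b:
                return True
            if succ not in visited:
                visited.add(succ)
                stack.append(succ)
    return False

# Illustrative severity chain: Oa dominates Ob, which dominates Oc
p_o = {"Oa": ["Ob"], "Ob": ["Oc"]}
print(more_severe(p_o, "Oa", "Oc"))  # True: Oc is reachable from Oa
print(more_severe(p_o, "Oc", "Oa"))  # False
```

Because the check works over reachability, the adjacency mapping need not store the transitive closure of \(P_{o}\) explicitly.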
We now define the compliance of a world with a dyadic obligation, and the coherence of an ordered pair of worlds with respect to an obligation. These two concepts are used to define the relationship between the normative and severity specifications and the preference relation over worlds.
Definition 1
A world \(w_i\) is compliant with an obligation \(O_j = \mathbf O (p \; \vert \; q)\) if \(M \models _{w_i} \lnot q \vee p\); in other words, if the obligation does not apply to \(w_i\) (\(\lnot q\)) or the obligation is satisfied (p). We denote this by \( compliant (w_i,O_j)\).
Definition 2
A preference for world \(w_i\) over world \(w_j\) is coherent with respect to \(O_k \in OS\), written \( coherent (w_i,w_j,O_k)\), iff \( compliant (w_i,O_k)\) and \(\lnot compliant (w_j,O_k)\).
Definition 3
A preference for \(w_i\) over \(w_j\) is incoherent with respect to \(O_k \in OS\), written \( incoherent (w_i,w_j,O_k)\), iff \( compliant (w_j,O_k)\) and \(\lnot compliant (w_i,O_k)\).
This concept of (in)coherence is used in considering whether or not a pair of worlds \((w_i,w_j)\) is part of the preference relation over worlds representing their relative “ideality”, or compliance with a normative specification. Informally, the pair \((w_i,w_j)\) is coherent with obligation \(O_k\) if and only if, taking into account only compliance with \(O_k\), \(w_i\) would be preferred to \(w_j\); i.e. if \(w_i\) satisfies the obligation but \(w_j\) does not. Note that incoherence does not simply mean that \(w_i\) is not preferred to \(w_j\), but that \(w_j\) is preferred to \(w_i\); i.e. that obligation \(O_k\) is incompatible with the preference \((w_i,w_j)\). Therefore, while \( incoherent (w_i,w_j,O_k)\) implies that \( coherent (w_i,w_j,O_k)\) does not hold, the fact that \( coherent (w_i,w_j,O_k)\) is false does not imply incoherence. A pair of worlds can be neither coherent nor incoherent with an obligation; e.g. if both worlds comply with the obligation. We chose the term incoherence, rather than conflict, in order to avoid confusion with the concept of conflicts among norms.
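Definitions 1–3 can be sketched directly in code. In this illustration, a world is a set of true propositional variables and an obligation \(\mathbf O(p \,\vert\, q)\) is a pair of predicates over worlds; the lambda encodings of \(O_3\) and \(O_5\) are stand-ins for full formula satisfaction, and the world contents are taken from the worked example in the text.

```python
# Sketch of Definitions 1-3. Worlds are sets of true propositional
# variables; an obligation O(p | q) is a pair of predicates over worlds.

def compliant(world, obligation):
    p, q = obligation
    return (not q(world)) or p(world)   # Def. 1: not-q or p holds in the world

def coherent(w_i, w_j, obligation):
    # Def. 2: preferring w_i over w_j is coherent with the obligation
    return compliant(w_i, obligation) and not compliant(w_j, obligation)

def incoherent(w_i, w_j, obligation):
    # Def. 3: the obligation supports the opposite preference
    return compliant(w_j, obligation) and not compliant(w_i, obligation)

# O5 = O(not r_u | T): the UAV's location must not be revealed
O5 = (lambda w: "r_u" not in w, lambda w: True)
# O3 = O(i_u or i_b or i_h | T): at least one agent must intercept
O3 = (lambda w: bool(w & {"i_u", "i_b", "i_h"}), lambda w: True)

w3 = {"rep", "m_u"}              # worlds from the worked example in Sect. 3.1
w16 = {"m_h", "r_u", "i_u"}
print(coherent(w3, w16, O5))     # True, as stated in the text
print(coherent(w16, w3, O3))     # True, as stated in the text
```

Note that `incoherent(w3, w16, O3)` also holds, matching the observation that a pair of worlds may be coherent with one obligation and incoherent with another.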
Table 1: Norms in the harbour protection scenario

Id | Norm
\(O_1\) | \(\mathbf O (m_u \; \vert \; \top )\)
\(O_2\) | \(\mathbf O (m_h \; \vert \; \lnot m_u)\)
\(O_3\) | \(\mathbf O (i_u \ \vee \ i_b \ \vee \ i_h \; \vert \; \top )\)
\(O_4\) | \(\mathbf O ( rep \; \vert \; \lnot (i_u \ \vee \ i_b \ \vee \ i_h))\)
\(O_5\) | \(\mathbf O (\lnot r_u \; \vert \; \top )\)
There will be causal constraints on possible worlds in any domain model. In the harbour protection scenario, for example, we have \(i_u \rightarrow r_u\) (if the UAV is intercepting the unauthorised boat, its location is revealed) and \(\lnot m_h \vee \lnot i_h\) (the helicopter cannot both monitor and intercept). Possible worlds are then all the joint assignments of values for the propositional variables that satisfy these constraints. Consider, for example, the following two possible worlds: \(w_3\) (where \( rep \) and \(m_u\) are true with all other propositions false) and \(w_{16}\) (where \(m_h\), \(r_u\) and \(i_u\) are true). World \(w_3\) violates obligation \(O_3\) because none of the agents is intercepting. World \(w_{16}\) violates \(O_1\) (the UAV is not monitoring) and \(O_5\) (the UAV’s location has been revealed). This means that obligation \(O_5\) is violated in world \(w_{16}\), but not in world \(w_3\): \( coherent (w_3,w_{16},O_5)\). Similarly, \( coherent (w_{16},w_3,O_3)\) holds.^{2}
3.2 A preference relation over possible worlds
In order to illustrate this concept, and how the introduction of severity preferences affects the resulting preference order over possible worlds, consider the situation depicted in Fig. 3. We consider two obligations: \(O_1 = \mathbf O (m_u \; \vert \; \top )\), the UAV should monitor the restricted area; and \(O_3^\prime = \mathbf O (i_u \; \vert \; \top )\), a simplification of \(O_3\) in Table 1 that requires the UAV to intercept an unauthorised boat. These are enforced over two possible worlds \(w_1\) and \(w_2\) such that \(M \models _{w_1} m_u \wedge \lnot i_u\) and \(M \models _{w_2} \lnot m_u \wedge i_u\). Clearly, \(w_1\) complies with \(O_1\) but violates \(O_3^\prime \), whereas \(w_2\) complies with \(O_3^\prime \) but violates \(O_1\). An arrow (solid or dotted) labelled with an obligation and directed from a world \(w_i\) to a world \(w_j\) represents the fact that the obligation is coherent with \(w_i\) being preferred to \(w_j\). Figure 3a represents the situation where no severity relation is specified, whereas Fig. 3b illustrates the result of introducing a severity relation \(O_3^\prime \succ _{o}O_1\), which reflects the requirement in our scenario that intercepting is more critical than monitoring. In the first case we have that \( coherent (w_1,w_2,O_1)\) and \( coherent (w_2,w_1,O_3^\prime )\) hold. Since the two obligations are incomparable, no preference between the two worlds can be inferred. In the second case, since violations of \(O_3^\prime \) are defined to be more severe than those of \(O_1\), and there is no other obligation coherent with \((w_1,w_2)\) being included in the preference relation, we have that \(w_2\) is preferred to \(w_1\). We can think of the arrow labelled with an obligation \(O_i\) as overriding the arrows labelled with any \(O_j\) such that \(O_i \succ _{o}O_j\). 
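The statement of Eq. 1 does not survive in this copy of the text. From Definitions 2 and 3 and the informal reading given in this section, it can plausibly be reconstructed as follows (a reconstruction, not the paper's verbatim formula):

```latex
% Plausible reconstruction of Eq. 1: w_i is preferred to w_j iff some
% obligation supports the preference, and every obligation opposing it
% is overridden by a more severe supporting obligation.
(w_i, w_j) \in P_{W} \ \text{iff}\
  \exists O_a \in OS \,.\, coherent(w_i, w_j, O_a)
  \ \wedge\
  \forall O_b \in OS \,.\,
    \big( incoherent(w_i, w_j, O_b) \rightarrow
      \exists O_c \in OS \,.\, coherent(w_i, w_j, O_c)
        \wedge O_c \succ_{o} O_b \big)
```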
Equation 1 can be interpreted as saying that \(w_i\) is preferred to \(w_j\) if for each arrow from \(w_j\) to \(w_i\) there is one arrow from \(w_i\) to \(w_j\) that overrides it, and there is at least one such arrow from \(w_i\) to \(w_j\).
Our “ideality” preference relation, computed according to Eq. 1, must be guaranteed to be transitive and acyclic. Transitivity is an intuitive property for a preference relation, including those used in other preference-based deontic logics. Acyclicity is required in order for us to be able to rank the possible worlds from the most to the least compliant, which we do in Sect. 3.3. Moreover, given transitivity, and given the fact that our preference relation is strict, the presence of a cycle would imply that each world in the cycle is less compliant than itself. These properties are, therefore, necessary for this relation to effectively guide practical reasoning.
Lemma 1
Given a set of possible worlds, W, a set of obligations, OS, and an acyclic severity specification that contains no infinite chain of preferences, \(P_{o}\), the preference relation over possible worlds computed according to Eq. 1 is transitive.
Proof
Assume \(w_2 \prec _{w} w_1 \) and \(w_3 \prec _{w} w_2 \); we must show that \(w_3 \prec _{w} w_1 \). In what follows, we abuse notation and write \(O \in w_i\) to mean that obligation O is violated in world \(w_i\). According to Eq. 1, we must show that (i) every obligation that is incoherent with \((w_3,w_1)\) is overridden by a more severe obligation coherent with \((w_3,w_1)\), and (ii) there is at least one obligation coherent with \((w_3,w_1)\). For (i), consider an obligation \(O_3\) violated in \(w_3\); there are three cases:
 1
\(O_3 \in (w_1 \cap w_2 \cap w_3)\) or \(O_3 \in (w_1 \cap w_3) \setminus w_2\). In these situations, \(O_3\) is also in \(w_1\), and therefore \(O_3\) is not incoherent with \((w_3,w_1)\).
 2 \(O_3 \in ((w_2 \cap w_3) \setminus w_1)\). In this case, the obligation \(O_3\) is also violated in \(w_2\). Since \(w_2 \prec _{w} w_1 \) holds, and since \(O_3 \not \in w_1\), there must be an obligation \(O_{1,2} \in w_1 \setminus w_2\) such that \({O_{1,2}}\succ _{o}{O_3}\). We can distinguish between two subcases:
 2.1
\(O_{1,2} \in w_1 \setminus (w_2 \cup w_3)\). In this case, \(O_{1,2}\) is coherent with \((w_3,w_1)\), and is more severe than \(O_3\).
 2.2 \(O_{1,2} \in ((w_1 \cap w_3) \setminus w_2)\). In this situation, \(O_{1,2}\) is not coherent with \((w_3,w_1)\) because it is also violated in \(w_3\). Since \(O_{1,2}\) is also in \(w_3\), but not in \(w_2\), it is incoherent with \((w_3,w_2)\). Therefore, there must exist an obligation \(O_j \in w_2 \setminus w_3\) that is more severe than \(O_{1,2}\). We can distinguish between two further subcases:
 2.2.1
\(O_{j} \in w_2 \setminus (w_3 \cup w_1)\). This situation is depicted in Fig. 4. Since \(O_j\) is incoherent with \((w_2,w_1)\), there must be an obligation \(O_k \in w_1 \setminus w_2\) that is more severe than \(O_j\). From the transitivity of \(P_{o}\) we have that \(O_k \succ _{o}O_3\). If \(O_k\) is not in \(w_3\) (that is, \(O_k \in w_1 \setminus (w_2 \cup w_3)\)), then \(O_k\) is coherent with the pair \((w_3,w_1)\). If \(O_k\) is also in \(w_3\), that is, \(O_k \in ((w_1 \cap w_3) \setminus w_2)\), we can again apply Case 2, taking \(O_{1,2} = O_k\). Note that at each recursive application of Case 2, \(O_k\) must be different from any previous value of \(O_{1,2}\), otherwise \(P_{o}\) would be cyclic. Since \(P_{o}\) does not contain any infinite chain of preferences, this recursive reasoning must eventually terminate with an \(O_k \in w_1 \setminus (w_2 \cup w_3)\), or with Case 2.2.2 below.
 2.2.2
\(O_{j} \in (w_2 \cap w_1) \setminus w_3\). In this case, \(O_j\) is also in \(w_1\), and therefore is coherent with \((w_3,w_1)\). Moreover, by the transitivity of \(P_{o}\), we have that \(O_j \succ _{o}O_3\).
 3 \(O_3 \in (w_3 \setminus (w_1 \cup w_2))\). Given \(w_3 \prec _{w} w_2 \), and since \(O_3\) is incoherent with \((w_3,w_2)\), there must be an obligation \(O_{2,3} \in (w_2 \setminus w_3)\) such that \(O_{2,3} \succ _{o}O_{3}\). We distinguish two subcases:
 3.1 \(O_{2,3} \in (w_2 \setminus (w_1 \cup w_3))\). Given \(w_2 \prec _{w} w_1 \), and since \(O_{2,3}\) is incoherent with \((w_2,w_1)\), there must be an obligation \(O_{1,2} \in (w_1 \setminus w_2)\) such that \(O_{1,2} \succ _{o}O_{2,3}\). We distinguish between two further subcases:
 3.1.1
\(O_{1,2} \in (w_1 \setminus (w_2 \cup w_3))\). By the transitivity of \(P_{o}\), we have that \(O_{1,2} \succ _{o}O_{3}\). Moreover, \(O_{1,2}\) is coherent with \((w_3,w_1)\).
 3.1.2
\(O_{1,2} \in ((w_1 \cap w_3) \setminus w_2)\). This situation is depicted in Fig. 5. In this case, obligation \(O_{1,2}\) is neither coherent, nor incoherent with \((w_3,w_1)\). By reasoning in a similar way to Case 2.2.1, it is easy to see that there must be an obligation \(O_k \in (w_1 \setminus w_3)\) that is more severe than \(O_3\).
 3.2
\(O_{2,3} \in (w_2 \cap w_1) \setminus w_3\). Since \(O_{2,3}\) is also in \(w_1\), it is coherent with \((w_3,w_1)\).
This case-by-case analysis proves that, for each obligation \(O_3\) that is incoherent with \((w_3,w_1)\), there exists at least one obligation that is coherent with \((w_3,w_1)\) and that is more severe than \(O_3\). It remains to show (ii): that at least one obligation coherent with \((w_3,w_1)\) exists. Since \(w_3 \prec _{w} w_2 \), there must be at least one obligation \(O_{2,3}\) that is coherent with \((w_3,w_2)\); that is, \(O_{2,3} \in (w_2 \setminus w_3)\). We distinguish the following cases:
 4
\(O_{2,3} \in ((w_1 \cap w_2) \setminus w_3)\). Since \(O_{2,3} \in w_1\), it is also coherent with \((w_3,w_1)\).
 5 \(O_{2,3} \in (w_2 \setminus (w_1 \cup w_3))\). Since \(w_2 \prec _{w} w_1 \), and \( incoherent (w_2,w_1,O_{2,3})\), there must be an obligation \(O_{1,2} \in (w_1 \setminus w_2)\) that is more severe than \(O_{2,3}\). There are two subcases:
 5.1
\(O_{1,2} \in (w_1 \setminus (w_2 \cup w_3))\). If so, \(O_{1,2}\) is coherent with \((w_3,w_1)\).
 5.2
\(O_{1,2} \in ((w_1 \cap w_3) \setminus w_2)\). This situation is depicted in Fig. 6. Since \(O_{1,2}\) is also in \(w_3\), and \(w_3 \prec _{w} w_2 \), there must be an \(O_j \in w_2 \setminus w_3\) that is more severe than \(O_{1,2}\). This, in turn, must be different from \(O_{2,3}\), otherwise there would be a cycle in \(P_{o}\). If \(O_j \in ((w_2 \cap w_1) \setminus w_3)\), then \(O_j\) is also coherent with \((w_3,w_1)\). If, on the other hand, \(O_j \in (w_2 \setminus (w_1 \cup w_3))\), then, by reasoning in a similar way to Case 2.2.1, we can see that there must be an \(O_k\) in \((w_1 \setminus (w_2 \cup w_3))\) that is coherent with \((w_3,w_1)\). \(\square \)
Lemma 2
Given a set of possible worlds, W, a set of obligations, OS, and an acyclic severity specification that contains no infinite chain of preferences, \(P_{o}\), the preference relation among possible worlds, \(P_{W}\), computed according to Eq. 1 does not contain any finite cycle.
Proof
Assume that there is a cycle in \(P_{W}\). From the transitivity of \(P_{W}\), it follows that, for every possible world \(w_i\) in the cycle, \(w_i \prec _{w} w_i \) holds. However, no obligation can be coherent with \((w_i,w_i)\), so it follows from Eq. 1 that \(w_i \prec _{w} w_i \) cannot hold, a contradiction. Therefore, there can be no cycle in \(P_{W}\). \(\square \)
3.3 Computing a ranking over possible worlds
We can now use \(P_{W}\) to rank worlds from the most to the least compliant.
Definition 4
Since there are no cycles in \(P_{W}\) (Lemma 2), such a ranking can always be computed, but the question remains: how can this be done efficiently?
Suppose there is a function \( VI : W \times 2^{OS} \rightarrow 2^{OS}\) that, given a possible world and a set of obligations, returns the set of obligations that are violated in that world. If the satisfaction of a formula in a possible world can be computed in constant time, then the complexity of a naïve algorithm for this function is \(O(\vert OS\vert \cdot \vert W \vert )\).
Given two worlds \(w_1\) and \(w_2\), Algorithm 1 verifies whether \(w_1 \prec _{w} w_2 \). The algorithm first computes the set \(V_2\) of violated obligations that are coherent with \((w_1,w_2)\), and the set \(V_1\) of those that are incoherent with \((w_1,w_2)\). If \(V_2\) is empty, then we can conclude that \(w_1 \prec _{w} w_2 \) does not hold. The algorithm then uses a depth-first search from multiple starting points (all the violated obligations in \(V_2\)) to compute the set of all obligations that are reachable from at least one violated obligation in \(V_2\); that is, all those violated obligations that are less severe than at least one obligation in \(V_2\). Finally, we need only verify whether \(V_1\) is included in this set of reachable obligations. Since we run a single depth-first search, we visit every violated obligation, and every member of \(P_{o}\), at most once: the algorithm runs in time \(O(\vert OS\vert + \vert P_{o}\vert )\).
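Algorithm 1 itself is not reproduced in this excerpt; the following Python sketch implements the description above, with each world represented by its set of violated obligations and \(P_{o}\) as an adjacency mapping. The obligation names follow the scenario (the text specifies that violating \(O_3\) is more severe than violating \(O_5\)), but the encoding is an assumption for illustration.

```python
def preferred(vi_w1, vi_w2, p_o):
    """Decide whether w1 <_w w2, given each world's violation set and P_o."""
    v2 = vi_w2 - vi_w1          # violations coherent with (w1, w2)
    v1 = vi_w1 - vi_w2          # violations incoherent with (w1, w2)
    if not v2:
        return False            # no obligation supports preferring w1
    # Multi-source DFS: all obligations less severe than some member of V2
    stack, reachable = list(v2), set()
    while stack:
        node = stack.pop()
        for succ in p_o.get(node, ()):
            if succ not in reachable:
                reachable.add(succ)
                stack.append(succ)
    # w1 <_w w2 iff every incoherent violation is overridden
    return v1 <= reachable

# Severity from the scenario: violating O3 is more severe than violating O5
p_o = {"O3": ["O5"]}
print(preferred({"O5"}, {"O3"}, p_o))   # True: a world violating only O5
                                        # is preferred to one violating O3
print(preferred({"O3"}, {"O5"}, p_o))   # False
```

A world whose violation set is a strict subset of another's is preferred without consulting \(P_{o}\) at all, since `v1` is then empty; this matches the subset-based pruning discussed next.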
To compute the preference relation, we need to compare each ordered pair of possible worlds, or each pair of subsets of \(OS\), depending on which is smaller. The resulting algorithm has complexity \(O( \min ( \vert W \vert ^2 , 2^{2 \vert OS\vert }) \cdot (\vert OS\vert + \vert P_{o}\vert ))\). Some properties of the preference relation can be used to decrease the number of comparisons needed. In particular, from Eq. 1, it is straightforward that, if a world \(w_i\) violates a set of obligations \(VI(w_i,OS)\), then every world \(w_k\) such that \(VI(w_k,OS) \subset VI(w_i,OS)\) is preferable to \(w_i\). Since \(P_{W}\) is transitive, given two worlds \(w_i\) and \(w_j\), once we have established that \(w_i \prec _{w} w_j \) holds, we can infer that, for all worlds \(w_k\) such that \(VI(w_k,OS) \subseteq VI(w_i,OS)\), \(w_k \prec _{w} w_j \).
This ranking of possible worlds, computed on the basis of a normative system specification that captures both contrary-to-duty obligations and varying severity of violation, can be used to guide agents within a team to make effective decisions about what to do. The challenge now is that these decisions need to take into account strategies of action, rather than simply considering the compliance of agents with isolated states of affairs. Agents need to consider future possible compliance with a set of norms when making (collective) action decisions now. Decision mechanisms must account for uncertainty in action outcomes and exogenous influences on the state of the environment, and enable agents to coordinate their behaviour with others that have influence over the environmental state. In order to model decisions in this context, we propose a novel, decentralised planning mechanism that is driven by qualitative rewards reflecting this norm-based ranking of possible worlds.
Ranking of possible worlds in the harbour protection scenario

R | Id | World | Violations
1 | \(w_{9}\) | \(\lnot i_h\), \(\lnot rep \), \( i_b\), \(\lnot m_h\), \(\lnot r_u\), \(m_u\), \(\lnot i_u\) | (none)
... | ... | ... | ...
3 | \(w_{16}\) | \(\lnot i_h\), \(\lnot rep \), \(\lnot i_b\), \( m_h\), \(r_u\), \(\lnot m_u\), \(i_u\) | \(O_1\), \(O_5\)
4 | \(w_{15}\) | \( i_h\), \(\lnot rep \), \(\lnot i_b\), \(\lnot m_h\), \(\lnot r_u\), \(\lnot m_u\), \(\lnot i_u\) | \(O_1\), \(O_2\)
... | ... | ... | ...
6 | \(w_{3}\) | \(\lnot i_h\), \( rep \), \(\lnot i_b\), \(\lnot m_h\), \(\lnot r_u\), \(m_u\), \(\lnot i_u\) | \(O_3\)
7 | \(w_1\) | \(\lnot i_h\), \( rep \), \(\lnot i_b\), \(m_h\), \(\lnot r_u\), \(\lnot m_u\), \(\lnot i_u\) | \(O_1\), \(O_3\)
7 | \(w_2\) | \(\lnot i_h\), \( rep \), \(\lnot i_b\), \(\lnot m_h\), \(r_u\), \(m_u\), \(\lnot i_u\) | \(O_3\), \(O_5\)
... | ... | ... | ...
\(\Lambda = 15\) | \(w_{22}\) | \(\lnot i_h\), \(\lnot rep \), \(\lnot i_b\), \(\lnot m_h\), \( r_u\), \(\lnot m_u\), \(\lnot i_u\) | \(O_1\), \(O_2\), \(O_3\), \(O_4\), \(O_5\)
Given that we can reliably compute a ranking over possible worlds that takes into account normative constraints, we now turn to the problem of norm-governed planning for a team of agents.
4 Norm-governed multi-agent planning
Decentralised Partially Observable Markov Decision Processes (Dec-POMDPs) are an effective means to model collective, distributed decision making in which multiple agents, each with a particular view of the environment, must coordinate their actions in a decentralised fashion in order to optimise some joint reward [2]. Existing Dec-POMDP formalisations are founded on a real-valued reward function that specifies the value an agent obtains from performing some action in some state of affairs. Our problem is different, however. We require a model of decentralised planning in which agents are rewarded for remaining as compliant with social norms as possible. We have shown that norms are most naturally organised as levels of compliance. We want agents to operate in such a way that they maximise their compliance, and in order to motivate agent behaviour in this way we require a model of rewards that reflects these qualitative levels of compliance. To achieve this aim, we propose a novel model of Dec-POMDPs with qualitative rewards, which we dub N-Dec-POMDPs to reflect our aim of developing a model of severity-sensitive and norm-governed multi-agent planning.
4.1 N-Dec-POMDPs
An N-Dec-POMDP is defined as a tuple, \(\langle I, S, b ^0 , \{ A_{i} \} , P_{s}, \{ E_{i} \} , P_{\mathbf {e}}, R \rangle \), where: \(I\) is a set of agents, and \(S\) is the set of states; \(b ^0\) is an initial belief state, i.e. a probability distribution over possible initial states; \(A_{i}\) is a finite set of actions available to agent i, and \(\mathbf {a}_{} = \langle a_{1} ,\ldots , a_{n} \rangle \) is a joint action (one for each agent); \(P_{s} (s_{j} \vert s_{i} , \mathbf {a}_{} {} )\) represents the probability that taking joint action \(\mathbf {a}_{}\) in state \(s_{i}\) will result in a transition to state \(s_{j}\); \(E_{i}\) is a finite set of observations available to agent i, and \(\mathbf {E}_{} \) is the set of joint observations \(\mathbf {e}_{}\), each consisting of one local observation per agent; \(P_{\mathbf {e}} (\mathbf {e}_{} \; \vert \; s_{j}, \mathbf {a}_{} ) \) specifies the probability of observing \(\mathbf {e}_{} \) when performing a joint action \(\mathbf {a}_{} \) that leads to a state \(s_{j}\); and \(R \) is a reward function, the definition of which we provide below.
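As a minimal illustration, the tuple can be collected into a single container. The concrete field types here (dicts keyed by states and joint actions, and a state-to-severity-rank mapping standing in for the qualitative reward \(R\)) are assumptions for the sketch, not the paper's definitions.

```python
from dataclasses import dataclass

@dataclass
class NDecPOMDP:
    agents: list          # I
    states: list          # S
    b0: dict              # initial belief: state -> probability
    actions: dict         # A_i: agent -> list of local actions
    p_s: dict             # (s_i, joint_action) -> {s_j: probability}
    observations: dict    # E_i: agent -> list of local observations
    p_e: dict             # (s_j, joint_action) -> {joint_obs: probability}
    reward: dict          # qualitative reward stand-in: state -> severity rank

# Toy instance loosely based on the harbour scenario; transition and
# observation models are left empty, and ranks are illustrative only
# (lower rank = more compliant).
model = NDecPOMDP(
    agents=["uav", "heli"],
    states=["ok", "B", "W"],
    b0={"ok": 1.0},
    actions={"uav": ["monitor", "intercept"], "heli": ["monitor", "intercept"]},
    p_s={}, observations={}, p_e={},
    reward={"ok": 1, "B": 3, "W": 7},
)
```

Representing the reward as a severity rank rather than a real value is the essential departure from standard Dec-POMDPs: ranks can be compared, but not meaningfully summed across levels.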
We focus on finite-horizon NDecPOMDPs, and so assume that execution terminates after \(H\) steps. An action-state history is a sequence of joint actions, each followed by a state, \((\mathbf {a}_{} ^1,s^1, \ldots , \mathbf {a}_{} ^t,s^t)\), and an action-observation history is a sequence of local actions, each followed by a local observation, \((a^1,e^1, \ldots , a^t,e^t)\), up to a time instant t. Agents decide how to act only according to their local observations, and so a solution for an NDecPOMDP is a joint-policy \(\mathbf {q} \in \mathbf {Q}\), consisting of a local policy \(q_i \in Q_i\) for each agent i; i.e. \(\mathbf {Q} = Q_1 \times \cdots \times Q_n\). Each local policy maps action-observation histories to stochastic sub-policy choices.
To get an intuition of the strategies that are developed for agents during this planning process, consider again the harbour protection scenario introduced in Sect. 2. A good policy for the UAV may be to continue surveillance, even if it has observed an unauthorised boat in the restricted area, but this depends on the context. If it is operating in a team with some other agent (e.g. a helicopter), it may keep monitoring with high probability if it observes only one boat in the area (assuming the other agent will intercept it), and with low probability it will intercept the boat. In situations where the UAV observes more than one boat in the area, it may decide to intercept (or report) one of the boats with higher probability. These policies are, of course, stochastic, and observations (e.g. detection of an incursion) lead to a choice between different sub-policies for each agent in the team. The objective of this planning process is to find, given the initial belief state \(b^0\), a joint-policy for the agents in the team that maximizes the total expected value of the joint reward over the horizon \(H\). Given a t-steps-to-go joint-policy \(\mathbf{q}^{t}\), \(q_i^{t}\) is the local policy for agent i, \(\mathbf{a}_{\mathbf{q}^{t}}\) is the joint-action prescribed by the policy, and \(\pi^t_i : E_i \times Q_i^{t-1} \rightarrow [0,1]\) are the stochastic mappings that return, for each agent i and observation \(e\), the probability of selecting local sub-policy \(q_i^{t-1} \in Q_i^{t-1}\) after observing \(e\).
The reward function \(R\) induces a preference ordering over histories such that:
 1.
Histories that include states with lower compliance levels are less preferred; and
 2.
If two histories are incomparable with respect to violations for all compliance levels lower than level i, the history with fewer violations at level i is preferred.
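One reading of these two conditions is a lexicographic comparison of per-level violation counts, from the most severe level down. A minimal sketch (the function name and the counts-vector encoding are ours, not the paper's reward function itself):

```python
def prefers(h1_counts, h2_counts):
    """Return True if history h1 is strictly preferred to history h2.

    h*_counts[l] = number of states in the history violating norms at
    severity level l, with l = 0 the MOST severe level. Comparison is
    lexicographic from the most severe level down: avoiding a single
    severe violation outweighs any number of milder ones.
    """
    for v1, v2 in zip(h1_counts, h2_counts):
        if v1 != v2:
            return v1 < v2  # fewer violations at the more severe level wins
    return False  # equal at every level: no strict preference
```

For example, `prefers([0, 5], [1, 0])` holds: five violations at the milder level are still preferred to one violation at the severe level.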
4.2 Policy optimisation in NDecPOMDPs
Given that finding a \(\gamma\)-approximation of an optimal policy for a DecPOMDP is NEXP-complete, finding such a policy for an NDecPOMDP is at least as hard: the introduction of qualitative levels of reward, representing norm compliance, does not simplify the underlying decision problem. Due to this complexity, our approach is to develop an algorithm that can efficiently find solutions without providing any guarantees on solution quality with respect to the optimum. One of the most successful existing algorithms for solving large instances of DecPOMDPs is Point-Based Policy Generation (PBPG) [32]. As discussed in Sect. 6.1, PBPG starts from the last time step and moves backwards, using the t-steps-to-go policies as possible sub-policies for the \((t+1)\)-steps-to-go policies. At each step, a heuristic is used to select a set of reachable belief states. A set of candidate policies is then generated and evaluated from those belief states, and only the best \( maxTrees \) policies are retained. In the policy generation phase, one candidate for each possible joint-action is created, and a linear program is used to find suboptimal stochastic mappings for the given belief state and joint-action. The mappings for each agent are iteratively improved while the other agents’ policies are fixed. We adapt PBPG in order to approximately solve NDecPOMDPs and, in Sect. 4.3, propose a novel heuristic for qualitative reward domains to restrict the selected belief states.
The reward function of an NDecPOMDP has its codomain in \(\Psi \), and so we need a procedure for policy optimisation that accepts such reward values and returns expected values in \(\Psi \); note that \(P_{s}\), \(P_{\mathbf {e}}\), and \(\pi _{i}^{t}\) remain functions with real codomain. To achieve this, we can use a combination of Eqs. 7 and 10 to evaluate joint policies.
Note that improving the j-th component might result in situations where, for some l-th component with \(l > j\), only a negative improvement is possible. These improvements represent a trade-off in which we accept a decrease in one component of \(V^{t}(\mathbf {\pi }, b)\) in order to improve one that is associated with a higher ranking level (lower compliance). We say that an improvement sequence \(\delta _0 , \ldots , \delta _{\Lambda }\) is acceptable if and only if, for each negative improvement \(\delta _j < 0\), there exists a \(\delta _k > 0\) such that \(k < j\). Informally, an improvement is not acceptable if it leads to a policy that has lower expected value than the initial one according to the ordering among extended reals.
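The acceptability condition can be checked with a single left-to-right scan over the sequence; a minimal sketch (function name ours):

```python
def is_acceptable(deltas):
    """Check acceptability of an improvement sequence delta_0..delta_Lambda.

    deltas[j] is the improvement to the j-th component of the value vector
    (component 0 = highest ranking level, i.e. most severe). A negative
    improvement is tolerated only if some strictly positive improvement
    occurred at an earlier (more severe) component.
    """
    seen_positive = False
    for d in deltas:
        if d > 0:
            seen_positive = True
        elif d < 0 and not seen_positive:
            return False
    return True
```

So `[0.2, -0.1]` is acceptable (a loss at a milder level buys a gain at a more severe one), while `[-0.1, 0.5]` is not.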
Algorithm 3 formalises the greedy LP optimization. The algorithm takes as input the belief \(b^t\) against which we are evaluating our policy, the candidate joint-action \(\mathbf {a}_{} \), and the set of possible sub-policies \(Q^t_i\) for each agent \(1 \le i \le n\); it returns, for each agent \(1 \le i \le n\), a function \(\pi _i\) that maps its local observations to a probability distribution over \(Q^t_i\). The algorithm starts by initialising the mappings randomly and evaluating them over the belief \(b^t\) using Eq. 4. It then considers each agent in turn in order to improve its local policy (Lines 5–16). For each agent, the algorithm applies the LP of Fig. 9 to improve each component of the expected value function \(V^{\pi }\) (7), starting from the one associated with the highest ranked (least norm-compliant) level. After each call to the LP, the value \(V^{\pi }\) is updated accordingly and, if the value for the level being considered is greater than 0, the improvement for the current agent is terminated (Line 13). This procedure is repeated until an iteration completes without any change in the value function (Lines 3–17).
4.3 The most-critical-states heuristic
In PBPG, heuristics are used to identify relevant belief states against which to optimize the policy. The intuition behind PBPG is that, if the agents act in a way that is close to the optimum, only a subset of states will be reachable. In building policies in a bottom-up fashion, therefore, we can optimize them only against those states that are most likely to be encountered during execution, increasing the scalability of the algorithm.
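Reachable beliefs are generated by repeated belief updates along sampled action-observation sequences. A generic single-step update, following the \(P_s\) and \(P_e\) signatures of the model definition (a standard filtering sketch, not the PBPG implementation of [32]):

```python
def belief_update(b, a, e, states, P_s, P_e):
    """One-step belief update used when sampling reachable beliefs.

    Standard (Dec-)POMDP filtering:
        b'(s') proportional to P_e(e | s', a) * sum_s P_s(s' | s, a) * b(s)
    b is a dict mapping states to probabilities; the result is normalised.
    """
    unnorm = {}
    for s_next in states:
        unnorm[s_next] = P_e(e, s_next, a) * sum(
            P_s(s_next, s, a) * p for s, p in b.items())
    z = sum(unnorm.values())
    if z == 0.0:
        raise ValueError("observation e has zero probability under belief b")
    return {s: p / z for s, p in unnorm.items()}
```

With identity dynamics and an observation that is correct with probability 0.8, a uniform belief over two states sharpens to (0.8, 0.2) after one observation.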
With our method of mapping a normative system specification into a ranking over possible worlds, the magnitude and greedy approaches to solving an NDecPOMDP, and the Most Critical States (MCS) heuristic in place, we can move on to evaluate our model. Existing benchmark problems for (Dec)POMDPs do not include normative (soft) constraints, and so we use the multi-agent harbour protection scenario, which involves both CTD structures and varying violation severity. We can, however, directly compare the magnitude and greedy approaches to solving an NDecPOMDP, and compare MCS with standard PBPG heuristics.
5 Evaluation
In the harbour protection scenario introduced in Sect. 2, there are restricted and unrestricted areas, and both agents (UAV, helicopter and patrol boat) and unauthorized boats can move between them. The UAV and the helicopter can perform the action \( monitor \) in order to start monitoring their current area; this action always succeeds, but monitoring does not guarantee detection of an unauthorised boat. Each agent is able to observe the location of unauthorized boats in the same area with probability 0.15. By monitoring an area, an agent increases this probability to 0.75. Each agent can perform \( intercept _i\) in order to intercept the ith unauthorized boat, and an action \( report \) to report an incursion. Each of these actions will succeed with a probability of 0.8, and the agent must commit to this task over two time steps to have an effect. Each agent is able to observe its own location and, with a certain probability, the location of agents in the same area. By monitoring an area, an agent increases its probability of correctly observing other agents’ locations. The behaviour of unauthorized boats is controlled by the simulation. Throughout the simulation, each boat will move from the unrestricted area to the restricted area with probability 0.11 or return to the unrestricted area with probability 0.3. Initial exploration of possible values for these probabilities indicated that these gave a good level of dynamism and indeterminism in the simulation, and hence represent a good level of challenge for the planning problem. We chose a horizon, \(H=20\), for all simulations; preliminary experiments showed that a horizon greater than 20 offered no additional benefit to the quality of the plans computed for any of the algorithm/heuristic combinations. For the same reason, for each simulation \( maxTrees =2\) (the number of policies retained at each step during plan generation).
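The boat dynamics above amount to a two-state Markov chain per boat, using the transition probabilities reported for the scenario (the function name and encoding are ours):

```python
import random

def step_boat(in_restricted, rng):
    """One simulation step for a single unauthorized boat.

    Scenario probabilities: a boat in the unrestricted area enters the
    restricted area with probability 0.11; a boat in the restricted area
    returns to the unrestricted area with probability 0.3 (so it stays
    with probability 0.7). Returns True iff the boat is now restricted.
    """
    if in_restricted:
        return rng.random() >= 0.3   # stays restricted with probability 0.7
    return rng.random() < 0.11       # enters restricted with probability 0.11
```

Over many steps from the unrestricted area, roughly 11% of transitions enter the restricted zone, matching the stated dynamics.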
Each experimental condition was repeated 20 times with identical initial conditions: all unauthorized boats in the unrestricted zone, and all agents in the restricted zone with no agent monitoring.
Planning results for standard PBPG heuristics and MCS with the magnitude and greedy linear program planners:

              Standard                                MCS
Ag  B    Value                 Time                Value                 Time
Magnitude LP
2   3    5.88e-20 ± 3.76e-23   32.05 ± 2.89        5.88e-20 ± 3.59e-23   27.10 ± 2.10
3   2    3.68e-20 ± 3.70e-23   185.25 ± 15.69      2.69e-20 ± 8.27e-23   146.75 ± 14.61
3   3    4.61e-20 ± 3.58e-23   7609.10 ± 484.27    4.76e-20 ± 2.64e-22   5375.60 ± 578.30
Greedy LP
2   3    5.88e-20 ± 3.02e-23   31.15 ± 2.96        5.88e-20 ± 3.04e-23   23.65 ± 2.78
3   2    3.28e-20 ± 2.86e-21   112.20 ± 7.38       2.61e-20 ± 4.56e-22   79.85 ± 7.66
3   3    4.58e-20 ± 3.04e-22   4984.80 ± 483.49    4.70e-20 ± 5.94e-22   3236.25 ± 392.69
Pairwise differences in execution time: p-values

Comparison of LPs                    Comparison of heuristics
Ag  B   Standard   MCS              Ag  B   Magnitude   Greedy
2   3   1.000      0.133            2   3   0.000       0.000
3   2   0.000      0.000            3   2   0.048       0.034
3   3   0.000      0.000            3   3   0.003       0.003
We further explored the differences in performance of the LPs in order to isolate the effect of choosing the greedy LP over the magnitude LP. There is no significant difference in the execution time of the magnitude and greedy LPs with both Standard and MCS heuristics in the 2-agents, 3-boats case (Fig. 10, p-values being 1.000 and 0.133, respectively). We believe that this is due to the fact that this is an over-constrained problem, such that we cannot easily discount those strategies that are more likely to lead to the most severe violations. The strategy of solving multiple, smaller LPs at different ranking levels and terminating when no significant improvement can be found, therefore, has little effect. There are, however, significant differences in all other comparisons between the LPs, such that the greedy LP significantly outperforms the magnitude LP.
We then explored the differences in performance of the heuristics in order to isolate the effect of choosing the MCS heuristic over the Standard heuristic. The positive effect of using the MCS heuristic over Standard is significant in all cases for either LP. It is interesting to note that, although significant, the benefit of using the MCS heuristic is more marginal in the 3-agents, 2-boats case. This is the least constrained of the problems considered, given that even if both unauthorised boats enter the restricted area at the same time, there are sufficient agents to intercept both and maintain surveillance. This is expected, because the MCS heuristic was specifically designed to provide additional guidance to decision-making in more challenging, over-constrained scenarios.
Comparison of MCS+Greedy and random play

Ag  B   MCS + Greedy            Random
2   3   5.88e-20 ± 3.04e-23     9.10e-1 ± 1.03e-1
3   2   2.61e-20 ± 4.56e-22     2.40e-1 ± 2.69e-2
3   3   4.70e-20 ± 5.94e-22     7.59e-1 ± 7.17e-2
6 Discussion
In placing this research in context, we first discuss the current research landscape on single- and multi-agent planning under normative (or equivalent) constraints. We then explore in more detail the model of norms used in this research and alternative modelling approaches, and discuss some avenues for future investigation. Our conclusions follow this extended discussion.
6.1 Related research in norm-governed planning
Models of practical reasoning where action is both constrained by causal dependencies and guided by ideals of behaviour (soft constraints or preferences) have been studied in a range of contexts. Gerevini and Long [17] extended the Planning Domain Description Language, PDDL, to include preferences. These are represented as boolean formulae that are satisfied or violated by a plan. Even in the presence of preferences, however, PDDL requires the domain designer to specify a quantitative metric function to be optimised by a planner. These metrics may, or may not, depend on the satisfaction of each constraint. For example, it is possible to assign a real-valued weight to each constraint violation. Our approach is different because we specify qualitative preferences among constraint violations. Bienvenu et al. [5] do consider qualitative priorities over constraints. Their formalism allows alternative plans to be evaluated according to a lexicographic ordering over the set of constraints. Using this approach, we can specify that avoiding the violation of a single norm is more important than avoiding the violation of any other norms that follow in the lexicographic order. Our approach is different, however: we do not simply consider norms in a lexicographic order; rather, we compute severity levels that depend on all norms that are violated in a state, and then consider severity levels in a lexicographic order. Moreover, Bienvenu et al. [5] do not take into account how many times each constraint is violated. They do support the specification of temporally extended preferences; that is, preferences that regulate execution paths instead of single states. While we do not discuss this issue in this paper, our model can be directly extended to represent and reason about temporally extended norms, as discussed in prior research [16]. These models only capture deterministic, single-agent scenarios, and do not support the specification of contrary-to-duty constraints.
In norm-aware practical reasoning, a number of different methods for reasoning about norm compliance have been proposed. Meneguzzi and Luck [20], for example, extend the BDI architecture to take norms into consideration. They describe algorithms that enable agents to react to the activation and expiration of norms by modifying their intentions; i.e. by introducing plans for the fulfilment of obligations and removing plans that violate prohibitions. Dignum et al. [12] discuss the introduction of a preference relation over norms to solve normative conflicts. This preference relation is taken into account only in situations where it is not possible to comply with all norms. Our decision-theoretic approach is different in the sense that an agent might decide to violate a less severe norm even in the absence of a conflict if, in doing so, the probability of violating a more severe norm in the future decreases. Moreover, Dignum et al. only consider single-agent scenarios with a simplified (state-based) representation of norms. While these approaches support the specification of, and reasoning about, contrary-to-duty obligations, they only consider single-agent scenarios where the environment is fully observable and actions are deterministic.
Fagundes et al. [13] use Markov Decision Processes (MDPs) [23] to model a selfinterested agent that takes into account norms, and the possibility of violating them, in deciding how to act. Violations are associated with sanctions, which result in the modification of the transition probabilities, or of the agent’s capabilities. Agents consider the effects of sanctions on their expected utility and weigh these against the potential benefits of violating norms in order to decide upon a course of action. This model does not, however, explicitly capture the relative severity of norm violations, and the representation of sanctions relies on the assumption that the norm enforcement authority has the power to affect the agent’s capabilities and the probabilities of transitions.
A more appropriate representation of severity levels of norm violation could be obtained by representing the problem as a multi-objective MDP; that is, an MDP where the reward is a vector rather than a scalar quantity, and where each component of the vector may represent a different objective. A number of researchers have focussed on methods to efficiently solve multi-objective POMDPs (Partially Observable MDPs) [24, 26, 30]. Some of these methods attempt to find a set of policies that maximise the expected value for a set of possible scalarizations. A scalarization essentially gives a weight to each component of the reward value, and is formally defined as a linear function that takes a vector and returns a scalar. Roijers et al. [24], for example, present OLSAR, a point-based algorithm based on Perseus [27] that efficiently finds a set of approximately optimal policies for different scalarizations. Reasoning about norm compliance may be seen as a particular case of a multi-objective POMDP, where each component represents a degree of compliance. From this point of view, different scalarizations may be used to represent different degrees of severity of norm violation. Such an approach, however, does not avoid the sort of fallacies in reasoning illustrated in the introduction in terms of “fair labelling”, or with different classification levels in a security setting. An alternative would be to take the approach proposed by Soh and Demiris [26], where genetic algorithms are used to find the set of Pareto-optimal solutions for a multi-objective POMDP. A solution is Pareto-optimal if it is not possible to improve any component of the expected reward vector without decreasing the value of another component. This approach assumes that the different components of the reward are of incomparable importance, and requires the user to decide which of the Pareto-optimal solutions to adopt.
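A linear scalarization is simply a weighted sum over reward components. The sketch below (names and weights are ours) also illustrates why finite weights cannot enforce a strict severity ordering: enough minor violations always outweigh one severe violation, the fallacy noted above:

```python
def scalarize(reward_vec, weights):
    """Linear scalarization of a multi-objective reward: weighted sum.

    Each weight could encode the relative severity of a violation level,
    which is the approach discussed (not the one adopted) in the text.
    """
    assert len(reward_vec) == len(weights)
    return sum(w * r for w, r in zip(weights, reward_vec))

# Weighting severe violations 100x more than minor ones still lets
# 101 minor violations "outweigh" a single severe one.
severe_one = scalarize([1, 0], [100, 1])    # one severe violation
minor_many = scalarize([0, 101], [100, 1])  # 101 minor violations
```

Here `minor_many > severe_one`, whereas the lexicographic ordering used in this paper would always rank the severe violation as worse.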
We propose a method that exploits knowledge of the relative importance of each component to improve the efficiency of the planning algorithm. A similar approach is taken by Wray and Zilberstein [30], who propose an algorithm for solving multi-objective POMDPs in which the reward components are ordered according to their degree of importance. Their work only deals with the single-agent case, however, without considering the issue of coordination among agents.
Problems in which a coalition of agents collaborate in order to maximise a joint reward are often modelled as Decentralized Partially Observable MDPs (DecPOMDPs) [2]. In a DecPOMDP, each agent has a local and partial view of the environment, and must decide what action to perform based only on its local observations. Finding a \(\gamma \)-approximation of an optimal policy for a DecPOMDP^{4} has been proven to be intractable [4]. For this reason, a substantial amount of research has focused on algorithms that can efficiently find suboptimal solutions without providing guarantees on the solution quality. The vast majority of this work focuses on quantitative models, where the joint reward is real-valued.
Wu et al. [32], for example, propose Point-Based Policy Generation (PBPG), an algorithm for solving finite-horizon DecPOMDPs with real-valued rewards. The algorithm relies on a set of heuristics to find belief states (probability distributions over possible states) that are likely to be reachable after a given number of steps. Given an execution horizon H, the algorithm starts by finding the best one-step policies (a one-step policy consists of a single action) and evaluating them from the beliefs that are reachable at time \(H-1\). It then uses these policies as sub-policies to build a set of candidate two-step policies, which are evaluated from the beliefs reachable at time \(H-2\). The algorithm proceeds in this way until it builds the set of candidate policies for time 0. At each step, only the best \( MaxTrees \) policies are retained and used as possible sub-policies, resulting in bounded memory complexity and time complexity linear in the execution horizon. Because of this pruning, however, the algorithm does not provide guarantees on the quality of the solution with respect to the optimum.
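The backward pass of PBPG can be summarised structurally as follows. All callables here are placeholders for the components described in [32] (belief sampling, candidate construction, policy evaluation), not their actual implementations:

```python
def pbpg_skeleton(H, one_step_policies, build_candidates, evaluate,
                  sample_beliefs, max_trees):
    """Structural sketch of PBPG's backward policy construction.

    Starts from 1-step policies; at each iteration builds (t+1)-step
    candidates from the retained t-step sub-policies, evaluates them
    against beliefs sampled for the corresponding time step, and keeps
    only the best max_trees (bounded memory, linear time in H).
    """
    retained = one_step_policies
    for t in range(1, H):
        beliefs = sample_beliefs(H - t - 1)      # beliefs reachable at time H-t-1
        candidates = build_candidates(retained)  # longer policies from sub-policies
        scored = sorted(candidates, key=lambda q: evaluate(q, beliefs),
                        reverse=True)
        retained = scored[:max_trees]            # prune to the best max_trees
    return retained

# Toy run: policies are integers, candidates double/extend them, and a
# policy's "value" is just the integer itself.
best = pbpg_skeleton(
    H=3,
    one_step_policies=[1, 2],
    build_candidates=lambda r: [p * 2 for p in r] + [p * 2 + 1 for p in r],
    evaluate=lambda q, beliefs: q,
    sample_beliefs=lambda t: None,
    max_trees=2)
```

The toy run only exercises the prune-and-extend control flow; the real algorithm's candidates, beliefs and evaluation come from the DecPOMDP model.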
To the best of our knowledge, the problem of qualitative decision making in decentralized, stochastic scenarios has been previously addressed only by Brafman et al. [9]. Their work is different in spirit, however. The authors build upon a simplified DecPOMDP, where only qualitative statements about the possible transitions and observations are available, and a set of goal states is defined in place of a reward. They show that this problem can be solved using classical planning techniques. Their formalism does not permit the specification of different degrees of preferences among goals, however. Norms are often seen as constraints over the behaviour of (groups of) agents. From this point of view, our work is related to research on constrained DecPOMDPs by Wu et al. [31]. Wu et al. consider a DecPOMDP with a single reward function, but multiple cost functions. The objective is then to maximise the reward function, subject to constraints over cumulative costs. Rather than trying to minimise the number of constraint violations, their algorithm excludes all solutions that violate one or more constraints. Our aim is to find solutions that minimise the qualitative level of violation severity that occurs, and minimise the number of violations at each level.
6.2 Discussion of alternatives and future research
Our starting point in this research is the extensive body of research in normative systems specification and reasoning. We model classical contrary-to-duty structures, which capture the important notion of reparation. We also argue for the related, but complementary, notion of violation severity levels. There are, of course, assumptions we make in this research and, as discussed in Sect. 6.1, practical reasoning under normative constraints is closely related to preference-based planning. In this discussion, therefore, we briefly explore alternative approaches to modelling violation severity, and alternative reward functions for NDecPOMDPs. We discuss some of the limitations of the mechanisms proposed and indicate some avenues for future research.
Consider, for example, the following obligations from the harbour protection scenario:
 1.
It ought to be that the UAV is monitoring: \(O_1\) = \(\mathbf O (m_u \; \vert \; \top )\).
 2.
If the UAV is monitoring, it ought to be that the helicopter is not monitoring: \(O_2\) = \(\mathbf O (\lnot m_h \; \vert \; m_u)\).
 3.
If the UAV is not monitoring, it ought to be that the helicopter is: \(O_3\) = \(\mathbf O (m_h \; \vert \; \lnot m_u)\).
The most preferred world is one in which the UAV is monitoring, but the helicopter is not. The two worlds where only \(O_1\) is violated (\(\lnot m_u \wedge m_h\)) or only \(O_2\) is violated (\(m_u \wedge m_h\)) are incomparable, and the world in which neither agent is monitoring the area (\(O_1\) and \(O_3\) are violated) is least compliant.
Expressed as conditional preference statements, these obligations correspond to:
 1.
\(P_1 : m_u \prec \lnot m_u\)
 2.
\(P_2 : \lnot m_h \prec m_h\) if \(m_u\)
 3.
\(P_3 : m_h \prec \lnot m_h\) if \(\lnot m_u\)
In subsequent research, Brafman et al. [8] extend CP-nets by introducing importance relationships among variables. If, for example, variable \(X_1\) is more important than \(X_2\), we should always prefer an improvement in \(X_1\) to one in \(X_2\). This addresses some of the limitations of CP-nets, but the two variables concerned must be mutually preferentially independent. In other words, the preference over the values of one variable must not depend on the value of the other variable. Since, in our example, the preference over \(m_h\) depends on the value of \(m_u\), importance relationships are not sufficient. Thus, CP-nets cannot be used to express contrary-to-duty obligations.

\(\mathbf O (L_1 \; \vert \; \top )\)

\(\mathbf O (L_2 \; \vert \; \lnot L_1)\)

...

\(\mathbf O (L_n \; \vert \; \bigwedge _{i=1}^{n-1} \lnot L_i)\)
Our severity specification is defined as a strict partial order over single obligations; i.e. \(P_{o}\subseteq OS \times OS\). This allows us to specify that the violation of one obligation is more severe than any number of violations of another. Moreover, if we define \(O_1 \succ _{o}O_2\) and \(O_1 \succ _{o}O_3\), worlds violating both \(O_2\) and \(O_3\) will be preferred to worlds that violate \(O_1\). It would be interesting to consider an alternative relation: \(P_{o}\subseteq 2^{OS} \times 2^{OS}\). We could then express relationships such as \(\{ O_1\} \succ _{o}\{O_2\}\), \(\{O_1\} \succ _{o}\{O_3\}\) and \(\{ O_2, O_3\} \succ _{o}\{ O_1\}\). An interesting direction for future research would be to study how to compute a meaningful, acyclic preference relation over possible worlds, given this richer severity specification. Assuming such an ordering can be reliably computed, this alternative domain analysis could be directly used as input to our NDecPOMDP solver.
The reward function of an NDecPOMDP favours histories where states associated with a higher ranking level are visited less often. This approach may lead to unexpected results in some situations that involve independent^{5} norms of incomparable severity. Consider two histories, \(h_1\) and \(h_2\). History \(h_1\) consists of a sequence of states, \((s_{1.1},s_{1.2})\), such that in state \(s_{1.1}\) there are no violations, but in state \(s_{1.2}\) both obligations \(O_1\) and \(O_2\) are violated. History \(h_2\) consists of a sequence of states, \((s_{2.1},s_{2.2})\), such that in state \(s_{2.1}\), \(O_1\) is violated, and in \(s_{2.2}\), \(O_2\) is violated. Assuming that violations of \(O_1\) and \(O_2\) are incomparable with respect to their severity, we might expect these two histories to be equally good (or bad). In the model proposed here, state \(s_{1.2}\) in history \(h_1\) would lie at a higher ranking level than either state \(s_{2.1}\) or \(s_{2.2}\) in \(h_2\), and hence \(h_2\) will be preferred to \(h_1\). The reason for this is that our objective is not to minimise the sanctions received as a result of norm violation, but to minimise the possible consequences of these violations. The goal of the norm analysis phase is to ensure that more severe consequences are associated with higher ranked states. Of course, this does not guarantee that any increase in ranking is associated with more severe consequences.
An alternative reward function for an NDecPOMDP may be defined that would result in histories \(h_1\) and \(h_2\) being assessed as equally good. We could, for example, rank all the obligations according to their severity using an adaptation of Algorithm 2, applied to the set of norms rather than the set of possible worlds. We may then give rewards to states that equate to, for each ranking level l, \(n \cdot \varepsilon ^{\Lambda - l}\), where n is the number of violations at level l. Histories \(h_1\) and \(h_2\) would then have the same reward. It is not clear, however, how we could capture the fact that violating contrary-to-duty norms should be considered less desirable than violating the corresponding primary norms, which is an important aspect of our model.
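Under this alternative reward, the example histories indeed come out equal. A small check, with an illustrative \(\varepsilon\) of our choosing and the per-state quantity treated as a penalty to be compared (the encoding of violation counts per level is ours):

```python
from fractions import Fraction

def alt_state_reward(violation_counts, epsilon=Fraction(1, 100)):
    """Alternative per-state reward sketched above: sum of n * eps^(Lambda - l).

    violation_counts[l] = number of obligations violated at severity rank l
    in this state, with l = 0 the most severe rank; Lambda = number of
    ranks. Exact rational arithmetic avoids floating-point surprises with
    tiny epsilon powers.
    """
    Lam = len(violation_counts)
    return sum(n * epsilon ** (Lam - l)
               for l, n in enumerate(violation_counts))

# h1: a clean state, then a state violating both O1 and O2 (one per rank).
h1 = alt_state_reward([0, 0]) + alt_state_reward([1, 1])
# h2: one state violating only O1, then one violating only O2.
h2 = alt_state_reward([1, 0]) + alt_state_reward([0, 1])
```

Both histories accumulate exactly \(\varepsilon^2 + \varepsilon\), so this reward cannot distinguish them, which is the behaviour argued for above.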
In our model, norm compliance is necessarily evaluated on single states. On the face of it, this restricts the types of norm that can be represented. In many domains, for example, obligations may include a deadline for fulfilment: temporally extended norms. Norms may also link individual actions, such as in separation-of-duty constraints where two actions must be performed by two different agents. In order to evaluate compliance with such norms, we must take into account sub-histories rather than individual states. It is possible, however, to directly extend our model to consider such norms by keeping track of the evolution of norm instances (activation, expiration, etc.) in each state, as discussed in previous research [16]. The cost is a (potentially significant) increase in the number of states, placing an additional burden on the planner.
In this paper, we have focussed exclusively on normative motives. These are social drivers of action, but autonomous agents may also be driven by individual goals. Individual goals may be encoded as obligations, but this would conflate compliance with social expectations and individual drives. Severity could be used to capture the relative importance of individual goals and norms if goals are expressed as obligations. It may, however, be more appropriate to make explicit the distinction between social norms and individual goals. We could then employ multiple-objective optimisation methods to manage the trade-off between remaining compliant with social expectations and satisfying individual goals. This is an avenue for future research, given that a suitable approach would need to account for the naturally qualitative nature of the reward function for norm-governed planning.
7 Conclusions
In the introduction, we claimed contributions both to modelling and practical reasoning in normative multiagent systems, and to algorithms for decentralised planning under uncertainty. For the former, we have presented what we believe to be the first endtoend model from the analysis of a domain where the behaviour of agents is governed by norms, through to a decentralised planning mechanism for multiple agents to act in concert such that they maximise their compliance with these norms. We consider normative system specifications that include guidance for recovering from violations (contrarytoduty obligations) and avoiding critical levels of failure (severity). The domain analysis mechanism proposed is guaranteed to generate a transitive and acyclic preference relation over possible worlds. This preference relation enables possible worlds to be ranked from the most to least compliant. This is then used to guide collective decision making in the presence of uncertainty, with the goal of maximising the expected compliance of states in an execution history.
The NDecPOMDP planning mechanism is an adaptation of DecPOMDPs for use with a qualitative reward function. Our greedy LP algorithm approximately solves an NDecPOMDP by starting with the problem of optimising against the highest levels of the reward function, adding constraints associated with lower levels until no significant improvement can be found. The most-critical-states (MCS) heuristic also exploits the qualitative structure of the reward function to guide planning effort. From the results obtained from evaluating this planning mechanism, we may reliably conclude that both the greedy LP and the MCS heuristic provide considerable savings in execution time without affecting the quality of the policies computed.
Footnotes
 1.
In van der Torre and Tan [29] and in our prior research [15], a model also includes an accessibility relation \(R\subseteq W \times W\) in order to evaluate temporal logic formulae. This is not necessary here because our aim is only to compute a ranking of possible worlds for use within a multiagent planning mechanism.
 2.
 3.
The random heuristics sample reachable states by simulating agents that each choose a random action at every step.
 4.
Given a real \(\gamma \), and an optimal policy with value \(V^*\), a \(\gamma \)-approximation of this policy is a policy with value \(V' \ge V^* - \gamma \).
 5.
Two norms are independent if neither is a contrary-to-duty obligation of the other.
Acknowledgements
This research was funded by Selex ES. The software developed during this research, including the norm analysis and planning algorithms, the simulator, and the harbour protection scenario used during evaluation, is freely available from doi: 10.5258/SOTON/D0139.
References
 1. Alechina, N., Dastani, M., & Logan, B. (2012). Programming norm-aware agents. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (pp. 1057–1064).
 2. Amato, C. (2014). Cooperative decision making. In M. J. Kochenderfer (Ed.), Decision making under uncertainty: Theory and application. Cambridge: MIT Press.
 3. Ashworth, A. (2006). Principles of criminal law (5th ed.). Oxford: Oxford University Press.
 4. Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4), 819–840.
 5. Bienvenu, M., Fritz, C., & McIlraith, S. A. (2006). Planning with qualitative temporal preferences. In Proceedings of the 10th International Conference on Knowledge Representation and Reasoning (pp. 134–144).
 6. Bonet, B., & Pearl, J. (2002). Qualitative MDPs and POMDPs: An order-of-magnitude approximation. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (pp. 61–68).
 7. Boutilier, C., Brafman, R. I., Domshlak, C., Hoos, H. H., & Poole, D. (2004). CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements. Journal of Artificial Intelligence Research, 21, 135–191.
 8. Brafman, R. I., Domshlak, C., & Shimony, S. E. (2006). On graphical modeling of preference and importance. Journal of Artificial Intelligence Research, 25, 389–424.
 9. Brafman, R. I., Shani, G., & Zilberstein, S. (2013). Qualitative planning under partial observability in multiagent domains. In Proceedings of the 27th AAAI Conference on Artificial Intelligence (pp. 130–137).
 10. Castelfranchi, C. (2003). Formalising the informal? Dynamic social order, bottom-up social control, and spontaneous normative relations. Journal of Applied Logic, 1(1–2), 47–92.
 11. Chisholm, R. M. (1963). Contrary-to-duty imperatives and deontic logic. Analysis, 24(2), 33–36.
 12. Dignum, F., Morley, D., Sonenberg, E. A., & Cavedon, L. (2000). Towards socially sophisticated BDI agents. In Proceedings of the 4th International Conference on Multi-Agent Systems (pp. 111–118).
 13. Fagundes, M. S., Billhardt, H., & Ossowski, S. (2010). Normative reasoning with an adaptive self-interested agent model based on Markov decision processes. In A. Kuri-Morales & G. R. Simari (Eds.), Advances in artificial intelligence: IBERAMIA 2010. Lecture notes in computer science (Vol. 6433, pp. 274–283). Berlin: Springer.
 14. Forrester, J. W. (1984). Gentle murder, or the adverbial Samaritan. The Journal of Philosophy, 81(4), 193–197.
 15. Gasparini, L., Norman, T. J., Kollingbaum, M. J., & Chen, L. (2015). Severity-sensitive robustness analysis in normative systems. In A. Ghose, N. Oren, P. Telang, & J. Thangarajah (Eds.), Coordination, organizations, institutions, and norms in agent systems X. Lecture notes in computer science (Vol. 9372, pp. 72–88). Berlin: Springer.
 16. Gasparini, L., Norman, T. J., Kollingbaum, M. J., & Chen, L. (2016). Decision-theoretic norm-governed planning. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (pp. 1265–1266).
 17. Gerevini, A., & Long, D. (2005). Plan constraints and preferences in PDDL3 (Vol. 75). Brescia, Italy: Department of Electronics for Automation, University of Brescia.
 18. Kagan, S. (1988). The additive fallacy. Ethics, 99(1), 5–31.
 19. Kahn, A. B. (1962). Topological sorting of large networks. Communications of the ACM, 5(11), 558–562.
 20. Meneguzzi, F., & Luck, M. (2009). Norm-based behaviour modification in BDI agents. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (pp. 177–184).
 21. Prakken, H., & Sergot, M. (1996). Contrary-to-duty obligations. Studia Logica, 57, 91–115.
 22. Prakken, H., & Sergot, M. (1997). Dyadic deontic logic and contrary-to-duty obligations. In D. Nute (Ed.), Defeasible deontic logic. Synthese library (Vol. 263, pp. 223–262). Berlin: Springer.
 23. Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. Hoboken: Wiley.
 24. Roijers, D. M., Whiteson, S., & Oliehoek, F. A. (2015). Point-based planning for multi-objective POMDPs. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (pp. 1666–1672).
 25. Seuken, S., & Zilberstein, S. (2007). Memory-bounded dynamic programming for DEC-POMDPs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (pp. 2009–2015).
 26. Soh, H., & Demiris, Y. (2011). Evolving policies for multi-reward partially observable Markov decision processes (MR-POMDPs). In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation (pp. 713–720). ACM.
 27. Spaan, M. T., & Vlassis, N. (2005). Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24, 195–220.
 28. Tijs, S. (2006). Lexicographic optimization on polytopes is linear programming. Discussion paper, Tilburg University, Center for Economic Research.
 29. van der Torre, L., & Tan, Y. H. (1999). Contrary-to-duty reasoning with preference-based dyadic obligations. Annals of Mathematics and Artificial Intelligence, 27(1–4), 49–78.
 30. Wray, K. H., & Zilberstein, S. (2015). Multi-objective POMDPs with lexicographic reward preferences. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (pp. 1719–1725).
 31. Wu, F., Jennings, N., & Chen, X. (2012). Sample-based policy iteration for constrained DEC-POMDPs. In Proceedings of the 20th European Conference on Artificial Intelligence.
 32. Wu, F., Zilberstein, S., & Chen, X. (2010). Point-based policy generation for decentralized POMDPs. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (pp. 1307–1314).
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.