Policy regularization for legible behavior

In this paper we propose a method to augment a Reinforcement Learning agent with legibility. The method is inspired by the literature on Explainable Planning and regularizes the agent's policy after training, without requiring any modification of its learning algorithm. This is achieved by evaluating how the agent's optimal policy may produce observations that would lead an observer model to infer a wrong policy. In our formulation, the decision boundary introduced by legibility affects the states in which the agent's policy returns an action that is non-legible because it also has high likelihood under other policies. In these cases, a trade-off is made between that action and a legible but sub-optimal action. We tested our method in a grid-world environment, highlighting how legibility impacts the agent's optimal policy, and gathered both quantitative and qualitative results. In addition, we discuss how the proposed regularization generalizes over methods designed for goal-driven policies, since it is applicable to general policies of which goal-driven policies are a special case.


Introduction
As widely agreed in Explainable Artificial Intelligence, well-functioning collaboration between humans and artificial agents requires transparency [1]. Agents should not only perform their assigned tasks efficiently and accurately, but should also make sure that the humans in their operative context understand their intentions and actions.
Facilitating intention recognition through behavior that is understandable by a human observer has several advantages [2]. For example, in human-robot interaction, signaling the robot's intention increases collaborators' trust in the robot, safety, and the fluency of interactions, because it helps collaborators predict what the robot is doing or will do [3][4][5]. In conditions of shared control, it makes it possible to mediate, arbitrate, and guide the interaction [6] by informing the user about the robot's intended action. In applications for autonomous vehicles, simple solutions that improve the driver's understanding of the car's intentional state, such as sharing its goal, are sufficient to increase the trustworthiness and acceptability of the autonomous driving system, as well as acceptance of higher levels of automation [7]. In addition, recent developments in technologies for virtual or mixed reality are further enabling and enhancing methods for intentionality in physical robots, by making it possible to plot and manipulate the robots' intentional states in the virtual 3D world [8].
Given the importance of intentions during interactions with artificial agents, it is becoming relevant to combine methods for expressing intentions with techniques that generate high-performing behavior. The online creation of behavior whose intention is easily discernible, or that is furnished with congruent explanations, is addressed in Explainable Planning under the umbrella of interpretable behavior, where several methods to regularize behavior for explicability [9], predictability [10] or legibility [11,12] have been proposed. These techniques implicitly communicate the agent's intention by making it transparent to the user, in contrast with explanations, which are an explicit form of communication. Transparency is achieved by interacting with a model of the observing user. For example, legibility skews plan trajectories such that their goal is easily discernible, explicability makes sure that observations have at least one associated complete plan, and predictability reduces the number of possible future trajectories.
While a substantial number of formalizations of interpretable behavior exist in the Explainable Planning literature, there is very little related work for the framework of Reinforcement Learning (RL). RL has been shown to produce powerful agents for a variety of domains (including robotics, games and recommender systems), often surpassing human performance. However, the RL framework still lacks a formalization of how to create interpretable agents as intended in Explainable Planning, and mostly borrows its definition of interpretability from the Machine Learning (ML) literature. This definition is more concerned with making the decisions taken by the algorithm explainable to a domain expert upon inspection in an offline setting than with enabling interpretability online during collaboration, and is therefore unsuitable for fulfilling the transparency needs of online interactions.
There is therefore still a large untapped potential in adapting methods for interpretability to RL. This would also provide valuable input for research in explainability, which at the moment treats advanced methods such as those based on neural networks mostly as black boxes generating behavior that is optimal yet highly uninterpretable from a human perspective [13]. To this purpose, in this paper we translate the legibility criterion from Explainable Planning to the RL framework as a measure of the discernibility of a policy, which we loosely equate with the agent's intention. As we propose, injecting legibility into an agent's policy does not require modifying components of the learning algorithm. We rather suggest evaluating how the optimal policy may produce state-action pairs that would make the observer infer a wrong policy, and then finding a trade-off that minimizes those while remaining consistent with the original policy.

Background
Since RL borrows the term "interpretability" mostly from the ML literature [14,15], merging the terminology from Explainable Planning and Reinforcement Learning could create some confusion. In ML, interpretability generally means providing insight into the agent's mechanisms such that its decisions are understandable by an expert upon inspection [15]. This can be achieved by first translating the classifier's latent features responsible for its decisions into a space that is interpretable, and then computing explanations in that space [16]. In RL, for example, [17] proposes to use attention to visualize which features the deep Q-network attends to when taking decisions, while [18] trains linear tree models on deep Q-networks to obtain corresponding interpretable models. See [14] for a survey of this type of technique applied to RL.
These techniques for interpretability have been shown to be useful in many ML application domains by giving insight into models' decisions. They have, for example, been successful in health-care [19] and societal applications (e.g., decisions regarding loans, hiring, risks, etc.). However, they may be less suitable in domains characterized by real-time interaction, such as human-robot interaction, where the fluency of the interaction prohibits deep inspection of the decision-making algorithm. Also, while the produced explanations in terms of relevant features could be understood by an expert, they may be unsuitable for users who are uninformed about the underlying models and more focused on common sense reasoning. People are in general very good at forming hypotheses about the intentions and beliefs that explain an observed behavior, through what is referred to as theory of mind reasoning [20]. However, it has been commonly shown that the behavior of advanced agents operating at human level, such as in competitive games, is often beyond human intuition and highly inexplicable [21,22]. Especially for such cases, but also in general, it is therefore necessary to regularize artificial agents towards behaviors compatible with common sense reasoning, while maintaining their high performance.
To this end, in this paper we refer to interpretability as intended in planning, where an agent's behavior is interpretable when an observer can easily discern what the agent is doing by understanding its intention [23]. When applied to RL as well, this definition conforms better to real-time interaction in the presence of an observer that could be either passive or part of a larger collaborating agent, such as a human. As previously introduced, in this context a multitude of definitions capturing smaller aspects of interpretability have been used. Each aspect expresses a different type of expectation that a potential observer has of the agent, such as expectations about its goal [11], expectations about entire future trajectories [10], or expectations towards a communication model [24]. While there is a lot of variety in the models and theories leveraged by all these techniques, it can be generally shown that this set of methods requires an expectation model that is a second-order theory of mind focused on the observer's inferences about the agent [2,23], and that interpretable behavior can be seen as minimizing the distance between the estimated model possessed by the observer and the true model of the agent (see Figure 1). The agent's behavior is interpretable whenever it conforms with the expectations cast by the second-order model, and uninterpretable when it does not [2].
In agent applications, the second-order theory of mind is the model that the agent thinks the observer is using to interpret its behavior, and it can take many forms. For example, in [10] it is a label predicting whether a human observer understands the agent, while in [25] it is a complete planning model. In general, simple observer models are easier to keep aligned with the actual expectations of the user, while more structured ones make it possible to simulate the inferences of the observer in greater detail. Also, structured models can be selectively changed through a reconciliation process [25], ultimately allowing the agent to autonomously re-align its model with the observer's whenever it detects the need.
To the best of our knowledge, very little work exists in RL relating to interpretable behavior as just described. Both [26,27] propose methods relying on a transposition of the original formulation of legibility. These methods are applicable only to goal-driven policies, thus excluding all other types of policies available in various RL frameworks. In addition, they require specifying distance measures between states which, while easy for manipulators working in Cartesian space, can be difficult to define for arbitrary state spaces.
Rather than relying on goal locations, we define a legibility criterion that is directly applicable to policies. A regularization method similar to ours is proposed in works on offline policy learning [28][29][30], where during training the agent's on-policy behavior is regularized towards another behavior. Our method can be seen as a specific application within this class of methods, where the policy is regularized towards the legible policy.

Method
The main goal of interpretable behavior is to bring the intention predicted by the observer's model close to the intention of the agent, and to maintain such closeness over time. Consistently with the definition of a legible intention, we define a legible policy as follows: an agent's policy is legible if it is discernible from a set of other policies. It is useful to work with this definition because it reflects the general case where an observer is attempting to understand which policy the agent is currently enacting among a set of candidates. Furthermore, the definition does not constrain the type of policy and can be applied to arbitrary policies. The goal of legibility is therefore to help the observer infer the correct policy from the set of those being considered. For this case we hypothesize an observer watching the agent and inferring the policy it is currently pursuing.
Fig. 2 Agent model and second-order theory of mind as equivalent Bayesian Networks. The networks model how agent and observer respectively select and infer actions using the current state and a set of predefined policies, while the function H measures the distance between these two processes.
The agent can simulate the presence of an observer by implementing a second-order theory of mind that models the expectations the observer uses to infer intentions. To implement the second-order theory of mind we adopt a middle way between the expressiveness of a complete agent model and the simplicity of a hand-crafted solution. This model for theory of mind reasoning, which we refer to as the Mirror Agent Model, describes agent and observer models as two equivalent Bayesian networks denoted $P_R$ and $P^H_R$ (Figure 2). $P_R$ determines how the agent acts, while $P^H_R$ is the observer's model of how the agent acts. Since the real observer model is part of the observer, it is not directly accessible by the agent. For all computations the agent must therefore rely on the estimated model $P^H_R$, the second-order theory of mind. To simplify notation, in the following we make no distinction between these two entities and use observer model and second-order theory of mind interchangeably.
The Bayesian networks are structurally the same and describe the agent as a Markov Decision Process (MDP) with multiple possible policies. However, the random variables ($\Pi$, $S$ and $A$) can be distributed differently in $P_R$ compared to $P^H_R$, depending on the agent's reasoning and prior information about the observer. A simplifying assumption of this model is that the user internalizes an agent model with the same structure as the true agent model. While this assumption may not hold in the general case, it can, for example, be achieved by communicating the agent model, or by performing model alignment dialogues with the goal of communicating the latent variables that the agent uses to act.
We assume that the agent has a fixed set of pre-trained policies identified by the random variable $\Pi = \{\pi_0, ..., \pi_n\}$. Notably, among these there is the currently pursued policy $\pi_R$, with $P_R(\Pi = \pi_R) = 1$. Initially, the observer is modelled as ignorant of which policy the agent is pursuing, leading to a uniform prior over the policies: $\forall i\; P^H_R(\pi_i) = k$, $k = \frac{1}{|\Pi|}$. When using Q-learning, two corresponding Q-value tables $Q_R(\pi, s, a)$ and $Q^H_R(\pi, s, a)$ determine, respectively, the probability distribution for the agent selecting actions, with $P_R(a|\pi, s) = f(Q_R(\pi, s, a))$, and for the observer inferring the agent's actions, with $P^H_R(a|\pi, s) = g(Q^H_R(\pi, s, a))$. The Q-value tables can be obtained using any of the available RL methods, while f and g are arbitrary functions that transform Q-values into probability distributions over actions, for example the Boltzmann or the $\epsilon$-greedy distributions [31].
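As an illustration of the functions f and g, the following minimal sketch converts the Q-values of one policy in one state into an action distribution, using either a Boltzmann (softmax) or an ε-greedy rule, and builds the uniform prior over policies. The function names, the temperature parameter and the example Q-values are assumptions made for this sketch and are not taken from the experiments.

```python
import math

def boltzmann(q_values, temperature=1.0):
    """Softmax over Q-values; a lower temperature gives a greedier distribution."""
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def epsilon_greedy(q_values, epsilon=0.1):
    """Probability 1 - epsilon on the greedy action, epsilon spread uniformly."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

def uniform_prior(n_policies):
    """Observer prior k = 1 / |Pi| for every policy."""
    return [1.0 / n_policies] * n_policies

# Hypothetical Q-values for three actions in a single state.
q = [1.0, 2.0, 0.5]
p_boltz = boltzmann(q, temperature=0.5)
p_eps = epsilon_greedy(q, epsilon=0.1)
prior = uniform_prior(3)
```

Both functions return a valid distribution over actions, so either one could play the role of f for the agent or g for the observer.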
To be legible, the agent should select actions that communicate its policy $\pi_R$ to the observer, while avoiding communicating the others. This is obtained by selecting actions based on how they reduce the distance between the probability distribution over the agent's policies, $P_R(\Pi)$, and the corresponding distribution $P^H_R(\Pi|s, a)$ that the observer infers given an observed state-action pair. As distance measure we use the cross-entropy between these two distributions (Eq. 1). Since the action probabilities in Q-learning depend on the Q-values, we can use Eq. 1 to define regularized versions of the Q-values (Eq. 2), with $\alpha > 0$ determining the magnitude of regularization. In this way, the right part of Eq. 2 regularizes the resulting policy such that the selected actions aim at a small distance between the agent's policy and the policy inferred by the observer. Equation 1 expresses that the decision boundary introduced by legibility affects the states in which $\pi_R(s)$ returns an action that also has high probability under other policies. In these cases, a trade-off is made between such actions and sub-optimal but legible actions.
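The regularization described above can be sketched as follows, under explicitly stated assumptions. The observer's posterior over policies is taken to follow Bayes' rule, $P^H_R(\pi|s,a) \propto P^H_R(a|\pi,s)\,P^H_R(\pi)$. Because $P_R(\Pi)$ puts all its mass on $\pi_R$, the cross-entropy reduces to $-\log P^H_R(\pi_R|s,a)$. The subtractive form $Q(\pi_R,s,a) - \alpha \cdot$ cross-entropy used here is one plausible reading of the regularized Q-values, not a confirmed transcription of Eq. 2. The Boltzmann observer and all Q-values are hypothetical.

```python
import math

def softmax(q_values, temperature=1.0):
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def observer_posterior(q_observer, prior, action):
    """P^H_R(pi | s, a) by Bayes' rule for one fixed state s.
    q_observer[i] holds the observer's Q-values of policy i in that state."""
    likelihoods = [softmax(q)[action] for q in q_observer]
    joint = [l * p for l, p in zip(likelihoods, prior)]
    z = sum(joint)
    return [j / z for j in joint]

def legible_q_values(q_agent, q_observer, prior, true_policy, alpha=1.0):
    """Assumed form: Q_leg(a) = Q(a) - alpha * (-log P^H_R(pi_R | s, a))."""
    regularized = []
    for action, q in enumerate(q_agent[true_policy]):
        posterior = observer_posterior(q_observer, prior, action)
        cross_entropy = -math.log(posterior[true_policy])
        regularized.append(q - alpha * cross_entropy)
    return regularized

# Hypothetical state with two policies and three actions.
q_tables = [
    [1.0, 0.9, 0.0],  # policy 0: the agent's policy pi_R
    [1.0, 0.0, 0.0],  # policy 1: a competing policy
]
prior = [0.5, 0.5]
q_leg = legible_q_values(q_tables, q_tables, prior, true_policy=0, alpha=1.0)
greedy_original = max(range(3), key=lambda a: q_tables[0][a])
greedy_legible = max(range(3), key=lambda a: q_leg[a])
```

In this toy state the optimal action 0 is equally attractive under the competing policy, so the regularized values favour action 1, which is slightly sub-optimal but far more likely under $\pi_R$ than under the alternative.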

Experiments and evaluation
We tested and evaluated the proposed model with two experiments. The first is an illustrative example in a gridworld setting and is intended to provide insight into how the legible policy modifies the original policy. The second experiment is more extensive and is performed with a Deep Q-Network.
Fig. 3 Left (Original): policies for the three goals (red dots) learned with Q-learning. Right (Legible): legible policies. The legible policies avoid ambiguity of goal location.

Grid-world experiment
In this experiment we tested the proposed method on a gridworld scenario. The grid is 7x7 and without obstacles. There are 3 possible goals at the corners, for which we trained three corresponding policies with Q-learning. For simplicity we set $Q_R = Q^H_R$ and $f = g$, meaning that the agent assumes the observer to use the same Q-values and derived action probabilities as its own, i.e., $\forall i\; P_R(A|\pi_i, S) = P^H_R(A|\pi_i, S)$. This has the advantage of not requiring a model of how the observer models the task, which is costly to obtain. The regularization factor $\alpha$ was set to 1. However, nothing prohibits the use of different Q-values for the observer. In such cases, the agent would be evaluated against a different set of policies than those it possesses.
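The grid-world setting can be sketched as follows. This is a minimal illustration under several assumptions: the three goal corners, the reward of -1 per step, the Boltzmann observer and the subtractive form of the regularized Q-values are all hypothetical, and deterministic Q-value iteration is used in place of sampled Q-learning so that the result is reproducible.

```python
import math

SIZE = 7
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOALS = [(0, 0), (0, 6), (6, 6)]  # hypothetical choice of three corners

def step(state, action):
    """Deterministic move; bumping into a wall leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < SIZE and 0 <= nc < SIZE:
        return (nr, nc)
    return state

def q_table_for_goal(goal, sweeps=30):
    """Q-value iteration with reward -1 per step and a terminal goal (assumed)."""
    states = [(r, c) for r in range(SIZE) for c in range(SIZE)]
    value = {s: 0.0 for s in states}
    for _ in range(sweeps):
        value = {
            s: 0.0 if s == goal else max(-1.0 + value[step(s, a)] for a in ACTIONS)
            for s in states
        }
    return {s: {a: -1.0 + value[step(s, a)] for a in ACTIONS} for s in states}

def softmax(qs):
    m = max(qs.values())
    exps = {a: math.exp(q - m) for a, q in qs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def legible_action(q_tables, true_idx, state, alpha=1.0):
    """Greedy action under the assumed regularized Q-values (uniform prior)."""
    probs = [softmax(q[state]) for q in q_tables]
    best, best_score = None, -math.inf
    for a in ACTIONS:
        # With a uniform prior the prior cancels out of Bayes' rule.
        posterior = probs[true_idx][a] / sum(p[a] for p in probs)
        score = q_tables[true_idx][state][a] + alpha * math.log(posterior)
        if score > best_score:
            best, best_score = a, score
    return best

q_tables = [q_table_for_goal(g) for g in GOALS]
center = (3, 3)
original = max(ACTIONS, key=lambda a: q_tables[0][center][a])
legible = legible_action(q_tables, 0, center)
```

From the center of the grid, moving up and moving left are equally good for the top-left goal, but moving up is also optimal for the top-right goal. Under these assumptions the regularized choice is therefore to move left, towards the wall that only the top-left goal touches.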
Figure 3 shows the optimal policies learned by the agent in the left column, and the corresponding legible policies obtained using $\alpha = 1$ in the right column. The learned policies move towards a wall adjacent to the goal, and then approach the goal by walking along the wall. However, to be legible, it is important to approach the correct wall, namely the one that disambiguates the goal location. The legible policies systematically approach an unambiguous wall. Notice also how for $g_1$, the legible policy makes the agent walk in the middle to avoid approaching the other goals.

Deep Q-Network experiment
In the second experiment we used OpenAI Gym [32]. We designed a simulated environment in which the agent had to pass through tunnels of length L and width W, composed of C + 2 types of cells: empty cells, obstacle cells, and C types of cells of different colors (see Figure 4). The agent was defined to see a maximum of S cells in front of it and had 3 possible actions: move one cell up, move one cell down, or stay at the same position. If the agent moves to a cell of its own color it receives a reward of +1, while if it moves to an obstacle it receives a punishment of −10 and a new episode starts. Moving to an empty cell or to a cell of a color different from its own does not result in any reward or punishment. The environment is not goal-oriented but rather defines regions of reward and of punishment for the agent. These regions can be of arbitrary shape; we used rectangles for colored regions and squares or lines for obstacles.
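The reward rules of the tunnel environment can be sketched as a single transition function. This is a simplified illustration and not the Gym environment used in the experiments: the integer cell encoding, the assumption that the agent advances one column per step, and the clamping of vertical moves at the tunnel walls are all hypothetical choices.

```python
EMPTY, OBSTACLE = 0, -1  # colors are encoded as integers 1..C (assumed encoding)
MOVES = {"up": -1, "down": 1, "stay": 0}

def tunnel_step(tunnel, row, col, action, own_color):
    """Advance one column (assumption) and apply the vertical move.
    Returns (new_row, new_col, reward, done)."""
    width = len(tunnel)
    length = len(tunnel[0])
    new_row = min(max(row + MOVES[action], 0), width - 1)  # clamp at the walls
    new_col = col + 1
    cell = tunnel[new_row][new_col]
    if cell == OBSTACLE:
        return new_row, new_col, -10.0, True  # punishment, the episode ends
    reward = 1.0 if cell == own_color else 0.0  # other colors and empty cells give nothing
    done = new_col == length - 1  # reached the end of the tunnel
    return new_row, new_col, reward, done

# Hypothetical 3 x 4 tunnel: color 1 is the agent's color, color 2 belongs to another policy.
tunnel = [
    [0, 1, 2, 0],
    [0, 0, -1, 0],
    [0, 2, 0, 0],
]
r_own = tunnel_step(tunnel, 1, 0, "up", own_color=1)
r_hit = tunnel_step(tunnel, 1, 1, "stay", own_color=1)
r_other = tunnel_step(tunnel, 1, 0, "down", own_color=1)
```

The three calls cover the three reward cases in the text: stepping on the agent's own color, hitting an obstacle, and stepping on another policy's color.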
Fig. 4 Sampled tunnel environment. While traversing a tunnel the agent is rewarded for walking on cells of its own color (green). Hitting an obstacle (teal) instead punishes the agent and resets the episode.
Since the agent is unaffected by cells of a color different from its rewarding color, to simplify the learning process it was trained on tunnels containing only one color and obstacles. Later, tunnels containing C colors are obtained by using C tunnels sharing obstacles and agent position. Inside a single-color tunnel, at every timestep the observation corresponds to a set of three matrices $M_0$, $M_1$, $M_2$ of size $W \times S$, each representing a slice of the tunnel up to the agent's sight distance. The first matrix contains only colored cells, the second only obstacles and the third the agent's position. Inside every matrix, each cell is characterized by the summation of three embedding vectors, $w_i + s_j + t_{ij}$, where $w_i$ and $s_j$ are position embeddings identifying the cell inside the matrix (for example, $w_0$, $s_5$ indicates cell 0-5), while $t_{ij}$ identifies whether that cell is occupied.
Fig. 5 Q-network for the tunnel environment. φ: convolutional network shared by the three inputs. ψ: fully connected network.
Figure 5 shows the employed Q-network. In the network, φ is a convolutional network which convolves over the matrices of embeddings, and is shared by all the inputs $M_0$, $M_1$ and $M_2$. ψ is a fully connected network that takes as input the vector $(\phi(M_0), \phi(M_1), \phi(M_2))$ and outputs a vector of size 3 for the Q-values.
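The per-cell embedding sum described above can be sketched in plain Python. The sketch assumes that $t_{ij}$ is looked up from two vectors, one for an unoccupied and one for an occupied cell, which is an interpretation of the text. The embedding dimension and all numeric values are hypothetical and chosen only to make the sums easy to read.

```python
def embed_matrix(occupancy, w_emb, s_emb, t_emb):
    """Return a W x S grid whose cell (i, j) holds the vector w_i + s_j + t_ij.
    occupancy[i][j] is 0 or 1 and selects t_emb[0] or t_emb[1] (assumed)."""
    return [
        [
            [w + s + t for w, s, t in zip(w_emb[i], s_emb[j], t_emb[occ])]
            for j, occ in enumerate(row)
        ]
        for i, row in enumerate(occupancy)
    ]

# Hypothetical setting: W = 2 rows, S = 3 columns, embedding dimension 2.
w_emb = [[1, 0], [2, 0]]              # row position embeddings w_i
s_emb = [[0, 10], [0, 20], [0, 30]]   # column position embeddings s_j
t_emb = [[0, 0], [100, 100]]          # occupancy embeddings
occupancy = [
    [0, 1, 0],
    [0, 0, 1],
]
cells = embed_matrix(occupancy, w_emb, s_emb, t_emb)
```

One such grid would be built for each of the three matrices before they are passed to the shared convolutional network.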
We trained the agent on 30000 random, single-color tunnels of length 200 and width 12 cells, while the agent's observation window was set to 20 cells. For every tunnel, 5 colored rectangles and 10 obstacles shaped as squares or lines were randomly placed. As previously mentioned, after training, a tunnel with C colors was obtained by merging C tunnels, each containing only cells of the respective color, while all sharing the same obstacles and agent position. In this way, at each step the agent has C different policies to follow, each one seeking a particular color. This is equivalent to training C different policies simultaneously.

Quantitative Evaluation
We tested the proposed method for legible policies in a setting where both agent and observer use the same Q-function (the trained Q-network) and the greedy policy, which always selects the action with the highest Q-value. Since the introduced regularization penalizes actions with high probability under other policies, we expected the agent to avoid cells of colors that are not its own. In other words, since the observer model judges the agent's behavior by comparing it against policies that seek cells of given colors, by avoiding cells of other colors the agent decreases the probability of those policies in the observer's inferences.
We tested this hypothesis first quantitatively by measuring the average gathered reward over 200 episodes, while using increasing values for the regularization factor $\alpha$. Every random tunnel had C = 4 colors, 5 rectangular colored patches for each color, and 10 square obstacles. In this setting we measured the reward gathered by the agent when pursuing the color $C_0$, and the average reward for the other colors $C_{1..3}$ accumulated while pursuing $C_0$. We then divided these scores by the maximum rewards that the policy could have gathered for the corresponding colors, thereby obtaining a reward ratio with values between 0 and 1. For example, a reward ratio of 0.5 means that the agent accumulated half of the maximum possible reward. As a complement to the reward ratio, the success rate was computed as the probability of succeeding, i.e. reaching the end of the tunnel without hitting any obstacles during an episode. Table 2 summarizes this experiment. Table 3 instead summarizes the degree of legibility of the agent's policy, measured as the expected probability that the observer model assigns to the agent's policy over the episodes, where every state transition is given equal weight $p(s, a) = \frac{1}{|E|}$. The second row of the table shows the gain in legibility obtained by using the legible policy rather than the original.
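The three measures described above can be sketched as simple helper functions. The function names and the example numbers are hypothetical. The legibility measure is written as the plain average of the observer's posterior for the agent's policy over the observed transitions, which is what equal weights of $1/|E|$ imply. How the maximum attainable reward is computed is not specified here, so it is passed in as an input.

```python
def reward_ratio(gathered, maximum):
    """Fraction of the maximum attainable reward that was actually gathered."""
    return gathered / maximum if maximum > 0 else 0.0

def success_rate(episode_outcomes):
    """Share of episodes that reached the end of the tunnel without a collision."""
    return sum(1 for ok in episode_outcomes if ok) / len(episode_outcomes)

def policy_legibility(posteriors_of_true_policy):
    """Mean of P^H_R(pi_R | s, a) over all |E| transitions, each weighted 1/|E|."""
    return sum(posteriors_of_true_policy) / len(posteriors_of_true_policy)

# Hypothetical numbers, for illustration only.
ratio = reward_ratio(12.0, 24.0)
success = success_rate([True, True, False, True])
legibility = policy_legibility([0.25, 0.5, 0.75])
```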

Qualitative Evaluation
Figure 6 shows the effect of regularization on two sampled tunnels. In the plots, red is the rewarding color of the agent and obstacles are in brown. The trajectories in yellow are obtained by simulating and averaging 200 trials. In addition, to better understand the effect of regularization, the agent's behaviors for three different configurations of colored regions and increasing factor $\alpha$ are plotted in Figure 7. The plots have two colors: red as reward color for the agent's policy, and blue as reward color for a different policy. Legibility clearly skews the trajectories such that they pass farther away from non-red cells, by an amount that grows with $\alpha$. Notice also how in Figure 7 regularization becomes detrimental for values of $\alpha$ that are too high. In such cases, the agent's original policy of walking over red cells is dominated by the regularization to avoid blue cells, and in some cases the agent is not able to pass over any red cells even if there are no obstacles.
Fig. 6 Qualitative results for two sampled tunnels. The left column shows the agent's optimal learned behavior, while the right column shows its regularized version.
Fig. 7 Qualitative results for three types of positions of reward regions and increasing levels of α.

Discussion
Our quantitative results indicate that the proposed regularization for legibility is effective in making the observer model discriminate the agent's true policy. This is highlighted in Table 2, where we can see that the reward ratio for colors different from $C_0$ decreases as $\alpha$ increases, signifying that the agent avoids regions with colors different from its own. The qualitative results also confirm this observation, by showing that as $\alpha$ increases so does the effort of the regularized policy to avoid other colors. We also found that high values of $\alpha$ are detrimental both in terms of accumulated reward and success rate of the episodes. The reason is that the agent is regularized so strongly to avoid other policies that its original policy is overridden rather than regularized. However, the agent incurs a noticeable loss in accumulated reward or success rate only for high regularization factors.
The general results confirm our originally formulated hypothesis that legibility increases the probability of the agent's policy in the observer's model by making the agent avoid rewarding regions of other policies. This is an emergent behavior that was not coded in the equations, and in our experiments it represented a generalization over goal-driven solutions for legibility, because goal-driven policies are a special case of arbitrary policies in which the reward is placed at goal locations.
Furthermore, we note that our results are obtained with a simple observer model that has a uniform prior over policies and considers the same policies that are available to the agent. Because our results are qualitatively similar to those reported in earlier works on legibility [11,26], i.e. legible trajectories are skewed to avoid other goal locations, this suggests that a similarly uniformly initialized observer model was implicitly used in those papers as well. However, in contrast with previous solutions, the proposed method makes it easy to account for differences between agent and observer models, namely by using different corresponding networks. This was not possible in previous methods because the observer was fixed.

Conclusion
In this paper we introduced a model for regularizing a reinforcement learning agent for legibility. In our formulation we propose a legibility criterion that induces an observer model to disambiguate the agent's intention from a set of others, with intentions being implemented as policies. We suggest that, rather than modifying the learning procedure of the agent, we can wrap a previously learned set of policies in a pair of Bayesian Networks that model agent and observer respectively. The coupled networks describe a setting of second-order theory of mind that, by reasoning on how the observer infers policies, increases the discrimination between the agent's true policy and other candidate policies.
We evaluated the method on an illustrative example showing how legibility impacts the decision boundary of the agent, and on a Deep-RL example. In general, our model is successful at increasing the legibility of trajectories without incurring losses for the agent when the regularization factor is kept at a reasonable level. Furthermore, our qualitative results show that the obtained trajectories are arced similarly to those obtained in earlier work on Explainable Planning, with the main difference that legibility is computed on reward regions rather than goal states.
The proposed method introduces two relevant degrees of freedom in legibility. The first is that legibility is computed with respect to reward regions rather than goal locations. This makes it possible to regularize arbitrary policies, and especially those that can run indefinitely. Policies of this type cannot be regularized by methods relying on the original formulation of legibility, because these need a goal state. The second degree of freedom is the possibility of decoupling agent and observer models. This makes it possible to specify that the observer uses a different reward distribution, and that legibility is to be computed against that distribution rather than the agent's. This decoupling is not easy to implement using previous methods relying on distance measures, because it would require specifying how the observer measures distances on the state space.
Since the agent's learning algorithm is unmodified, it is straightforward to apply our method to arbitrary problems and types of agents. Even though we could not test it on extensive test beds of agents and problems, it is reasonable to expect that problems effectively captured as MDPs can be regularized without major additional implementation work.

Declarations
• The author declares that there are no conflicts of interest associated with this study.
• This study did not involve human participants or animals.
• This study was funded by the Department of Computing Science of Umeå University, Universitetstorget 4, Umeå, 901 87, Sweden.

Fig. 1 $P_R$: an RL agent interacting with its environment. $P^H_R$: model of the expectations that an observer has about the agent. The goal of interpretable behavior is to keep the distance $|P_R - P^H_R|$ low, signifying that the agent's behavior effectively matches the observer's expectations.

Table 1
Table 1 shows the network's hyperparameters used for training the Q-network. Hyperparameters of the Q-network for training the agent.

Table 2
Average accumulated reward ratio of the policies for color $C_0$ and colors $C_{1..3}$ for increasing values of $\alpha$. The row Success indicates the probability of completing a tunnel without hitting obstacles.

Table 3
Policy legibility for increasing values of $\alpha$.