Policy Space Identification in Configurable Environments

We study the problem of identifying the policy space of a learning agent, having access to a set of demonstrations generated by its optimal policy. We introduce an approach based on statistical testing to identify the set of policy parameters the agent can control, within a larger parametric policy space. After presenting two identification rules (combinatorial and simplified), applicable under different assumptions on the policy space, we provide a probabilistic analysis of the simplified one in the case of linear policies belonging to the exponential family. To improve the performance of our identification rules, we frame the problem in the recently introduced framework of the Configurable Markov Decision Processes, exploiting the opportunity of configuring the environment to induce the agent to reveal which parameters it can control. Finally, we provide an empirical evaluation, on both discrete and continuous domains, to prove the effectiveness of our identification rules.


Introduction
Reinforcement Learning (RL, Sutton and Barto 2018) deals with sequential decision-making problems in which an artificial agent interacts with an environment by sensing perceptions and performing actions. The agent's goal is to find an optimal policy, i.e., a prescription of actions that maximizes the (possibly discounted) cumulative reward collected during its interaction with the environment. The performance of an agent in an environment is constrained by its perception and actuation possibilities, along with its ability to map observations to actions. These three elements define the agent's policy space. Agents with different policy spaces could display different optimal behaviors, even in the same environment. Therefore, the notion of optimality is necessarily connected to the agent's policy space. While in tabular RL we typically assume access to the full (and finite) space of Markovian stationary policies, in continuous control, the policy space needs to be limited. In policy search methods (Deisenroth, Neumann, and Peters 2013), the policy is explicitly modeled considering a parametric functional space (Sutton et al. 1999; Peters and Schaal 2008) or a kernel space (Deisenroth and Rasmussen 2011; Levine and Koltun 2013); but also in value-based RL, a function approximator induces a set of representable (greedy) policies.
The knowledge of the agent's policy space could be of crucial importance when the learning process involves the presence of an external supervisor. Recently, the notion of Configurable Markov Decision Process (Conf-MDP, Metelli, Mutti, and Restelli 2018) has been introduced to account for the real-world scenarios in which it is possible to exercise a, possibly partial, control over the environment, by means of a set of environmental parameters (e.g., Silva, Melo, and Veloso 2018; Silva et al. 2019). This activity, called environment configuration, can be carried out by the agent itself or by an external supervisor. While previous works focused on the former case (e.g., Metelli, Ghelfi, and Restelli 2019), in this paper, we explicitly consider the presence of a supervisor who acts on the environment with the goal of finding the most suitable configuration for the agent.
Intuitively, the best environment configuration is intimately related to the possibilities of the agent in terms of policy space. For instance, in a car racing problem, the best car configuration depends on the car driver and has to be selected, by a track engineer (the supervisor), according to the driver's skills. Thus, the external supervisor has to be aware of the agent's policy space. Besides the Conf-MDPs, there are other contexts in which knowing the policy space can be beneficial, such as Imitation Learning, i.e., the framework in which an agent learns by observing an expert (Osa et al. 2018). In behavioral cloning, where recovering an imitating policy is cast as a supervised learning problem (Argall et al. 2009), knowing the expert's policy space means knowing a suitable hypothesis space, preventing possible over/underfitting phenomena. However, also Inverse Reinforcement Learning algorithms (IRL, Ng and Russell 2000), whose goal is to retrieve a reward function explaining the expert's behavior, can gain some advantages. In particular, the IRL approaches based on the policy gradient (e.g., Pirotta and Restelli 2016;Metelli, Pirotta, and Restelli 2017;Tateo et al. 2017) require a parametric representation of the expert's policy, whose choice might affect the quality of the recovered reward function.
In this paper, motivated by the examples presented above, we study the problem of identifying the agent's policy space in a Conf-MDP, by observing the agent's behavior and, possibly, exploiting the configuration opportunities of the environment. We consider the case in which the policy space of the agent is a subset of a known super-policy space Π_Θ induced by a parameter space Θ ⊆ R^d. Thus, any policy π_θ is determined by a d-dimensional parameter vector θ ∈ Θ. However, the agent has control over a smaller number d* < d of parameters (which are unknown), while the remaining ones have a fixed value, namely zero. Our goal is to identify the parameters that the agent can control (and possibly change) by observing demonstrations of the optimal policy π*. It is worth noting that the formulation based on the identification of the parameters effectively covers the limitations of the policy space related to perception, actuation, and mapping. To this end, we formulate the problem as deciding whether each parameter θ_i, for i ∈ {1, ..., d}, is zero, and we address it by means of a statistical test. In other words, we check whether there is a statistically significant difference between the likelihood of the agent's behavior with the full set of parameters and the one in which θ_i is set to zero. In such a case, we conclude that θ_i is not zero and, consequently, that the agent can control it. On the contrary, either the agent cannot control the parameter or zero is the value consciously chosen by the agent.
Indeed, there could be parameters that, given the peculiarities of the environment, are useless for achieving the optimal behavior or whose optimal value is actually zero, while they could prove to be essential in a different environment. For instance, in a grid world where the goal is to reach the right edge, the vertical position of the agent is useless, while if the goal is to reach the upper right corner both horizontal and vertical positions become relevant. In this spirit, configuring the environment can help the supervisor in identifying whether a parameter set to zero is actually uncontrollable by the agent or just useless in the current environment. Thus, the supervisor can change the environment configuration ω P Ω, so that the agent will adjust its policy, possibly changing the parameter value and revealing whether it can control such parameter. Thus, the new configuration should induce an optimal policy in which the considered parameters have a value significantly different from zero. We formalize this notion as the problem of finding the new environment configuration that maximizes the power of the statistical test and we propose a surrogate objective for this purpose.
It is worth emphasizing that we use the Conf-MDP notion for two purposes. First, we propose the problem of learning the optimal configuration in a Conf-MDP as a motivating example in which the knowledge of the policy space is valuable. Second, we use the environment configurability as a tool to improve the identification of the policy space.
The paper is organized as follows. In Section 2, we introduce the necessary background. The identification rules to perform parameter identification in a fixed environment are presented in Section 3 and analyzed in Section 4. Section 5 shows how to improve them by exploiting the environment configurability. Finally, the experimental evaluation, on discrete and continuous domains, is provided in Section 6. The proofs of all the results can be found in Appendix A.

Preliminaries
In this section, we report the essential background that will be used in the subsequent sections.
(Configurable) Markov Decision Processes A discrete-time Markov Decision Process (MDP, Puterman 2014) is defined by the tuple M = (S, A, p, µ, r, γ), where S and A are the state space and the action space respectively, p is the transition model that provides, for every state-action pair (s, a) ∈ S × A, a probability distribution over the next state p(·|s, a), µ is the distribution of the initial state, r is the reward model defining the reward r(s, a) collected by the agent when performing action a ∈ A in state s ∈ S, and γ ∈ [0, 1] is the discount factor. The behavior of an agent is defined by means of a policy π that provides a probability distribution over the actions π(·|s) for every state s ∈ S. An MDP M paired with a policy π induces a γ-discounted stationary distribution over the states (Sutton et al. 1999), defined as d^π_µ(s) = (1 − γ) ∑_{t=0}^{+∞} γ^t Pr(s_t = s | M, π). We limit the scope to parametric policy spaces Π_Θ = {π_θ : θ ∈ Θ}, where Θ ⊆ R^d is the parameter space. The goal of the agent is to find an optimal policy, i.e., any policy parametrization that maximizes the expected return:

J_M(θ) = E_{s ∼ d^{π_θ}_µ, a ∼ π_θ(·|s)} [r(s, a)].   (1)

In this paper, we consider a slightly modified version of the Conf-MDPs (Metelli, Mutti, and Restelli 2018).

Definition 2.1. A Configurable Markov Decision Process (Conf-MDP) induced by the configuration space Ω ⊆ R^p is defined as the set of MDPs C_Ω = {M_ω = (S, A, p_ω, µ_ω, r, γ) : ω ∈ Ω}.

The main differences w.r.t. the original definition are: i) we allow the configuration of the initial state distribution µ_ω, in addition to the transition model p_ω; ii) we restrict to the case of parametric configuration spaces Ω; iii) we do not consider the policy space Π_Θ as a part of the Conf-MDP.
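The γ-discounted stationary distribution d^π_µ above can be sampled from by running the chain for a geometrically distributed number of steps. A minimal sketch (the `reset`, `step`, and `policy` callables are hypothetical placeholders for an environment interface, not part of the paper):

```python
import numpy as np

def sample_discounted_state(reset, step, policy, gamma, rng):
    """Draw one state s ~ d^pi_mu: run the Markov chain for T transitions,
    where T + 1 ~ Geometric(1 - gamma), so that the returned state is the
    one visited at time T with probability (1 - gamma) * gamma^T."""
    s = reset(rng)
    for _ in range(rng.geometric(1.0 - gamma) - 1):
        s = step(s, policy(s, rng), rng)
    return s
```

Repeating this draw n times yields the i.i.d. states used later in the dataset D (this requires γ < 1, since `Generator.geometric` needs a strictly positive success probability).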

Generalized Likelihood Ratio Test The Generalized Likelihood Ratio test (GLR, Barnard 1959; Casella and Berger 2002) aims at testing the goodness of fit of two statistical models. Given a parametric model having density function p_θ with θ ∈ Θ, we aim at testing the null hypothesis H_0 : θ* ∈ Θ_0, where Θ_0 ⊂ Θ is a subset of the parameter space, against the alternative H_1 : θ* ∈ Θ∖Θ_0. Given a dataset D = {X_i}_{i=1}^n sampled independently from p_{θ*}, where θ* is the true parameter, the GLR statistic is:

Λ = sup_{θ∈Θ_0} L̂(θ) / sup_{θ∈Θ} L̂(θ),

where L̂(θ) = ∏_{i=1}^n p_θ(X_i) is the likelihood function. We denote with ℓ̂(θ) = −log L̂(θ) the negative log-likelihood function, and with θ̂ ∈ arg sup_{θ∈Θ} L̂(θ) and θ̂_0 ∈ arg sup_{θ∈Θ_0} L̂(θ) the maximum likelihood solutions in Θ and Θ_0 respectively. Moreover, we define the expectation of the negative log-likelihood under the true parameter: ℓ(θ) = E_{X_i∼p_{θ*}}[ℓ̂(θ)]. As the maximization is carried out employing the same dataset D and recalling that Θ_0 ⊂ Θ, we have that Λ ∈ [0, 1]. It is usually convenient to consider the logarithm of the GLR statistic: λ = −2 log Λ = 2(ℓ̂(θ̂_0) − ℓ̂(θ̂)). Therefore, H_0 is rejected for large values of λ, i.e., when the maximum likelihood parameter searched in the restricted set Θ_0 significantly underfits the data D, compared to Θ. Wilks' theorem provides the asymptotic distribution of λ when H_0 is true (Wilks 1938; Casella and Berger 2002).

Theorem 2.1 (Wilks' theorem). Let d = dim(Θ) and d_0 = dim(Θ_0) < d. Under suitable regularity conditions (see Casella and Berger (2002), Section 10.6.2), if H_0 is true, then, when n → +∞, the distribution of λ tends to a χ² distribution with d − d_0 degrees of freedom.
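As a self-contained illustration of the log-GLR statistic λ = 2(ℓ̂(θ̂_0) − ℓ̂(θ̂)), consider testing whether the mean of a univariate Gaussian with known variance is zero; this toy example is not from the paper, but both MLEs are closed-form, and the χ²(1) survival function needed by Wilks' approximation equals erfc(√(λ/2)):

```python
import math
import numpy as np

def log_glr_gaussian_mean(x, sigma=1.0):
    """Log-GLR for H0: mean = 0 vs H1: mean != 0, known variance sigma^2.
    The unrestricted MLE is the sample mean; the restricted one is 0.
    Closed form: lambda = n * xbar^2 / sigma^2."""
    nll = lambda m: 0.5 * np.sum((x - m) ** 2) / sigma ** 2  # up to constants
    lam = 2.0 * (nll(0.0) - nll(np.mean(x)))
    # Wilks: lam ~ chi2(1) under H0; chi2(1) survival function is erfc(sqrt(lam/2))
    p_value = math.erfc(math.sqrt(lam / 2.0))
    return lam, p_value
```

For instance, four observations all equal to 1 give λ = n·x̄²/σ² = 4, which exceeds the usual χ²(1) critical value 3.84, so H_0 is rejected at the 5% level.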
The significance of a test α ∈ [0, 1], or type I error probability, is the probability of rejecting H_0 when H_0 is true, while the power of a test 1 − β ∈ [0, 1] is the probability of rejecting H_0 when H_0 is false; β is the type II error probability.

Policy Space Identification in a Fixed Environment
As we introduced in Section 1, we aim at identifying the agent's policy space by observing a set of demonstrations coming from the optimal policy π* ∈ Π_Θ only, i.e., D = {(s_i, a_i)}_{i=1}^n, where s_i ∼ d^{π*}_µ and a_i ∼ π*(·|s_i) are sampled independently. In particular, we assume that the agent has control over a limited number of parameters d* < d, whose value can be changed during learning, while the remaining d − d* are kept fixed to zero. Given a set of indexes I ⊆ {1, ..., d}, we define the subset of the parameter space: Θ_I = {θ ∈ Θ : θ_i = 0, ∀i ∈ {1, ..., d}∖I}. Thus, the set I represents the indexes of the parameters that could be changed if the agent's parameter space were Θ_I. Our goal is to find a set of parameter indexes I* that is sufficient to explain the agent's policy, i.e., π* ∈ Π_{Θ_{I*}}, but also necessary, in the sense that, when removing any i ∈ I*, the remaining ones are insufficient to explain the agent's policy, i.e., π* ∉ Π_{Θ_{I*∖{i}}}. We formalize these notions in the following definition.
We denote with 𝓘* the set of all correct sets I*. The uniqueness of I* is guaranteed under the identifiability assumption (Assumption 3.1), namely that each policy admits a unique representation in Π_Θ.
The following two subsections are devoted to the presentation of the identification rules based on the application of Definition 3.1 (Section 3.1) and Lemma 3.1 (Section 3.2) when we only have access to a dataset of samples D. The goal of an identification rule consists in producing a set Î approximating I*.

Combinatorial Identification Rule
In principle, using D = {(s_i, a_i)}_{i=1}^n, we could compute the maximum likelihood parameter θ̂ ∈ arg sup_{θ∈Θ} L̂(θ) and employ it with Definition 3.1. However, this approach has, at least, two drawbacks. First, when Assumption 3.1 is not fulfilled, it would produce a single approximate parameter, while multiple choices might be viable. Second, because of the estimation errors, we would hardly get a zero value for the parameters the agent might not control. For these reasons, we employ a GLR test to assess whether a specific set of parameters is zero. Specifically, for all I ⊆ {1, ..., d}, we consider the pair of hypotheses H_{0,I} : π* ∈ Π_{Θ_I} against H_{1,I} : π* ∈ Π_{Θ∖Θ_I} and the GLR statistic:

λ_I = −2 log ( sup_{θ∈Θ_I} L̂(θ) / sup_{θ∈Θ} L̂(θ) ) = 2(ℓ̂(θ̂_I) − ℓ̂(θ̂)),   (3)

where the likelihood is defined as L̂(θ) = ∏_{i=1}^n π_θ(a_i|s_i), θ̂_I ∈ arg sup_{θ∈Θ_I} L̂(θ) and θ̂ ∈ arg sup_{θ∈Θ} L̂(θ). We now state the identification rule derived from Definition 3.1.

Identification Rule 3.1. Î_c contains all and only the sets of parameter indexes I ⊆ {1, ..., d} such that:

λ_I ≤ c(|I|)  and  ∀i ∈ I : λ_{I∖{i}} > c(|I∖{i}|),   (4)

where c(l) are the critical values.
Thus, I is defined in such a way that the null hypothesis H_{0,I} is not rejected, i.e., I contains parameters that are sufficient to explain the data D, and necessary, since for all i ∈ I the set I∖{i} is no longer sufficient, as H_{0,I∖{i}} is rejected. The critical values c(l), which depend on the cardinality l of the tested set of indexes, should be determined in order to enforce guarantees on the type I and II errors. We will show in Section 6 how to set them in practice. Refer to Algorithm 1 for the pseudocode of the identification rule.
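The combinatorial rule above can be sketched as follows, assuming the GLR statistics λ_I have already been computed from D for every subset; the helper names are hypothetical:

```python
from itertools import combinations

def combinatorial_rule(d, lam, c):
    """Identification Rule 3.1 (sketch): return every index set I that is
    sufficient (lam[I] <= c(|I|)) and necessary (removing any i from I
    makes the restricted hypothesis rejected).
    `lam` maps a frozenset of indexes to its GLR statistic lambda_I,
    `c` maps a cardinality to its critical value."""
    selected = []
    for k in range(d + 1):
        for I in map(frozenset, combinations(range(d), k)):
            sufficient = lam[I] <= c(len(I))
            necessary = all(lam[I - {i}] > c(len(I) - 1) for i in I)
            if sufficient and necessary:
                selected.append(set(I))
    return selected
```

The double loop makes the O(2^d) cost of the rule explicit: one statistic (hence one restricted maximum-likelihood fit) per subset of {1, ..., d}.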

Simplified Identification Rule
Identification Rule 3.1 is usually impractical, as it requires performing O(2^d) statistical tests. However, under Assumption 3.1, to retrieve I* we do not need to test all subsets, but we can just examine one parameter at a time (see Lemma 3.1). Thus, for all i ∈ {1, ..., d}, we consider the pair of hypotheses H_{0,i} : θ*_i = 0 against H_{1,i} : θ*_i ≠ 0 and define Θ_i = {θ ∈ Θ : θ_i = 0}. The GLR test can be performed straightforwardly, using the statistic:

λ_i = −2 log ( sup_{θ∈Θ_i} L̂(θ) / sup_{θ∈Θ} L̂(θ) ) = 2(ℓ̂(θ̂_i) − ℓ̂(θ̂)),   (5)

where the likelihood is defined as L̂(θ) = ∏_{i=1}^n π_θ(a_i|s_i), θ̂_i ∈ arg sup_{θ∈Θ_i} L̂(θ) and θ̂ ∈ arg sup_{θ∈Θ} L̂(θ). In the spirit of Lemma 3.1, we define the identification rule.
Identification Rule 3.2. Î_c contains the unique set of parameter indexes such that:

Î_c = {i ∈ {1, ..., d} : λ_i > c(1)}.   (6)

Therefore, the identification rule constructs Î_c by taking all the indexes i ∈ {1, ..., d} such that the corresponding null hypothesis H_{0,i} : θ*_i = 0 is rejected, i.e., those for which there is statistical evidence that their value is not zero. We will show in Section 4 how the critical value c(1) can be computed, in a theoretically sound way, for linear policies belonging to the exponential family.
This second procedure requires a test for every parameter, i.e., O(d) instead of O(2^d) tests. However, it comes with the cost of assuming the identifiability property. What happens if we employ this second procedure in a case where the assumption does not hold? Consider, for instance, the case in which two parameters are exchangeable: we will include neither of them in Î_c since, individually, they are not necessary to explain the agent's policy. Refer to Algorithm 2 for the pseudocode of the identification rule.
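Given the per-parameter statistics λ_i, the simplified rule is a single linear scan; a minimal sketch:

```python
def simplified_rule(d, lam_i, c1):
    """Identification Rule 3.2 (sketch): keep index i iff H_{0,i} is
    rejected, i.e., its per-parameter GLR statistic exceeds c(1).
    `lam_i[i]` is the statistic for testing theta_i = 0."""
    return {i for i in range(d) if lam_i[i] > c1}
```

Each λ_i still requires one restricted maximum-likelihood fit (with θ_i clamped to zero), but only d of them are needed in total.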

Analysis for the Exponential Family
In this section, we provide an analysis of Identification Rule 3.2 for a policy π_θ, linear in some state features φ, that belongs to the exponential family (Brown 1986).

Definition 4.1 (Exponential Family). Let φ : S → R^q be a feature function. The policy space Π_Θ is a space of linear policies, belonging to the exponential family, if Θ = R^d and all policies π_θ ∈ Π_Θ have the form:

π_θ(a|s) = h(a) exp{θ^T t(s, a) − A(θ, s)},

where h is a positive function, t(s, a) is the sufficient statistic that depends on the state via the feature function φ (i.e., t(s, a) = t(φ(s), a)) and A(θ, s) = log ∫_A h(a) exp{θ^T t(s, a)} da is the log-partition function. We denote with t̄(s, a, θ) = t(s, a) − E_{a∼π_θ(·|s)}[t(s, a)] the centered sufficient statistic.
This definition allows modelling the linear policies that are often used in RL (Deisenroth, Neumann, and Peters 2013). Table 1 shows how to map the Gaussian linear policy with fixed covariance, typically used in continuous action spaces, and the Boltzmann linear policy, suitable for finite action spaces, to Definition 4.1 (details in Appendix A.1).
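For concreteness, the negative log-likelihood of a Boltzmann linear policy can be sketched as follows; the fixed zero logit for the last action mirrors the identifiable k-row parametrization of Table 1, while the array shapes are illustrative assumptions:

```python
import numpy as np

def boltzmann_nll(theta, feats, actions):
    """Negative log-likelihood of a Boltzmann linear policy over k+1
    actions: logits are theta @ phi(s) for the first k actions and 0
    for the last one. theta: (k, q) matrix, feats: (n, q) array of
    feature vectors phi(s_i), actions: length-n indexes in {0, ..., k}."""
    logits = feats @ theta.T                                 # (n, k)
    logits = np.hstack([logits, np.zeros((len(feats), 1))])  # action k+1
    log_z = np.log(np.exp(logits).sum(axis=1))               # log-partition
    return float(-(logits[np.arange(len(actions)), actions] - log_z).sum())
```

Minimizing this function over the full space Θ, and over each restricted space Θ_i, yields the ℓ̂(θ̂) and ℓ̂(θ̂_i) entering the statistic λ_i of Identification Rule 3.2.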
For the sake of the analysis, we enforce the following assumption concerning the tail behavior of the policy π_θ.

Assumption 4.1 (Subgaussianity). For any θ ∈ Θ and for any s ∈ S, the centered sufficient statistic t̄(s, a, θ) is subgaussian with parameter σ ≥ 0, i.e., for any α ∈ R^d:

E_{a∼π_θ(·|s)} [ exp{α^T t̄(s, a, θ)} ] ≤ exp{ ‖α‖²₂ σ² / 2 }.

Proposition A.2 of Appendix A.2 proves that, when the features are uniformly bounded, i.e., ‖φ(s)‖₂ ≤ Φ_max for all s ∈ S, this assumption is fulfilled by both Gaussian linear policies (with parameter σ = Φ_max/√(λ_min(Σ))) and Boltzmann linear policies (with parameter σ = 2Φ_max). Furthermore, limited to the policies complying with Definition 4.1, identifiability (Assumption 3.1) can be restated in terms of the Fisher Information Matrix (Rothenberg 1971; Little, Heidenreich, and Li 2010).

Lemma 4.1 (Rothenberg 1971, Theorem 3). Let Π_Θ be a policy space, as in Definition 4.1. Then, under suitable regularity conditions (see Rothenberg 1971), if the Fisher Information Matrix (FIM):

F(θ) = E_{s∼d^{π_θ}_µ, a∼π_θ(·|s)} [ t̄(s, a, θ) t̄(s, a, θ)^T ]

is non-singular for all θ ∈ Θ, then Π_Θ is identifiable. In this case, we denote λ_min = inf_{θ∈Θ} λ_min(F(θ)) > 0.

Table 1: Action space A, probability density function π_θ̃, sufficient statistic t, and function h for the Gaussian linear policy with fixed covariance and the Boltzmann linear policy. For convenience of representation, θ̃ ∈ R^{k×q} is a matrix and θ = vec(θ̃^T) ∈ R^d, with d = kq. We denote with e_i the i-th vector of the canonical basis of R^k and with ⊗ the Kronecker product.
Proposition A.1 of Appendix A.2 shows that a sufficient condition for identifiability, in the case of Gaussian and Boltzmann linear policies, is that the second-moment matrix of the feature vector, E_{s∼d^π_µ}[φ(s)φ(s)^T], is non-singular, together with the requirement, for the Boltzmann policy, that π_θ plays each action with positive probability.

Concentration Result
We are now ready to present a concentration result, of independent interest, for the parameters and the negative log-likelihood, which represents the central tool of our analysis (details and derivation in Appendix A.2).

Theorem 4.1. Under Assumptions 3.1 and 4.1, let D = {(s_i, a_i)}_{i=1}^n be a dataset of n > 0 independent samples, where s_i ∼ d^{π_θ*}_µ and a_i ∼ π_θ*(·|s_i). Let θ̂ = arg min_{θ∈Θ} ℓ̂(θ) and θ* = arg min_{θ∈Θ} ℓ(θ). If the empirical FIM:

F̂(θ) = (1/n) ∑_{i=1}^n t̄(s_i, a_i, θ) t̄(s_i, a_i, θ)^T

has a positive minimum eigenvalue λ̂_min > 0 for all θ ∈ Θ, then, for any δ ∈ [0, 1], with probability at least 1 − δ, the parameter distance ‖θ̂ − θ*‖₂ concentrates at rate O(n^{-1/2}) and, individually with probability at least 1 − δ, the likelihood gaps concentrate at the faster rate O(n^{-1}); the explicit bounds are reported in Appendix A.2.

The theorem shows that the L₂-norm of the difference between the maximum likelihood parameter θ̂ and the true parameter θ* concentrates with rate O(n^{-1/2}), while the likelihood ℓ̂ and its expectation ℓ concentrate with the faster rate O(n^{-1}). Note that the result assumes that the empirical FIM F̂(θ) has a strictly positive minimum eigenvalue λ̂_min > 0. This condition can be enforced as long as the true Fisher matrix F(θ) has a positive minimum eigenvalue λ_min, i.e., under the identifiability assumption (Lemma 4.1), and given a sufficiently large number of samples. Proposition A.4 of Appendix A.2 provides the minimum number of samples such that, with probability at least 1 − δ, it holds that λ̂_min > 0.
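The empirical FIM condition of Theorem 4.1 is easy to check numerically once the centered sufficient statistics have been evaluated on the dataset; a minimal sketch:

```python
import numpy as np

def empirical_fim_min_eig(t_bar):
    """t_bar: (n, d) array whose i-th row is the centered sufficient
    statistic t_bar(s_i, a_i, theta). Returns the minimum eigenvalue of
    F_hat = (1/n) * sum of outer products, the quantity that must stay
    strictly positive in Theorem 4.1."""
    fim = t_bar.T @ t_bar / len(t_bar)
    return float(np.linalg.eigvalsh(fim)[0])  # eigvalsh sorts ascending
```

A zero (or near-zero) return value signals a non-identifiable direction in the data, e.g., a feature that never varies in the visited states.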

Identification Rule Analysis
The goal of the analysis of the identification rule is to find the critical value c(1) so that the following probabilistic requirement is enforced.
We denote with α = 1/(d − d*) · E[|{i ∉ I* : i ∈ Î_c}|] the expected fraction of parameters that the agent does not control selected by the identification rule, and with β = 1/d* · E[|{i ∈ I* : i ∉ Î_c}|] the expected fraction of parameters that the agent does control not selected by the identification rule. We now provide a result that bounds α and β and employs them to derive δ-correctness.
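The empirical counterparts of α and β (the quantities plotted in Section 6) are straightforward to compute for a single identification run; a minimal sketch:

```python
def alpha_beta_hat(I_hat, I_star, d):
    """Empirical alpha / beta for one run: the fraction of uncontrolled
    parameters wrongly selected, and the fraction of controlled
    parameters missed. I_hat and I_star are sets of indexes in
    {0, ..., d-1}; averaging over runs estimates alpha and beta."""
    d_star = len(I_star)
    alpha = len(I_hat - I_star) / (d - d_star) if d > d_star else 0.0
    beta = len(I_star - I_hat) / d_star if d_star else 0.0
    return alpha, beta
```

Averaging these two quantities over independent runs gives the α̂ and β̂ curves reported in the experiments.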
Since α and β are functions of c(1), we could, in principle, employ Theorem 4.2 to enforce a value δ, as in Definition 4.2, and derive c(1). However, Theorem 4.2 is not very attractive in practice, as it holds under an assumption regarding the minimum eigenvalue of the FIM and the corresponding estimate, i.e., λ̂_min ≥ λ_min.

Policy Space Identification in a Configurable Environment
The identification rules presented so far are unable to distinguish between a parameter set to zero because the agent cannot control it and one set to zero because zero is its optimal value. To overcome this issue, we employ the Conf-MDP properties to select a configuration in which the parameters we want to examine have an optimal value other than zero. Intuitively, if we want to test whether the agent can control parameter θ_i, we should place the agent in an environment ω_i ∈ Ω where θ_i is "maximally important" for the optimal policy. This intuition is justified by Theorem 4.2, since, to maximize the power of the test (1 − β), all other things being equal, we should maximize the log-likelihood gap ℓ(θ*_i) − ℓ(θ*), i.e., parameter θ_i should be essential to justify the agent's behavior. Let I ⊆ {1, ..., d} be a set of parameter indexes we want to test; our ideal goal is to find the environment ω_I such that:

ω_I ∈ arg max_{ω∈Ω} ℓ(θ*_I(ω)) − ℓ(θ*(ω)),   (10)

where θ*(ω) ∈ arg max_{θ∈Θ} J_{M_ω}(θ) and θ*_I(ω) ∈ arg max_{θ∈Θ_I} J_{M_ω}(θ) are the parameters of the optimal policies in the environment M_ω in Π_Θ and Π_{Θ_I} respectively. Clearly, given the samples D collected with a single optimal policy π*(ω_0) in a single environment M_{ω_0}, solving problem (10) is hard, as it requires performing an off-distribution optimization over both the space of policy parameters and the space of configurations. For these reasons, we consider a surrogate objective that assumes that the optimal parameter in the new configuration can be reached by performing a single gradient step.

Theorem 5.1. Let I ⊆ {1, ..., d} and Ī = {1, ..., d}∖I. For a vector v, we denote with v|_I the vector obtained by setting to zero the components whose indexes are in Ī. Let θ*(ω_0) ∈ Θ be the initial parameter. Let α ≥ 0, θ*_I(ω) = θ_0 + α∇_θ J_{M_ω}(θ*(ω_0))|_Ī and θ*(ω) = θ_0 + α∇_θ J_{M_ω}(θ*(ω_0)). Then, under Assumption 3.1, the gap ℓ(θ*_I(ω)) − ℓ(θ*(ω)) is lower-bounded by a quantity proportional to ‖∇_θ J_{M_ω}(θ*(ω_0))|_I‖²₂.

Thus, we maximize the L₂-norm of the gradient components that correspond to the parameters we want to test.
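Under the one-gradient-step assumption of Theorem 5.1, configuration selection reduces to picking the ω whose policy gradient has the largest norm on the tested components; a sketch over a finite set of candidates (the candidate configurations and their gradient estimates are hypothetical inputs):

```python
import numpy as np

def best_config(grad_estimates, I):
    """Surrogate of problem (10): among candidate configurations, pick
    the one whose estimated policy gradient at theta*(omega_0) has the
    largest L2 norm on the tested components I.
    `grad_estimates` maps omega -> d-dimensional gradient estimate."""
    I = sorted(I)
    return max(grad_estimates,
               key=lambda w: float(np.linalg.norm(grad_estimates[w][I])))
```

In the continuous-configuration case of the paper, the same criterion is optimized over ω ∈ Ω rather than over a finite candidate set.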
Since we have at our disposal only samples D collected with the current policy π_{θ*(ω_0)} and in the current environment ω_0, we have to perform an off-distribution optimization over ω. To this end, we optimize the empirical version of the objective, with a penalization that accounts for the distance between the distributions over trajectories (Equation (11)), where ∇̂_θ J_{M_{ω/ω_0}}(θ*(ω_0)) is an off-distribution estimator of the gradient ∇_θ J_{M_ω}(θ*(ω_0)) using samples collected with ω_0, and d̂_2 is the estimated 2-Rényi divergence (van Erven and Harremoës 2014), which works as a penalization to discourage configurations ω whose trajectory distribution is too dissimilar from that of ω_0.

Experimental Evaluation
In this section, we present the experimental evaluation of the identification rules in three RL domains. To set the values of c(l), we resort to Wilks' asymptotic approximation (Theorem 2.1) to enforce (asymptotic) guarantees on the type I error. For Identification Rule 3.1, we perform 2^d statistical tests by using the same dataset D. Thus, we partition δ using a Bonferroni correction, setting c(l) = χ²_{l, 1−δ/2^d}, where χ²_{l,ξ} is the ξ-quantile of a chi-squared distribution with l degrees of freedom. Instead, for Identification Rule 3.2, we perform d statistical tests and, thus, we set c(1) = χ²_{1, 1−δ/d}.
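The Bonferroni-corrected critical value c(1) = χ²_{1, 1−δ/d} used by the simplified rule can be computed without a statistics library, since the χ²(1) survival function is erfc(√(x/2)); a minimal sketch inverting it by bisection (for general c(l) one would instead use a chi-squared quantile routine such as scipy.stats.chi2.ppf):

```python
import math

def c1_critical(delta, d):
    """c(1) = chi2 quantile of order 1 - delta/d with 1 degree of
    freedom: solve erfc(sqrt(x/2)) = delta/d for x by bisection
    (the survival function is strictly decreasing in x)."""
    target = delta / d
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if math.erfc(math.sqrt(mid / 2.0)) > target:
            lo = mid   # survival too large: threshold must grow
        else:
            hi = mid
    return lo
```

With δ = 0.05 and d = 1, this recovers the familiar 3.84 threshold; the correction makes c(1) grow with d, so each of the d tests becomes more conservative.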

Discrete Grid World
The grid world environment is a simple representation of a two-dimensional world (5×5 cells) in which an agent has to reach a target position by moving in the four directions. The goal of this set of experiments is to show the advantages of configuring the environment when performing policy space identification using Rule 3.2. The initial position of the agent and the target position are drawn, at the beginning of each episode, from a Boltzmann distribution µ_ω. The agent plays a Boltzmann linear policy π_θ with binary features φ indicating its current row and column and the row and column of the goal. For each run, the agent can control a randomly selected subset I* of the parameters θ_{I*} associated with those features. Furthermore, the supervisor can configure the environment by changing the parameters ω of the initial state distribution µ_ω. Thus, the supervisor can induce the agent to explore certain regions of the grid world and, consequently, change the relevance of the corresponding parameters in the optimal policy. Figure 1 shows the empirical α̂ and β̂, i.e., the fraction of parameters not controlled by the agent that are wrongly selected and the fraction of parameters controlled by the agent that are not selected, respectively, as a function of the number n of episodes used to perform the identification. We compare two cases: conf, where the identification is carried out by also configuring the environment, i.e., optimizing Equation (11), and no-conf, in which the identification is performed in the original environment only. In both cases, we can see that α̂ is almost independent of the number of samples, as it is directly controlled by the critical value c(1). Differently, β̂ decreases as the number of samples increases, i.e., the power of the test 1 − β̂ increases with n. Remarkably, we observe that configuring the environment gives a significant advantage in understanding the parameters controlled by the agent w.r.t. using a fixed environment, as β̂ decreases faster in the conf case. This phenomenon also justifies empirically our choice of objective (Equation (11)) for selecting the new environment. Hyperparameters, further experimental results, together with experiments on a continuous version of the grid world, are reported in Appendix C.1-C.2.

Figure 2: Minigolf: Performance of the optimal policy varying the putter length ω for agents A_1 and A_2 (left) and performance of the optimal policy for agent A_2 with four different strategies for selecting ω (right). 100 runs, 95% c.i.

Minigolf
In the Minigolf environment (Lazaric, Restelli, and Bonarini 2007), an agent hits a ball using a putter, with the goal of reaching the hole in the minimum number of attempts. Surpassing the hole causes the termination of the episode and a large penalization. The agent selects the force applied to the putter by playing a Gaussian policy, linear in some polynomial features (complying with Lemma 4.1) of the distance from the hole (x) and the friction of the green (f). We consider two agents: A_1 has access to both x and f, whereas A_2 knows only x. Thus, we expect that A_1 learns a policy that allows reaching the hole in a smaller number of hits, compared to A_2, as it can calibrate the force according to the friction; whereas A_2 has to be more conservative, being unaware of f. There is also a supervisor in charge of selecting, for the two agents, the best putter length ω, i.e., the configurable parameter of the environment. In this experiment, we want to highlight that knowing the policy space might be of crucial importance when learning in a Conf-MDP. Figure 2-left shows the performance of the optimal policy as a function of the putter length ω. We can see that for agent A_1 the optimal putter length is ω*_{A_1} = 5, while for agent A_2 it is ω*_{A_2} = 11.5. Figure 2-right compares the performance of the optimal policy of agent A_2 when the putter length ω is chosen by the supervisor using four different strategies. In (i), the configuration is sampled uniformly in the interval [1, 15]. In (ii), the supervisor employs the optimal configuration for agent A_1 (ω = 5), i.e., assuming the agent is aware of the friction. (iii) is obtained by selecting the optimal configuration for the policy space produced by our Identification Rule 3.2. Finally, (iv) is derived by employing an oracle that knows the true agent's policy space (ω = 11.5).
We can see that the performance of the identification procedure (iii) is comparable with that of the oracle (iv) and notably higher than the performance when employing an incorrect policy space (ii). Hyperparameters and additional experiments are reported in Appendix C.3.

Simulated Car Driving
We consider a simple version of a car driving simulator, in which the agent has to reach the end of a road in the minimum amount of time, avoiding running off-road. The agent perceives its speed and four sensors, placed at different angles, that provide the distance from the edge of the road, and it can act on acceleration and steering. The purpose of this experiment is to show a case in which the identifiability assumption (Assumption 3.1) may not be satisfied. The policy π_θ is modeled as a Gaussian policy whose mean is computed via a single-hidden-layer neural network with 8 neurons. Some of the sensors are not available to the agent, and our goal is to identify which ones the agent can perceive. In Figure 3, we compare the performance of Identification Rules 3.1 (Combinatorial) and 3.2 (Simplified), showing the fraction of runs that correctly identify the policy space. We note that, while for a small number of samples the simplified rule seems to outperform the combinatorial one, when the number of samples increases the combinatorial rule displays remarkable stability, approaching correct identification in all the runs. This is explained by the fact that, when multiple representations of the same policy are possible, considering one parameter at a time might induce the simplified rule to select a wrong set of parameters. Hyperparameters are reported in Appendix C.4.

Discussion and Conclusions
In this paper, we addressed the problem of identifying the policy space of an agent by simply observing its behavior when playing the optimal policy. We introduced two identification rules, both based on the GLR test, which can be applied to select the parameters controlled by the agent. Additionally, we have shown how to use the configurability property of the environment to enhance the effectiveness of identification rules. The experimental evaluation highlights two essential points. First, the identification of the policy space brings advantages to the learning process in a Conf-MDP, helping to choose wisely the most suitable environment configuration. Second, we have illustrated that configuring the environment is beneficial to speed up the identification process. We believe that this work opens numerous future research directions, both theoretical, such as the analysis of the combinatorial identification rule, and empirical, like the application of our identification rules to imitation learning settings.

A Proofs and Derivations
In this appendix, we report the proofs and derivations of the results presented in the main paper.

A.1 Gaussian and Boltzmann Linear Policies as Exponential Family distributions
In this appendix, we show how a multivariate Gaussian policy with fixed covariance and a Boltzmann policy, both linear in the state features $\phi(s)$, can be cast into Definition 4.1. We are going to make use of the following identities regarding the Kronecker product (Petersen, Pedersen, and others 2008):
$$\mathrm{vec}(AXB) = \left(B^T \otimes A\right) \mathrm{vec}(X), \qquad x^T A y = \mathrm{vec}(A)^T \left(y \otimes x\right),$$
where $\mathrm{vec}(X)$ is the vectorization of matrix $X$, obtained by stacking the columns of $X$ into a single column vector.
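The two vectorization identities above can be verified numerically; the following snippet (an illustrative sketch with arbitrary random matrices) checks both.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 2))

def vec(M):
    # stack the columns of M into a single column vector (column-major order)
    return M.reshape(-1, order="F")

# vec(A X B) = (B^T kron A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)

# special case: x^T M y = vec(M)^T (y kron x)
x = rng.normal(size=3)
M = rng.normal(size=(3, 5))
y = rng.normal(size=5)
assert np.isclose(x @ M @ y, vec(M) @ np.kron(y, x))
```

The second identity follows from the first by taking $A = x^T$, $X = M$, $B = y$.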
Multivariate Linear Gaussian Policy with fixed covariance. The typical representation of a multivariate linear Gaussian policy is given by the following probability density function:
$$\pi_{\widetilde{\theta}}(a|s) \propto \exp\left\{ -\frac{1}{2} \left(a - \widetilde{\theta}\phi(s)\right)^T \Sigma^{-1} \left(a - \widetilde{\theta}\phi(s)\right) \right\},$$
where $\widetilde{\theta} \in \mathbb{R}^{k \times q}$ is a properly sized matrix. Recalling Definition 4.1, we rephrase the previous equation by expanding the square:
$$\pi_{\widetilde{\theta}}(a|s) \propto \exp\left\{ \phi(s)^T \widetilde{\theta}^T \Sigma^{-1} a - \frac{1}{2} \phi(s)^T \widetilde{\theta}^T \Sigma^{-1} \widetilde{\theta} \phi(s) - \frac{1}{2} a^T \Sigma^{-1} a \right\}.$$
Recalling the Kronecker identities above and observing that $\phi(s)^T \widetilde{\theta}^T \Sigma^{-1} a$ and $\phi(s)^T \widetilde{\theta}^T \Sigma^{-1} \widetilde{\theta} \phi(s)$ are scalar, we can rewrite:
$$\phi(s)^T \widetilde{\theta}^T \Sigma^{-1} a = \mathrm{vec}\left(\widetilde{\theta}^T\right)^T \left( \Sigma^{-1} a \otimes \phi(s) \right).$$
Now, by redefining the parameter of the exponential family distribution as $\theta = \mathrm{vec}\left(\widetilde{\theta}^T\right)$, we can make the following assignments to comply with Definition 4.1:
$$t(s, a) = \Sigma^{-1} a \otimes \phi(s), \qquad h(a) = \exp\left\{ -\frac{1}{2} a^T \Sigma^{-1} a \right\}.$$

Boltzmann Linear Policy. The Boltzmann policy on a finite set of actions $\{a_1, \dots, a_{k+1}\}$ is typically represented by means of a matrix of parameters $\widetilde{\theta} \in \mathbb{R}^{k \times q}$:
$$\pi_{\widetilde{\theta}}(a_i|s) = \frac{\exp\left\{ \widetilde{\theta}_i \phi(s) \right\}}{\sum_{j=1}^{k+1} \exp\left\{ \widetilde{\theta}_j \phi(s) \right\}},$$
where $\widetilde{\theta}_i$ denotes the $i$-th row of matrix $\widetilde{\theta}$ and $\widetilde{\theta}_{k+1} = \mathbf{0}$. Notice that we are considering a set of $k+1$ actions, but the matrix $\widetilde{\theta}$ has only $k$ rows. This allows enforcing the identifiability property; otherwise, if we had a row for each of the $k+1$ actions, we would have multiple representations of the same policy (rescaling the rows by the same amount). By introducing the vector $e_i$ as the $i$-th vector of the canonical basis of $\mathbb{R}^k$, i.e., the vector having 1 in the $i$-th component and 0 elsewhere, and recalling the definition of Kronecker product, we can derive the following identity for $i \leq k$:
$$\widetilde{\theta}_i \phi(s) = \mathrm{vec}\left(\widetilde{\theta}^T\right)^T \left( e_i \otimes \phi(s) \right).$$
In the case $i = k+1$, it is sufficient to replace $e_i$ with the zero vector $\mathbf{0}$. Therefore, by renaming $\theta = \mathrm{vec}\left(\widetilde{\theta}^T\right)$, we can make the following assignments in order to get the relevant quantities of Definition 4.1:
$$t(s, a_i) = e_i \otimes \phi(s) \text{ for } i \leq k, \qquad t(s, a_{k+1}) = \mathbf{0}, \qquad h(a) = 1.$$
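The Boltzmann reparameterization can be checked numerically. The following sketch (with arbitrary dimensions and random parameters) verifies that $\theta^T (e_i \otimes \phi(s)) = \widetilde{\theta}_i \phi(s)$ when $\theta = \mathrm{vec}(\widetilde{\theta}^T)$, i.e., when the rows of $\widetilde{\theta}$ are stacked into a single vector.

```python
import numpy as np

rng = np.random.default_rng(1)
k, q = 3, 4                           # k+1 actions, q-dimensional features
theta_mat = rng.normal(size=(k, q))   # the matrix theta~ (row k+1 implicitly zero)
phi = rng.normal(size=q)              # feature vector phi(s) for some state s

theta = theta_mat.reshape(-1)         # vec(theta~^T): rows of theta~ stacked

# the identity theta~_i phi(s) = vec(theta~^T)^T (e_i kron phi(s)) for i <= k
for i in range(k):
    e = np.zeros(k); e[i] = 1.0
    assert np.isclose(theta @ np.kron(e, phi), theta_mat[i] @ phi)

# action k+1 gets the zero sufficient statistic, removing the redundant row
logits = np.append(theta_mat @ phi, 0.0)
pi = np.exp(logits) / np.exp(logits).sum()
assert np.isclose(pi.sum(), 1.0)
```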

A.2 Results on Exponential Family
In this appendix, we derive several results that are used in Section 4, concerning policies belonging to the exponential family, as in Definition 4.1.

Fisher Information Matrix
We start by providing an expression of the Fisher Information Matrix (FIM) for the specific case of the exponential family, which we are going to use extensively in the derivations. We first define the FIM for a fixed state and then provide its expectation under the state distribution $d^{\pi}_{\mu}$. For any state $s \in S$, we define the FIM induced by $\pi_\theta(\cdot|s)$ as:
$$\mathcal{F}(\theta, s) = \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ \nabla_\theta \log \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)^T \right].$$
We can derive the following immediate result.
Lemma A.1. For a policy $\pi_\theta$ belonging to the exponential family, as in Definition 4.1, the FIM for state $s \in S$ is given by the covariance matrix of the sufficient statistic:
$$\mathcal{F}(\theta, s) = \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ t(s, a, \theta) t(s, a, \theta)^T \right] = \mathrm{Cov}_{a \sim \pi_\theta(\cdot|s)}\left[ t(s, a) \right],$$
where $t(s, a, \theta) = t(s, a) - \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ t(s, a) \right]$ is the centered sufficient statistic.
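Lemma A.1 can be checked numerically for a Boltzmann linear policy: the score covariance, with scores obtained by finite differences of $\log \pi_\theta$, matches the covariance of the sufficient statistic. This is an illustrative sketch only; the dimensions and random parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
k, q = 2, 3                       # k+1 actions, q features
theta = rng.normal(size=k * q)
phi = rng.normal(size=q)

def probs(theta):
    logits = np.append(theta.reshape(k, q) @ phi, 0.0)
    z = np.exp(logits - logits.max())
    return z / z.sum()

def log_pi(theta, i):
    return np.log(probs(theta)[i])

p = probs(theta)
# sufficient statistics t(s, a_i) = e_i kron phi (zero vector for action k+1)
T = np.array([np.kron(np.eye(k + 1, k)[i], phi) for i in range(k + 1)])
mean_t = p @ T

fisher = (T - mean_t).T @ (p[:, None] * (T - mean_t))   # Cov of t(s, a)
score_cov = np.zeros((k * q, k * q))
eps = 1e-5
for i in range(k + 1):
    # finite-difference score: grad_theta log pi_theta(a_i | s)
    g = np.array([(log_pi(theta + eps * e, i) - log_pi(theta - eps * e, i)) / (2 * eps)
                  for e in np.eye(k * q)])
    assert np.allclose(g, T[i] - mean_t, atol=1e-4)      # score = centered statistic
    score_cov += p[i] * np.outer(g, g)

assert np.allclose(score_cov, fisher, atol=1e-4)         # Lemma A.1: FIM = Cov[t]
```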
We now define the expected FIM $\mathcal{F}(\theta)$ and its corresponding estimator $\widehat{\mathcal{F}}(\theta)$ under the $\gamma$-discounted stationary distribution induced by the agent's policy $\pi^*$:
$$\mathcal{F}(\theta) = \mathbb{E}_{s \sim d^{\pi}_{\mu}}\left[ \mathcal{F}(\theta, s) \right], \qquad \widehat{\mathcal{F}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{F}(\theta, s_i).$$
Finally, we provide a sufficient condition ensuring that the FIM $\mathcal{F}(\theta)$ is non-singular in the case of Gaussian and Boltzmann linear policies.
Proposition A.1. If the second-moment matrix of the feature vector $\mathbb{E}_{s \sim d^{\pi}_{\mu}}\left[ \phi(s)\phi(s)^T \right]$ is non-singular, the identifiability condition of Lemma 4.1 is fulfilled by the Gaussian and Boltzmann linear policies for all $\theta \in \Theta$, provided that each action is played with non-zero probability in the case of the Boltzmann policy.
Proof. Let us start with the Boltzmann policy and consider the expression of $t(s, a_i, \theta)$ with $i \in \{1, \dots, k\}$:
$$t(s, a_i, \theta) = t(s, a_i) - \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ t(s, a) \right] = \left( e_i - \pi \right) \otimes \phi(s),$$
where $\pi$ is the vector $\pi = \left( \pi_\theta(a_1|s), \dots, \pi_\theta(a_k|s) \right)^T$ and we exploited the distributivity of the Kronecker product, while for $i = k+1$ we have $\left( \mathbf{0} - \pi \right) \otimes \phi(s)$. For the sake of the proof, let us define $\widetilde{e}_i = e_i$ if $i \leq k$ and $\widetilde{e}_{k+1} = \mathbf{0}$. Let us compute the FIM:
$$\mathcal{F}(\theta, s) = \sum_{i=1}^{k+1} \pi_\theta(a_i|s) \left( \widetilde{e}_i - \pi \right)\left( \widetilde{e}_i - \pi \right)^T \otimes \phi(s)\phi(s)^T = \left( \mathrm{diag}(\pi) - \pi\pi^T \right) \otimes \phi(s)\phi(s)^T,$$
where we exploited the distributivity of the Kronecker product and observed that $\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ \widetilde{e}_i \right] = \pi$ and $\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ \widetilde{e}_i \widetilde{e}_i^T \right] = \mathrm{diag}(\pi)$. Let us now consider the matrix $\mathrm{diag}(\pi) - \pi\pi^T$ and a generic row $i \in \{1, \dots, k\}$. The element on the diagonal is $\pi_\theta(a_i|s) - \pi_\theta(a_i|s)^2 = \pi_\theta(a_i|s)\left( 1 - \pi_\theta(a_i|s) \right)$, while the absolute sum of the off-diagonal elements is:
$$\pi_\theta(a_i|s) \sum_{j \in \{1, \dots, k\} \wedge j \neq i} \pi_\theta(a_j|s) = \pi_\theta(a_i|s) \left( 1 - \pi_\theta(a_i|s) - \pi_\theta(a_{k+1}|s) \right).$$
Therefore, if all actions are played with non-zero probability, i.e., $\pi_\theta(a_i|s) > 0$ for all $i \in \{1, \dots, k+1\}$, the matrix is strictly diagonally dominant by rows and thus positive definite. If $\mathbb{E}_{s \sim d^{\pi}_{\mu}}\left[ \phi(s)\phi(s)^T \right]$ is also positive definite, then, by the properties of the Kronecker product, the FIM is positive definite.
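The diagonal-dominance argument for the Boltzmann policy can be illustrated numerically (an illustrative sketch; the number of actions, the probability vector, and the second-moment matrix are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)
k = 3                                     # k+1 actions; pi keeps only the first k entries
pi_full = rng.dirichlet(np.ones(k + 1))   # all actions played with non-zero probability
pi = pi_full[:k]
M = np.diag(pi) - np.outer(pi, pi)

# strict diagonal dominance: pi_i(1 - pi_i) > pi_i(1 - pi_i - pi_{k+1}) on each row
for i in range(k):
    off = np.abs(M[i]).sum() - M[i, i]
    assert M[i, i] > off

# hence M is positive definite, and so is its Kronecker product with a PD matrix
assert np.linalg.eigvalsh(M).min() > 0
E_phi = rng.normal(size=(5, 5)); E_phi = E_phi @ E_phi.T + np.eye(5)  # a PD second-moment matrix
F = np.kron(M, E_phi)
assert np.linalg.eigvalsh(F).min() > 0
```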
Let us now focus on the Gaussian policy. Let $a \in \mathbb{R}^d$ and denote $\mu(s) = \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}[a]$:
$$t(s, a, \theta) = t(s, a) - \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ t(s, a) \right] = \Sigma^{-1}\left( a - \mu(s) \right) \otimes \phi(s).$$
Let us compute the FIM:
$$\mathcal{F}(\theta, s) = \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ t(s, a, \theta) t(s, a, \theta)^T \right] = \left( \Sigma^{-1} \mathrm{Cov}_{a \sim \pi_\theta(\cdot|s)}[a] \, \Sigma^{-1} \right) \otimes \phi(s)\phi(s)^T = \Sigma^{-1} \otimes \phi(s)\phi(s)^T.$$
If $\Sigma$ is positive definite with finite entries, then $\Sigma^{-1}$ is positive definite as well; considering additionally that $\mathbb{E}_{s \sim d^{\pi}_{\mu}}\left[ \phi(s)\phi(s)^T \right]$ is positive definite, we conclude that the FIM is positive definite.
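The closed form $\mathcal{F}(\theta, s) = \Sigma^{-1} \otimes \phi(s)\phi(s)^T$ for the Gaussian case can be validated by Monte Carlo (an illustrative sketch; covariance, state features, and sample size are arbitrary, and the tolerance is loose to absorb sampling noise).

```python
import numpy as np

rng = np.random.default_rng(4)
q = 3
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])   # fixed policy covariance
Sinv = np.linalg.inv(Sigma)
phi = rng.normal(size=q)                     # feature vector for a fixed state
mu = rng.normal(size=2)                      # policy mean theta~ phi(s) in that state

n = 200000
a = rng.multivariate_normal(mu, Sigma, size=n)
# centered sufficient statistic: Sigma^{-1}(a - mu) kron phi(s)
t = (a - mu) @ Sinv.T
T = np.einsum("ni,j->nij", t, phi).reshape(n, -1)
F_mc = T.T @ T / n                           # Monte Carlo estimate of E[t t^T]
F_exact = np.kron(Sinv, np.outer(phi, phi))  # Sigma^{-1} kron phi phi^T
assert np.allclose(F_mc, F_exact, atol=0.05, rtol=0.05)
```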
Subgaussianity Assumption From Assumption 4.1, we can prove the following result, which upper bounds the maximum eigenvalue $\lambda_{\max}$ of the Fisher Information Matrix in terms of the subgaussianity parameter $\sigma$.

Lemma A.2. Under Assumption 4.1, for any $\theta \in \Theta$ and for any $s \in S$, the maximum eigenvalue of the Fisher Information Matrix $\mathcal{F}(\theta, s)$ is upper bounded by $d\sigma^2$.
Proof. Recall that the maximum eigenvalue of a symmetric matrix $A$ can be computed as $\sup_{x : \|x\|_2 \leq 1} x^T A x$ and the norm of a vector $y$ as $\sup_{x : \|x\|_2 \leq 1} x^T y$. Consider now the derivation for a generic $x \in \mathbb{R}^d$ such that $\|x\|_2 \leq 1$:
$$x^T \mathcal{F}(\theta, s) x = \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ \left( x^T t(s, a, \theta) \right)^2 \right] \leq \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ \left\| t(s, a, \theta) \right\|_2^2 \right],$$
where we employed Lemma A.1 and bounded $\left( x^T t(s, a, \theta) \right)^2 \leq \|x\|_2^2 \left\| t(s, a, \theta) \right\|_2^2 \leq \left\| t(s, a, \theta) \right\|_2^2$ by the Cauchy-Schwarz inequality. By taking the supremum over $x \in \mathbb{R}^d$ such that $\|x\|_2 \leq 1$, we get:
$$\lambda_{\max}\left( \mathcal{F}(\theta, s) \right) = \sup_{x : \|x\|_2 \leq 1} x^T \mathcal{F}(\theta, s) x \leq \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ \left\| t(s, a, \theta) \right\|_2^2 \right].$$
By applying the first inequality in Remark 2.2 of Hsu, Kakade, and Zhang (2011) with $A = I$, we get $\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ \left\| t(s, a, \theta) \right\|_2^2 \right] \leq d\sigma^2$, which concludes the proof.

We now show that the subgaussianity assumption is satisfied by the Boltzmann and Gaussian policies, as defined in Table 1, under mild assumptions.
Proposition A.2. If the features $\phi$ are uniformly bounded in norm over the state space, i.e., $\Phi_{\max} = \sup_{s \in S} \|\phi(s)\|_2 < +\infty$, then Assumption 4.1 is fulfilled by the Boltzmann linear policy with parameter $\sigma = 2\Phi_{\max}$ and by the Gaussian linear policy with parameter $\sigma = \Phi_{\max} / \sqrt{\lambda_{\min}(\Sigma)}$.
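Combining Lemma A.2 with the Boltzmann parameter $\sigma = 2\Phi_{\max}$ above, we can numerically verify the eigenvalue bound $\lambda_{\max}(\mathcal{F}(\theta, s)) \leq d\sigma^2$ on the per-state FIM $(\mathrm{diag}(\pi) - \pi\pi^T) \otimes \phi(s)\phi(s)^T$ (an illustrative sketch with arbitrary state, features, and parameters).

```python
import numpy as np

rng = np.random.default_rng(5)
k, q = 3, 4
d = k * q                              # dimension of the sufficient statistic
theta_mat = rng.normal(size=(k, q))
phi = rng.normal(size=q)
phi_max = np.linalg.norm(phi)          # stand-in for Phi_max = sup_s ||phi(s)||_2
sigma = 2 * phi_max                    # Proposition A.2, Boltzmann case

logits = np.append(theta_mat @ phi, 0.0)
p = np.exp(logits - logits.max()); p /= p.sum()
M = np.diag(p[:k]) - np.outer(p[:k], p[:k])
F = np.kron(M, np.outer(phi, phi))     # per-state FIM of the Boltzmann linear policy
assert np.linalg.eigvalsh(F).max() <= d * sigma**2
```

The bound is loose here: $\lambda_{\max}(M) \leq 1$, so the left-hand side is at most $\|\phi\|_2^2$ while the right-hand side is $4 k q \|\phi\|_2^2$.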
Proof. Let us consider the Gaussian policy. Let $a \in \mathbb{R}^d$ and denote $\mu(s) = \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}[a]$:
$$t(s, a, \theta) = t(s, a) - \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ t(s, a) \right] = \Sigma^{-1}\left( a - \mu(s) \right) \otimes \phi(s).$$
For any $\alpha$, the scalar $\alpha^T t(s, a, \theta)$ is a linear function of the Gaussian random vector $a$, and thus it is itself a zero-mean Gaussian random variable. By completing the square inside the Gaussian integral, its moment generating function is:
$$\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ \exp\left\{ \alpha^T t(s, a, \theta) \right\} \right] = \exp\left\{ \frac{1}{2} \alpha^T \left( \Sigma^{-1} \otimes \phi(s)\phi(s)^T \right) \alpha \right\}.$$
Now, we observe that, by the properties of the Kronecker product:
$$\alpha^T \left( \Sigma^{-1} \otimes \phi(s)\phi(s)^T \right) \alpha \leq \lambda_{\max}\left( \Sigma^{-1} \right) \left\| \phi(s) \right\|_2^2 \left\| \alpha \right\|_2^2 \leq \frac{\Phi_{\max}^2}{\lambda_{\min}(\Sigma)} \left\| \alpha \right\|_2^2.$$
We get the result by setting $\sigma = \Phi_{\max} / \sqrt{\lambda_{\min}(\Sigma)}$. The derivation for the Boltzmann policy is analogous, observing that $\left\| t(s, a, \theta) \right\|_2 \leq 2\Phi_{\max}$, so that $\alpha^T t(s, a, \theta)$ is a bounded zero-mean random variable.

Furthermore, we report for completeness the standard Hoeffding concentration inequality for subgaussian random vectors.
Proposition A.3. Let $X_1, X_2, \dots, X_n$ be $n$ i.i.d. zero-mean subgaussian $d$-dimensional random vectors with parameter $\sigma \geq 0$. Then, for any $\alpha \in \mathbb{R}^d$ and $\epsilon > 0$, it holds that:
$$\Pr\left( \left| \alpha^T \frac{1}{n} \sum_{i=1}^{n} X_i \right| \geq \epsilon \right) \leq 2 \exp\left\{ -\frac{n \epsilon^2}{2 \sigma^2 \|\alpha\|_2^2} \right\}.$$
Proof. The proof is analogous to that of the Hoeffding inequality for bounded random variables: each $\alpha^T X_i$ is a zero-mean subgaussian scalar with parameter $\sigma \|\alpha\|_2$, and applying the Chernoff bound with an optimized $s \geq 0$ to $\exp\left\{ s \sum_{i=1}^{n} \alpha^T X_i \right\}$, together with a union bound over the two tails, yields the claim.
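The concentration bound of Proposition A.3 can be illustrated by simulation (an illustrative sketch; standard normal vectors, which are subgaussian with $\sigma = 1$, and all constants are our choices).

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, trials = 3, 100, 5000
sigma = 1.0                            # N(0, I) vectors are subgaussian with sigma = 1
alpha = rng.normal(size=d)
eps = 0.5

X = rng.normal(size=(trials, n, d))
means = (X @ alpha).mean(axis=1)       # alpha^T (1/n sum_i X_i), one value per trial
emp = np.mean(np.abs(means) >= eps)    # empirical deviation probability
bound = 2 * np.exp(-n * eps**2 / (2 * sigma**2 * alpha @ alpha))
assert emp <= bound
```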

C Additional Experimental Details
In this appendix, we report the full experimental results, along with the hyperparameters employed.

C.1 Discrete Grid World
Hyperparameters In the following, we report the hyperparameters used for the experiments on the discrete grid world:
• Horizon (T): 50
• Discount factor (γ): 0.98
• Learning steps with G(PO)MDP: 200
• Batch size: 250
• Max-likelihood maximum update steps: 1000
• Max-likelihood learning rate (using Adam): 0.03
• Number of configuration attempts per feature (N_conf): 3
• Environment configuration update steps: 150
• Regularization parameter of the Rényi divergence (ζ): 0.125
• Significance of the likelihood-ratio tests (δ): 0.01

Example of configuration and identification in the discrete grid world In Figure 4, we show a graphical representation of a single experiment in the grid world environment, using its configurability to better identify the policy space. The colors inside the squares indicate the probability mass function of the initial state distribution, consisting of the agent's position (blue) and the goal position (red), where sharper colors mean higher probabilities. The colored lines represent the features the agent has access to: binary features indicating whether the agent is on a certain row or column (blue lines) and whether the goal is on a certain row or column (red lines). Note that, to avoid redundancy of representation (and thus to enforce identifiability), the last row and column are not explicitly encoded, as they can be represented by the absence of all other rows and columns. When a line is no longer shown, it means that the corresponding hypothesis has been rejected, i.e., we believe the agent has access to that feature. The agent has access to every feature except for the goal columns, i.e., only its own position and the goal row are known. The images alternate between environment configuration updates and identification steps.
The environment is configured in order to maximize the influence on the gradient of the first not-yet-rejected feature, considering the blue features first and then the red ones. If a feature has not been rejected after the environment has been configured three times for it, the environment is then configured for the next one.
We can see that the general trend of this configuration is to change the parameters so as to spread the initial probability mass across a greater number of grid cells. This is an expected behavior since, with the initial configuration, an episode very often starts with the agent in the bottom-left of the grid and the goal in the bottom-right, causing the policy to depend mostly on the position of the agent. In fact, only blue column features are rejected at the first iteration, as we can see in the third image. Instead, distributing the probability across the whole grid lets an episode start with the two positions drawn almost uniformly. Eventually, the correct policy space is identified. It is interesting to observe that such a result can hardly be obtained without configuring the environment, given the initial state distribution shown in the first image.

C.2 Continuous Grid World
In this appendix, we report the experiments performed on the continuous version of the grid world. In this environment, the agent has to reach a goal region, delimited by a circle, starting from an initial position. Both the initial position and the center of the goal are sampled at the beginning of each episode from a Gaussian distribution with fixed covariance and mean $\mu_\omega$. The supervisor is allowed to change, via the parameters $\omega$, the mean of this distribution. The agent specifies, at each time step, the speed in the vertical and horizontal directions, by means of a bivariate Gaussian policy with fixed covariance, linear in a set of radial basis functions (RBF) representing both the current position of the agent and the position of the goal (a 5×5 grid of RBFs for each). The features, and consequently the parameters, that the agent can control are randomly selected at the beginning. In Figure 5, we show the results of an experiment analogous to that of the discrete grid world, comparing $\widehat{\alpha}$ and $\widehat{\beta}$ for the case in which we do not perform environment configuration (no-conf) and the case in which the configuration is performed (conf). Once again, we confirm our finding that configuring the environment speeds up the identification process by inducing the agent to change its policy and, as a consequence, reveal which parameters it can actually control.
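The feature construction described above can be sketched as follows. This is an illustrative reconstruction, not the experiment's code: the grid range, the RBF bandwidth, and the policy covariance are our assumptions.

```python
import numpy as np

def rbf_features(pos, centers, bandwidth=0.2):
    # Gaussian radial basis features of a 2D position (bandwidth is an assumption)
    sq = np.sum((centers - pos) ** 2, axis=1)
    return np.exp(-sq / (2 * bandwidth**2))

# a 5x5 grid of RBF centers over the unit square (grid range is an assumption)
g = np.linspace(0.1, 0.9, 5)
centers = np.array([(x, y) for x in g for y in g])

agent, goal = np.array([0.2, 0.3]), np.array([0.8, 0.7])
phi = np.concatenate([rbf_features(agent, centers), rbf_features(goal, centers)])
assert phi.shape == (50,)   # 25 features for the agent position, 25 for the goal

# bivariate Gaussian policy with fixed covariance, mean linear in the features
rng = np.random.default_rng(7)
theta = rng.normal(scale=0.1, size=(2, 50))
action = rng.multivariate_normal(theta @ phi, 0.05 * np.eye(2))
assert action.shape == (2,)  # vertical and horizontal speed
```

Masking a subset of the 50 parameters to zero would then model an agent that cannot perceive the corresponding RBFs.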
Hyperparameters In the following, we report the hyperparameters used for the experiments on the continuous grid world:

Example of configuration and identification in the continuous grid world In Figure 6, we show an example of model configuration in the continuous grid world environment. The two filled circles are a graphical representation of the normal distributions from which the initial position of the agent (light blue) and the position of the goal (pink) are sampled at the beginning of each episode. The circumferences correspond to the full set of candidate RBF features, among which we want to discover the ones actually accessible to the agent. Since the policy space is composed of Gaussian policies whose mean is a linear combination of these features, each feature is associated with a parameter. If a circumference is no longer shown at an iteration step, it means that the hypothesis associated with that feature was rejected, i.e., we believe that the agent has access to that feature.
The group of images is an alternating sequence of new environment configurations and parameter identifications. In the first image, we can see the initial model with no rejected features. The identification with the initial model yields the rejection of a certain set of features, which can be seen in the second image. The third image shows the new configuration of the model, in which the means of the two initial state distributions are moved in order to investigate the remaining features. Then a new test is performed, whose result is shown in the fourth image, and so on. In this experiment, the environment was configured in order to maximize the influence of one feature at a time, starting with the blue ones from bottom-left to top-right in row order, and then with the red ones in the same order. Each feature is used to configure the model at most three times, after which the next feature is considered.
The only features that were not actually in the agent's set are the red ones on the two top rows. We can see that the mean of the initial position of the agent (a configurable parameter of the environment) always tracked the first available feature yet to be tested, as expected in this experiment. In fact, when the initial position is close enough to those features, the agent often moves around those blue circumferences to reach the goal, making them more important in the definition of the optimal policy. Eventually, the tests reject all the features that are actually accessible by the agent, and only those, yielding a correct identification of the policy space. The remaining configurations are not shown, since no more features were rejected. In this experiment, similarly to the discrete grid world case, the use of Conf-MDPs was crucial to obtain this result.