Abstract
We study the problem of identifying the policy space available to an agent in a learning process, having access to a set of demonstrations generated by the agent playing the optimal policy in the considered space. We introduce an approach based on frequentist statistical testing to identify the set of policy parameters that the agent can control, within a larger parametric policy space. After presenting two identification rules (combinatorial and simplified), applicable under different assumptions on the policy space, we provide a probabilistic analysis of the simplified one in the case of linear policies belonging to the exponential family. To improve the performance of our identification rules, we make use of the recently introduced framework of the Configurable Markov Decision Processes, exploiting the opportunity of configuring the environment to induce the agent to reveal which parameters it can control. Finally, we provide an empirical evaluation, on both discrete and continuous domains, to prove the effectiveness of our identification rules.
1 Introduction
Reinforcement Learning (RL, Sutton and Barto, 2018) deals with sequential decision–making problems in which an artificial agent interacts with an environment by sensing perceptions and performing actions. The agent’s goal is to find an optimal policy, i.e., a prescription of actions that maximizes the (possibly discounted) cumulative reward collected during its interaction with the environment. The performance of an agent in an environment is constrained by its perception and its actuation possibilities, along with the ability to map observations to actions. These three elements define the policy space available to the agent in the learning process. Agents having access to different policy spaces may exhibit different optimal behaviors, even in the same environment. Therefore, the notion of optimality is necessarily connected to the space of policies the agent can access, which we will call the agent’s policy space in the following. While in tabular RL we typically assume access to the complete space of Markovian stationary policies, in continuous control, the policy space needs to be limited. In policy search methods (Deisenroth et al., 2013), the policies are explicitly modeled considering a parametric functional space (Sutton et al., 1999; Peters and Schaal, 2008) or a kernel space (Deisenroth and Rasmussen, 2011; Levine and Koltun, 2013); but even in value–based RL, a function approximator induces a set of representable (greedy) policies. It is important to point out that the notion of policy space is not just an algorithmic convenience. Indeed, the need to limit the policy space naturally emerges in many industrial applications, where some behaviors have to be avoided for safety reasons.
The knowledge of the agent’s policy space might be useful in some subfields of RL. Recently, the framework of Configurable Markov Decision Process (Conf-MDP, Metelli et al., 2018a) has been introduced to account for the scenarios in which it is possible to configure some environmental parameters. Intuitively, the best environment configuration is intimately related to the agent’s possibilities in terms of policy space. When the configuration activity is performed by an external supervisor, it might be helpful to know which parameters the agent can control in order to select the most appropriate configuration. Furthermore, in the field of Imitation Learning (IL, Osa et al., 2018), figuring out the policy space of the expert agent can aid the learning process of the imitating policy, mitigating overfitting/underfitting phenomena.
In this paper, motivated by the examples presented above, we study the problem of identifying the agent’s policy space in a Conf–MDP,Footnote 1 by observing the agent’s behavior and, possibly, exploiting the configuration opportunities of the environment. We consider the case where the agent’s policy space is a subset of a known super–policy space \(\varPi _{\varTheta }\) induced by a parameter space \(\varTheta \subseteq {\mathbb {R}}^d\). Thus, any policy \(\pi _{\varvec{{\theta }}}\) is determined by a d–dimensional parameter vector \(\varvec{{\theta }} \in \varTheta\). However, the agent has control over a smaller number \(d^{\text {Ag}}< d\) of parameters (which are unknown), while the remaining ones have a fixed value, namely zero.Footnote 2 The choice of zero as a fixed value might appear arbitrary, but it is rather a common case in practice. Indeed, the formulation based on the identification of the parameters effectively covers the limitations of the policy space related to perception, actuation, and mapping. For instance, in a linear policy, the fact that the agent does not observe a state feature is equivalent to setting the corresponding parameters to zero. Similarly, in a neural network, removing a neuron is equivalent to neglecting all of its connections, which in turn can be realized by setting the corresponding weights to zero. Figure 1 shows three examples of policy space limitations in the case of a 1–hidden layer neural network policy, which can be realized by setting the appropriate weights to zero.
Our goal is to identify the parameters that the agent can control (and possibly change) by observing some demonstrations of the optimal policy \(\pi ^{\text {Ag}}\) in the policy space \(\varPi _\varTheta\).Footnote 3 To this end, we formulate the problem as deciding whether each parameter \(\theta _i\) for \(i \in \{1,...,d\}\) is zero, and we address it by means of a frequentist statistical test. In other words, we check whether there is a statistically significant difference between the likelihood of the agent’s behavior with the full set of parameters and the one in which \(\theta _i\) is set to zero. In such a case, we conclude that \(\theta _i\) is not zero and, consequently, the agent can control it. On the contrary, either the agent cannot control the parameter, or zero is the value consciously chosen by the agent.
Indeed, there could be parameters that, given the peculiarities of the environment, are useless for achieving the optimal behavior or whose optimal value is actually zero, while they could prove essential in a different environment. For instance, in a grid world where the goal is to reach the right edge, the vertical position of the agent is useless, while if the goal is to reach the upper right corner, both horizontal and vertical positions become relevant. In this spirit, configuring the environment can help the supervisor in identifying whether a parameter set to zero is actually uncontrollable by the agent or just useless in the current environment. Thus, the supervisor can change the environment configuration \(\varvec{{\omega }} \in \varOmega\), so that the agent will adjust its policy, possibly by changing the parameter value and revealing whether it can control such a parameter. Consequently, the new configuration should induce an optimal policy in which the considered parameters have a value significantly different from zero. We formalize this notion as the problem of finding the new environment configuration that maximizes the power of the statistical test and we propose a surrogate objective for this purpose.
The paper is organized as follows. In Sect. 2, we introduce the necessary background. The identification rules (combinatorial and simplified) to perform parameter identification in a fixed environment are presented in Sect. 3, and the simplified one is analyzed in Sect. 4. Sect. 5 shows how to improve them by exploiting the environment configurability. The experimental evaluation, on discrete and continuous domains, is provided in Sect. 6. Besides studying the ability of our identification rules to identify the agent’s policy space, we apply them to the IL and Conf-MDP frameworks. The proofs not reported in the main paper can be found in Appendix A.
2 Preliminaries
In this section, we report the essential background that will be used in the subsequent sections. For a given set \({\mathcal {X}}\), we denote with \({\mathscr {P}}({\mathcal {X}})\) the set of probability distributions over \({\mathcal {X}}\).
(Configurable) Markov Decision Processes A discrete–time Markov Decision Process (MDP, Puterman, 2014) is defined by the tuple \({\mathcal {M}} = \left( {\mathcal {S}}, {\mathcal {A}}, p, \mu, r, \gamma \right)\), where \({\mathcal {S}}\) and \({\mathcal {A}}\) are the state space and the action space respectively, \(p: {\mathcal {S}} \times {\mathcal {A}} \rightarrow {\mathscr {P}}({\mathcal {S}})\) is the transition model that provides, for every state-action pair \((s,a) \in \mathcal {S} \times \mathcal {A}\), a probability distribution over the next state \(p(\cdot |s,a)\), \(\mu \in {\mathscr {P}}({\mathcal {S}})\) is the distribution of the initial state, \(r: {\mathcal {S}} \times {\mathcal {A}} \rightarrow {\mathbb {R}}\) is the reward model, defining the reward \(r(s,a)\) collected by the agent when performing action \(a\in {\mathcal {A}}\) in state \(s\in {\mathcal {S}}\), and \(\gamma \in [0,1]\) is the discount factor. The behavior of an agent is defined by means of a policy \(\pi : {\mathcal {S}} \rightarrow {\mathscr {P}}({\mathcal {A}})\) that provides a probability distribution over the actions \(\pi (\cdot |s)\) for every state \(s \in {\mathcal {S}}\). We limit the scope to parametric policy spaces \(\varPi _\varTheta = \left\{ \pi _{\varvec{{\theta }}} : \varvec{{\theta }} \in \varTheta \right\}\), where \(\varTheta \subseteq {\mathbb {R}}^d\) is the parameter space. The goal of the agent is to find an optimal policy within \(\varPi _\varTheta\), i.e., any policy parametrization that maximizes the expected return:
\(J(\varvec{{\theta }}) = \mathop {{{\,{{\mathbb {E}}}\,}}}\limits _{s_0 \sim \mu ,\, a_t \sim \pi _{\varvec{{\theta }}}(\cdot |s_t),\, s_{t+1} \sim p(\cdot |s_t,a_t)} \left[ \sum _{t=0}^{+\infty } \gamma ^t r(s_t,a_t) \right].\)
In this paper, we consider a slightly modified version of the Conf–MDPs (Metelli et al., 2018a).
Definition 1
A Configurable Markov Decision Process (Conf–MDP) induced by the configuration space \(\varOmega \subseteq {\mathbb {R}}^p\) is defined as the set of MDPs:
\(\left\{ {\mathcal {M}}_{\varvec{{\omega }}} = \left( {\mathcal {S}}, {\mathcal {A}}, p_{\varvec{{\omega }}}, \mu _{\varvec{{\omega }}}, r, \gamma \right) \,:\, \varvec{{\omega }} \in \varOmega \right\} .\)
The main differences w.r.t. the original definition are: i) we allow the configuration of the initial state distribution \(\mu _{\varvec{{\omega }}}\), in addition to the transition model \(p_{\varvec{{\omega }}}\); ii) we restrict to the case of parametric configuration spaces \(\varOmega\); iii) we do not consider the policy space \(\varPi _\varTheta\) as a part of the Conf–MDP.
Generalized Likelihood Ratio Test The Generalized Likelihood Ratio test (GLR, Barnard, 1959; Casella and Berger, 2002) aims at testing the goodness of fit of two statistical models. Given a parametric model having density function \(p(\cdot |{\varvec{{\theta }}})\) with \(\varvec{{\theta }} \in \varTheta\), we aim at testing the null hypothesis \({\mathcal {H}}_0 : \varvec{{\theta }}^\text {Ag} \in \varTheta _0\), where \(\varTheta _0 \subset \varTheta\) is a subset of the parametric space, against the alternative \({\mathcal {H}}_1 : \varvec{{\theta }}^\text {Ag} \in \varTheta \setminus \varTheta _0\). Given a dataset \({\mathcal {D}} = \left\{ X_i \right\} _{i=1}^n\) sampled independently from \(p(\cdot |{\varvec{{\theta }}^\text {Ag}})\), where \(\varvec{{\theta }}^\text {Ag}\) is the true parameter, the GLR statistic is:
\(\varLambda = \frac{\sup _{\varvec{{\theta }} \in \varTheta _0} p({\mathcal {D}}|\varvec{{\theta }})}{\sup _{\varvec{{\theta }} \in \varTheta } p({\mathcal {D}}|\varvec{{\theta }})},\)
where \(p({\mathcal {D}}|\varvec{{\theta }}) = \widehat{{\mathcal {L}}}(\varvec{{\theta }}) = \prod _{i=1}^n p(X_i|{\varvec{{\theta }}})\) is the likelihood function. We denote with \({\widehat{\ell }}(\varvec{{\theta }}) = -\log \widehat{{\mathcal {L}}}(\varvec{{\theta }})\) the negative log–likelihood function, \(\widehat{\varvec{{\theta }}} \in \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})\) and \(\widehat{\varvec{{\theta }}}_0 \in \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _0} \widehat{{\mathcal {L}}}(\varvec{{\theta }})\), i.e., the maximum likelihood solutions in \(\varTheta\) and \(\varTheta _0\) respectively. Moreover, we define the expectation of the negative log–likelihood under the true parameter: \(\ell (\varvec{{\theta }}) = \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{X_i \sim p(\cdot |{\varvec{{\theta }}^\text {Ag}})} [{\widehat{\ell }}(\varvec{{\theta }})]\). As the maximization is carried out employing the same dataset \({\mathcal {D}}\) and recalling that \(\varTheta _0 \subset \varTheta\), we have that \(\varLambda \in [0,1]\). It is usually convenient to consider the logarithm of the GLR statistic: \(\lambda = -2 \log \varLambda = 2 ({\widehat{\ell }}(\widehat{\varvec{{\theta }}}_0) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}}) )\). Therefore, \({\mathcal {H}}_0\) is rejected for large values of \(\lambda\), i.e., when the maximum likelihood parameter searched in the restricted set \(\varTheta _0\) significantly underfits the data \({\mathcal {D}}\), compared to \(\varTheta\). Wilks’ theorem provides the asymptotic distribution of \(\lambda\) when \({\mathcal {H}}_0\) is true (Wilks, 1938; Casella and Berger, 2002).
Theorem 1
(Casella and Berger, (2002), Theorem 10.3.3) Let \(d = \mathrm {dim}(\varTheta )\) and \(d_0 = \mathrm {dim}(\varTheta _0) < d\). Under suitable regularity conditions (see Casella and Berger, (2002) Section 10.6.2), if \({\mathcal {H}}_0\) is true, then when \(n \rightarrow +\infty\), the distribution of \(\lambda\) tends to a \(\chi ^2\) distribution with \(d-d_0\) degrees of freedom.
The significance of a test \(\alpha \in [0,1]\), or type I error probability, is the probability of rejecting \({\mathcal {H}}_0\) when \({\mathcal {H}}_0\) is true, while the power of a test \(1-\beta \in [0,1]\) is the probability of rejecting \({\mathcal {H}}_0\) when \({\mathcal {H}}_0\) is false; \(\beta\) is the type II error probability.
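To make the machinery concrete, the following is a minimal sketch (not taken from the paper) of a GLR test on a toy one-dimensional Gaussian model with known unit variance, testing \({\mathcal {H}}_0: \theta = 0\) against \({\mathcal {H}}_1: \theta \ne 0\); the function and variable names are illustrative choices.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
data = rng.normal(loc=0.3, scale=1.0, size=200)   # samples from p(.|theta_Ag), here theta_Ag = 0.3

def neg_log_lik(theta, x):
    # negative log-likelihood of N(theta, 1), up to additive constants
    return 0.5 * np.sum((x - theta) ** 2)

theta_hat = data.mean()     # maximum likelihood over Theta = R
theta_hat_0 = 0.0           # maximum likelihood over the restricted set Theta_0 = {0}

# lambda = -2 log Lambda = 2 * (l_hat(theta_hat_0) - l_hat(theta_hat))
lam = 2.0 * (neg_log_lik(theta_hat_0, data) - neg_log_lik(theta_hat, data))
threshold = chi2.ppf(1 - 0.05, df=1)   # Wilks: lambda ~ chi2 with d - d0 = 1 dof under H0
print(f"lambda = {lam:.2f}, reject H0: {lam > threshold}")
```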
3 Policy space identification in a fixed environment
As we introduced in Sect. 1, we aim at identifying the agent’s policy space by observing a set of demonstrations coming from the optimal policy of the agent. We assume that the agent is playing a policy \(\pi ^{\text {Ag}}\) belonging to a parametric policy space \(\varPi _{\varTheta }\).
Assumption 1
(Parametric Agent’s Policy) The agent’s policy \(\pi ^{\text {Ag}}\) belongs to a known parametric policy space \(\varPi _{\varTheta }\), i.e., there exists a (maybe not unique) \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\) such that \(\pi _{\varvec{{\theta }}^{\text {Ag}}}(\cdot |s) = \pi ^{\text {Ag}}(\cdot |s)\) almost surely for all \(s \in {\mathcal {S}}\).
It is important to stress that \(\pi ^{\text {Ag}}\) is one of the possibly many optimal policies within the policy space \(\varPi _{\varTheta }\), which, in turn, might be unable to represent the optimal Markovian stationary policy. Furthermore, we do not explicitly report the dependence on the agent’s parameter \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\) as, in the general case, there might exist multiple parameters yielding the same policy \(\pi ^{\text {Ag}}\).
We have access to a dataset \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\) where \(s_i \sim \nu\) and \(a_i \sim \pi ^{\text {Ag}}(\cdot |s_i)\) sampled independently.Footnote 4 Here, \(\nu\) is a sampling distribution over the states. Although we will present the method for a generic \(\nu \in {\mathscr {P}}({\mathcal {S}})\), in practice, we employ as \(\nu\) the \(\gamma\)–discounted stationary distribution induced by \(\pi ^{\text {Ag}}\), i.e., \(d_{\mu }^{\pi ^{\text {Ag}}}(s) = (1-\gamma ) \sum _{t=0}^{+\infty } \gamma ^t \Pr (s_t = s | {\mathcal {M}}, \pi ^{\text {Ag}})\) (Sutton et al., 1999). We assume that the agent has control over a limited number of parameters \(d^{\text {Ag}}< d\) whose value can be changed during learning, while the remaining \(d-d^{\text {Ag}}\) are kept fixed to zero.Footnote 5 Given a set of indexes \(I \subseteq \{1,...,d\}\) we define the subset of the parameter space: \(\varTheta _I = \left\{ \varvec{{\theta }} \in \varTheta : \theta _i = 0,\, \forall i \in \{1,...,d\}\setminus I \right\}\). Thus, the set I represents the indexes of the parameters that can be changed if the agent’s parameter space were \(\varTheta _I\). Our goal is to find a set of parameter indexes \(I^{\text {Ag}}\) that is sufficient to explain the agent’s policy, i.e., \(\pi ^{\text {Ag}}\in \varPi _{\varTheta _{I^{\text {Ag}}}}\), but also necessary, in the sense that when removing any \(i \in I^{\text {Ag}}\) the remaining ones are insufficient to explain the agent’s policy, i.e., \(\pi ^{\text {Ag}}\notin \varPi _{\varTheta _{I^{\text {Ag}}\setminus \{i\}}}\). We formalize these notions in the following definition.
Definition 2
(Correctness) Let \(\pi ^{\text {Ag}}\in \varPi _{\varTheta }\). A set of parameter indexes \(I^{\text {Ag}}\subseteq \{1,...,d\}\) is correct w.r.t. \(\pi ^{\text {Ag}}\) if:
\(\pi ^{\text {Ag}}\in \varPi _{\varTheta _{I^{\text {Ag}}}} \quad \text {(sufficiency)} \qquad \text {and} \qquad \forall i \in I^{\text {Ag}}: \; \pi ^{\text {Ag}}\notin \varPi _{\varTheta _{I^{\text {Ag}}\setminus \{i\}}} \quad \text {(necessity)}.\)
We denote with \({\mathcal {I}}^{\text {Ag}}\) the set of all correct sets of parameter indexes \(I^{\text {Ag}}\).
Thus, there exist multiple \(I^{\text {Ag}}\) when multiple parametric representations of the agent’s policy \(\pi ^{\text {Ag}}\) are possible. The uniqueness of \(I^{\text {Ag}}\) is guaranteed under the assumption that each policy admits a unique representation in \(\varPi _\varTheta\), i.e., under the identifiability assumption.
Assumption 2
(Identifiability) The policy space \(\varPi _{\varTheta }\) is identifiable, i.e., for all \(\varvec{{\theta }},\varvec{{\theta }}' \in \varTheta\), we have that if \(\pi _{\varvec{{\theta }}}(\cdot |s) = \pi _{\varvec{{\theta }}'}(\cdot |s) \; \text {almost surely}\) for all \(s \in {\mathcal {S}}\) then \(\varvec{{\theta }} = \varvec{{\theta }}'\).
The identifiability property allows rephrasing Definition 2 in terms of the policy parameters only, leading to the following result.
Lemma 1
(Correctness under Identifiability) Under Assumption 2, let \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\) be the unique parameter such that \(\pi _{\varvec{{\theta }}^{\text {Ag}}}(\cdot |s) = \pi ^{\text {Ag}}(\cdot |s)\) almost surely for all \(s \in {\mathcal {S}}\). Then, there exists a unique set of parameter indexes \(I^{\text {Ag}}\subseteq \{1,...,d\}\) that is correct w.r.t. \(\pi ^{\text {Ag}}\) defined as:
\(I^{\text {Ag}}= \left\{ i \in \{1,...,d\} \,:\, \theta ^{\text {Ag}}_i \ne 0 \right\} .\)
Consequently, \({\mathcal {I}}^{\text {Ag}}= \{ I^{\text {Ag}}\}\).
Proof
The uniqueness of \(I^\text {Ag}\) is ensured by Assumption 2. Let us rewrite the condition of Definition 2 under Assumption 2:
where line (P.1) follows since there is a unique representation for \(\pi ^\text {Ag}\) determined by parameter \(\varvec{{\theta }}^\text {Ag}\) and line (P.2) is obtained from the definition of \(\varTheta _I\). \(\square\)
Remark 1
(About the Optimality of \(\pi ^\text {Ag}\)) We started this section by stating that \(\pi ^{\text {Ag}}\) is an optimal policy within the policy space \(\varPi _{\varTheta }\). This is motivated by the fact that typically we start with an overparametrized policy space \(\varPi _{\varTheta }\) and we seek the minimal set of parameters that allows the agent to reach an optimal policy within \(\varPi _{\varTheta }\). However, in practice, we usually have access to an \(\epsilon\)-optimal policy \(\pi ^{\text {Ag}}_{\epsilon }\), meaning that the performance of \(\pi ^{\text {Ag}}_{\epsilon }\) is \(\epsilon\)-close to the optimal performance.Footnote 6 Nevertheless, the notion of correctness (Definition 2) makes no assumptions on the optimality of \(\pi ^\text {Ag}\). If we replace \(\pi ^\text {Ag}\) with \(\pi ^{\text {Ag}}_{\epsilon }\) we will recover a set of parameter indexes \(I^{\text {Ag}}_{\epsilon }\) that is, in general, different from \(I^{\text {Ag}}\), but we can still provide some guarantees. If \(I^{\text {Ag}} \subseteq I^{\text {Ag}}_{\epsilon }\), then \(I^{\text {Ag}}_{\epsilon }\) is sufficient to explain the optimal policy \(\pi ^{\text {Ag}}\), but not necessary in general (it might contain useless parameters for \(\pi ^{\text {Ag}}\)). Instead, if \(I^{\text {Ag}} \not \subseteq I^{\text {Ag}}_{\epsilon }\), then \(I^{\text {Ag}}_{\epsilon }\) is not sufficient to explain the optimal policy \(\pi ^{\text {Ag}}\). In any case, \(I^{\text {Ag}}_{\epsilon }\) is necessary and sufficient to represent, at least, an \(\epsilon\)-optimal policy.
The following two subsections are devoted to the presentation of the identification rules based on the application of Definition 2 (Sect. 3.1) and Lemma 1 (Sect. 3.2) when we only have access to a dataset of samples \({\mathcal {D}}\). The goal of an identification rule consists in producing a set \(\widehat{{\mathcal {I}}}\), approximating \({\mathcal {I}}^{\text {Ag}}\). The idea at the basis of our identification rules consists in employing the GLR test to assess the correctness (Definition 2 or Lemma 1) of a candidate set of indexes.
3.1 Combinatorial identification rule
In principle, using \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\), we could compute the maximum likelihood parameter \(\widehat{\varvec{{\theta }}}\in \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})\) and employ it with Definition 2. However, this approach has, at least, two drawbacks. First, when Assumption 2 is not fulfilled, it would produce a single approximate parameter, while multiple choices might be viable. Second, because of the estimation errors, we would hardly get a zero value for the parameters the agent might not control. For these reasons, we employ a GLR test to assess whether a specific set of parameters is zero. Specifically, for all \(I \subseteq \{1,...,d\}\) we consider the pair of hypotheses \({\mathcal {H}}_{0,I} \,:\, \pi ^{\text {Ag}}\in \varPi _{\varTheta _I}\) against \({\mathcal {H}}_{1,I} \,:\, \pi ^{\text {Ag}}\in \varPi _{\varTheta \setminus \varTheta _I}\) and the GLR statistic:
\(\lambda _{I} = -2 \log \frac{\sup _{\varvec{{\theta }} \in \varTheta _I} \widehat{{\mathcal {L}}}(\varvec{{\theta }})}{\sup _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})} = 2 \left( {\widehat{\ell }}(\widehat{\varvec{{\theta }}}_I) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}}) \right) ,\)
where the likelihood is defined as \(\widehat{{\mathcal {L}}}(\varvec{{\theta }}) = \prod _{i=1}^n \pi _{\varvec{{\theta }}}(a_i|s_i)\), \(\widehat{\varvec{{\theta }}}_I \in \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _I} \widehat{{\mathcal {L}}}(\varvec{{\theta }})\) and \(\widehat{\varvec{{\theta }}} \in \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})\). We are now ready to state the identification rule derived from Definition 2.
Identification Rule 1
The combinatorial identification rule with threshold function \(c_l\) selects \(\widehat{{\mathcal {I}}}_c\) containing all and only the sets of parameter indexes \({I} \subseteq \{1,...,d\}\) such that:
Thus, I is defined in such a way that the null hypothesis \({\mathcal {H}}_{0,{I}}\) is not rejected, i.e., I contains parameters that are sufficient to explain the data \({\mathcal {D}}\), and necessary since for all \(i \in {I}\) the set \({I} \setminus \{i\}\) is no longer sufficient, as \({\mathcal {H}}_{0,{I} \setminus \{i\}}\) is rejected. The threshold function \(c_l\), which depends on the cardinality l of the tested set of indexes, controls the behavior of the tests. In practice, we recommend setting it by exploiting Wilks’ asymptotic approximation (Theorem 1) to enforce (asymptotic) guarantees on the type I error. Given a significance level \(\delta \in [0,1]\), since for Identification Rule 1 we perform \(2^d\) statistical tests by using the same dataset \({\mathcal {D}}\), we partition \(\delta\) using Bonferroni correction and set \(c_l = \chi ^2_{l,1-{\delta }/{2^d}}\), where \(\chi ^2_{l,\bullet }\) is the \(\bullet\)–quantile of a chi-squared distribution with l degrees of freedom. Refer to Algorithm 1 for the pseudocode of the identification procedure.
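A possible implementation of this rule is sketched below (it is not the paper's Algorithm 1). The helper `constrained_mle_nll(data, zero_idx)`, which should return the minimized negative log-likelihood with the parameters indexed by `zero_idx` clamped to zero, and the threshold function `c`, mapping a tested index set to its threshold, are hypothetical placeholders to be supplied by the user (e.g., with the Bonferroni-corrected chi-squared quantiles described above).

```python
from itertools import chain, combinations

def glr_stat(data, I, d, constrained_mle_nll):
    # lambda_I = 2 * (min over Theta_I of the negative log-likelihood - unrestricted minimum)
    zero_idx = [i for i in range(d) if i not in I]
    return 2.0 * (constrained_mle_nll(data, zero_idx) - constrained_mle_nll(data, []))

def combinatorial_rule(data, d, constrained_mle_nll, c):
    all_subsets = chain.from_iterable(combinations(range(d), l) for l in range(d + 1))
    selected = []
    for I in map(set, all_subsets):
        sufficient = glr_stat(data, I, d, constrained_mle_nll) <= c(I)      # H_{0,I} not rejected
        necessary = all(
            glr_stat(data, I - {i}, d, constrained_mle_nll) > c(I - {i})    # H_{0,I\{i}} rejected
            for i in I
        )
        if sufficient and necessary:
            selected.append(I)
    return selected   # an estimate of the family of correct index sets
```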

3.2 Simplified identification rule
Identification Rule 1 is hard to employ in practice, as it requires performing \({\mathcal {O}}( 2^d )\) statistical tests. However, under Assumption 2, to retrieve \(I^{\text {Ag}}\) we do not need to test all subsets, but we can just examine one parameter at a time (see Lemma 1). Thus, for all \(i \in \{1,...,d\}\) we consider the pair of hypotheses \({\mathcal {H}}_{0,i} \,:\, {\theta }^{\text {Ag}}_i = 0\) against \({\mathcal {H}}_{1,i} \,:\, {\theta }^{\text {Ag}}_i \ne 0\) and define \(\varTheta _i = \{ \varvec{{\theta }} \in \varTheta \,:\, \theta _i = 0\}\). The GLR test can be performed straightforwardly, using the statistic:
\(\lambda _{i} = -2 \log \frac{\sup _{\varvec{{\theta }} \in \varTheta _i} \widehat{{\mathcal {L}}}(\varvec{{\theta }})}{\sup _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})} = 2 \left( {\widehat{\ell }}(\widehat{\varvec{{\theta }}}_i) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}}) \right) ,\)
where the likelihood is defined as \(\widehat{{\mathcal {L}}}(\varvec{{\theta }}) = \prod _{i=1}^n \pi _{\varvec{{\theta }}}(a_i|s_i)\), \(\widehat{\varvec{{\theta }}}_i = \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _i} \widehat{{\mathcal {L}}}(\varvec{{\theta }})\) and \(\widehat{\varvec{{\theta }}} = \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})\).Footnote 7 In the spirit of Lemma 1, we define the following identification rule.
Identification Rule 2
The simplified identification rule with threshold function \(c_1\) selects \(\widehat{{\mathcal {I}}}_c\) containing the unique set of parameter indexes \({\widehat{I}}_{c}\) such that:
\({\widehat{I}}_{c} = \left\{ i \in \{1,...,d\} \,:\, \lambda _i > c_1 \right\} .\)
Therefore, the identification rule constructs \({\widehat{I}}_c\) by taking all the indexes \(i \in \{1,...,d\}\) such that the corresponding null hypothesis \({\mathcal {H}}_{0,i} \,:\, {\theta }^{\text {Ag}}_i = 0\) is rejected, i.e., those for which there is statistical evidence that their value is not zero. Similarly to the combinatorial identification rule, we recommend setting the threshold function \(c_1\) based on Wilks’ approximation. Given a significance level \(\delta \in [0,1]\), since we perform d statistical tests, we employ Bonferroni correction and we set \(c_1 = \chi ^2_{1,1-{\delta }/{d}}\). Refer to Algorithm 2 for the pseudocode of the identification rule.
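Below is a corresponding sketch of the simplified rule (again, not the paper's Algorithm 2), reusing the same hypothetical `constrained_mle_nll` helper and instantiating the threshold with the Bonferroni-corrected Wilks quantile.

```python
from scipy.stats import chi2

def simplified_rule(data, d, constrained_mle_nll, delta=0.05):
    c1 = chi2.ppf(1 - delta / d, df=1)            # c_1 = chi-squared quantile at level 1 - delta/d
    nll_full = constrained_mle_nll(data, [])      # unrestricted maximum likelihood fit
    selected = set()
    for i in range(d):
        lam_i = 2.0 * (constrained_mle_nll(data, [i]) - nll_full)   # GLR statistic lambda_i
        if lam_i > c1:                            # H_{0,i}: theta_i = 0 is rejected
            selected.add(i)                       # evidence that the agent controls theta_i
    return selected
```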

This second procedure requires a test for every parameter, i.e., \({\mathcal {O}}(d)\) instead of \({\mathcal {O}}(2^d)\) tests. However, the correctness of Identification Rule 2, in the sense of Definition 2, comes with the cost of assuming the identifiability property (Assumption 2). What happens if we employ this second procedure in a case where the assumption does not hold? Consider, for instance, the case in which two parameters \(\theta _1\) and \(\theta _2\) are exchangeable: we would include neither of them in \({\widehat{I}}_{c}\) since, individually, they are not necessary to explain the agent’s policy, while the pair \((\theta _1,\theta _2)^T\) is indeed necessary. We will discuss how to enforce Assumption 2, for the case of policies belonging to the exponential family, in the following section.
Remark 2
(On Frequentist and Bayesian Statistical Tests) In this paper, we restrict our attention to frequentist statistical tests, but, in principle, the same approaches can be extended to the Bayesian setting (Jeffreys, 1935). Indeed, the GLR test admits a Bayesian counterpart, known as the Bayes Factor (BF, Goodman, 1999; Morey et al., 2016). We consider the same setting presented in Sect. 2 in which we aim at testing the null hypothesis \({\mathcal {H}}_0 : \varvec{{\theta }}^\text {Ag} \in \varTheta _0\), against the alternative \({\mathcal {H}}_1 : \varvec{{\theta }}^\text {Ag} \in \varTheta \setminus \varTheta _0\). We take the Bayesian perspective, looking at each \(\varvec{{\theta }}\) not as an unknown fixed quantity but as a realization of prior distributions on the parameters defined in terms of the hypothesis: \(p(\varvec{{\theta }} | {\mathcal {H}}_{\star })\) for \(\star \in \{0,1\}\). Thus, given a dataset \({\mathcal {D}} = \left\{ X_i \right\} _{i=1}^n\), we can compute the likelihood of \({\mathcal {D}}\) given a parameter \(\varvec{{\theta }}\) as usual: \(p({\mathcal {D}}|\varvec{{\theta }}) = \prod _{i=1}^n p(X_i|\varvec{{\theta }})\). Combining the likelihood and the prior, we define the Bayes Factor as:

The Bayesian approach has the clear advantage of incorporating additional domain knowledge by means of the prior. Furthermore, if also a prior on the hypothesis is available \(p({\mathcal {H}}_{\star })\) for \(\star \in \{0,1\}\) it is possible to compute the ratio of the posterior probability of each hypothesis:

Compared to the GLR test, the Bayes factor provides richer information, since we can compute the likelihood of each hypothesis, given the data \({\mathcal {D}}\). However, like any Bayesian approach, the choice of the prior turns out to be of crucial importance. The computationally convenient prior (which might allow computing the integral in closed form) is typically not correct, leading to a biased test. In this sense, GLR replaces the integral with a single-point approximation centered in the maximum likelihood estimate. For these reasons, we leave the investigation of Bayesian approaches for policy space identification as future work.
4 Analysis for the exponential family
In this section, we provide an analysis of the Identification Rule 2 for a policy \(\pi _{\varvec{{\theta }}}\) linear in some state features \(\varvec{{\phi }}\) that belongs to the exponential family.Footnote 8 The section is organized as follows. We first introduce the exponential family, deriving a concentration result of independent interest (Theorem 2) and then we apply it for controlling the identification errors made by our identification rule (Theorem 3).
Exponential Family We refer to the definition of linear exponential family given in (Brown, 1986), that we state as an assumption.
Assumption 3
(Exponential Family of Linear Policies) Let \(\varvec{{\phi }}: {\mathcal {S}} \rightarrow {\mathbb {R}}^q\) be a feature function. The policy space \(\varPi _{\varTheta }\) is a space of linear policies, belonging to the exponential family, i.e., \(\varTheta = {\mathbb {R}}^d\) and all policies \(\pi _{\varvec{{\theta }}} \in \varPi _{\varTheta }\) have the form:
\(\pi _{\varvec{{\theta }}}(a|s) = h(a) \exp \left\{ \varvec{{\theta }}^T \varvec{{t}}(s,a) - A(\varvec{{\theta }},s) \right\} ,\)
where h is a positive function, \(\varvec{{t}}\left( s,a\right)\) is the sufficient statistic that depends on the state via the feature function \(\varvec{{\phi }}\) (i.e., \(\varvec{{t}}\left( s,a\right) =\varvec{{t}}(\varvec{{\phi }}(s),a)\)) and \(A(\varvec{{\theta }},s) = \log \int _{{\mathcal {A}}} h(a) \exp \{ \varvec{{\theta }}^T \varvec{{t}}(s,a) \}\mathrm {d} a\) is the log partition function. We denote with \(\varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }}) = \varvec{{t}}(s,a) - \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{{\overline{a}} \sim \pi _{\varvec{{\theta }}} (\cdot |s)} \left[ \varvec{{t}}(s,{\overline{a}}) \right]\) the centered sufficient statistic.
This definition allows modeling the linear policies that are a popular choice in linear time-invariant systems and a valid option for robotic control (Deisenroth et al., 2013), sometimes even competitive with complex neural network parametrizations (Rajeswaran et al., 2017). Table 1 shows how to map the Gaussian linear policy with fixed covariance, typically used in continuous action spaces, and the Boltzmann linear policy, suitable for finite action spaces, to Assumption 3 (details in Appendix A.1).
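As an illustration, the following sketch writes a Boltzmann linear policy in the form of Assumption 3 and fits it by maximum likelihood; the reference-action parametrization (last action with zero weights), the toy data, and all names are our own choices, not the paper's code.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import minimize

def boltzmann_nll(theta, states, actions):
    """Negative log-likelihood of a Boltzmann linear policy.
    theta: (n_actions - 1, q) weights, the last action acting as zero-weight reference
    (which keeps the parametrization identifiable); states: (n, q); actions: (n,) ints."""
    logits = np.hstack([states @ theta.T, np.zeros((len(states), 1))])
    log_probs = logits - logsumexp(logits, axis=1, keepdims=True)
    return -log_probs[np.arange(len(actions)), actions].sum()

# toy dataset: q = 3 features, 2 actions, only the first parameter is effectively used
rng = np.random.default_rng(0)
phi = rng.normal(size=(500, 3))
true_theta = np.array([[1.5, 0.0, 0.0]])
logits = np.hstack([phi @ true_theta.T, np.zeros((500, 1))])
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
acts = np.array([rng.choice(2, p=pi) for pi in p])

res = minimize(lambda t: boltzmann_nll(t.reshape(1, 3), phi, acts), np.zeros(3))
print(res.x)   # approximately [1.5, 0, 0]: the maximum likelihood parameter
```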
For the sake of the analysis, we enforce the following assumption concerning the tail behavior of the policy \(\pi _{\varvec{{\theta }}}\).
Assumption 4
(Subgaussianity) For any \(\varvec{{\theta }} \in \varTheta\) and for any \(s \in {\mathcal {S}}\) the centered sufficient statistic \(\varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }})\) is subgaussian with parameter \(\sigma \ge 0\), i.e., for any \(\varvec{{\alpha }} \in {\mathbb {R}}^d\):
\(\mathop {{{\,{{\mathbb {E}}}\,}}}\limits _{a \sim \pi _{\varvec{{\theta }}}(\cdot |s)} \left[ \exp \left\{ \varvec{{\alpha }}^T \varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }}) \right\} \right] \le \exp \left\{ \frac{\sigma ^2 \left\| \varvec{{\alpha }} \right\| _2^2}{2} \right\} .\)
A sufficient condition to ensure that Gaussian and Boltzmann are subgaussian is that the features \(\varvec{{\phi }}(s)\) are bounded in \(L_2\)-norm, uniformly over the state space \({\mathcal {S}}\) (Proposition 2). Furthermore, limited to the policies complying with Assumption 3, the identifiability (Assumption 2) can be restated in terms of the Fisher Information matrix (Rothenberg et al., 1971; Little et al., 2010).
Lemma 2
(Rothenberg et al., (1971), Theorem 3) Let \(\varPi _\varTheta\) be a policy space, as in Assumption 3. Then, under suitable regularity conditions (see Rothenberg et al., (1971)), if the Fisher Information matrix (FIM) \({\mathcal {F}}(\varvec{{\theta }})\):
\({\mathcal {F}}(\varvec{{\theta }}) = \mathop {{{\,{{\mathbb {E}}}\,}}}\limits _{s \sim \nu ,\, a \sim \pi _{\varvec{{\theta }}}(\cdot |s)} \left[ \varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }}) \varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }})^T \right]\)
is non–singular for all \(\varvec{{\theta }} \in \varTheta\), then \(\varPi _\varTheta\) is identifiable. In this case, we denote with \(\lambda _{\min } = \inf _{\varvec{{\theta }} \in \varTheta } \lambda _{\min } \left( {\mathcal {F}}(\varvec{{\theta }}) \right) > 0\).
Proposition 1 of Appendix A.2.1 shows that a sufficient condition for the identifiability in the case of Gaussian and Boltzmann linear policies is that the second moment matrix of the feature vector \(\mathop {{{\,{{\mathbb {E}}}\,}}}\limits _{s \sim \nu } \left[ \varvec{{\phi }}(s)\varvec{{\phi }}(s)^T \right]\) is non–singular and, for the Boltzmann policy, that \(\pi _{\varvec{{\theta }}}\) plays each action with positive probability.
Remark 3
(How to enforce identifiability?) Requiring that \(\mathop {{{\,{{\mathbb {E}}}\,}}}\limits _{s \sim \nu } \left[ \varvec{{\phi }}(s)\varvec{{\phi }}(s)^T \right]\) is full rank is essentially equivalent to requiring that the features \(\phi _i\), \(i \in \{1,...,d\}\), are linearly independent. This condition can be easily met with a preprocessing phase that removes the linearly dependent features, for instance, by employing Principal Component Analysis (PCA, Jolliffe, 2011). For this reason, in our experimental evaluation we will always consider the case of linearly independent features.
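A minimal sketch of one possible preprocessing step (a simple stand-in for PCA, with illustrative names): detect the numerical rank of the feature matrix and keep a maximal subset of independent columns via column-pivoted QR.

```python
import numpy as np
from scipy.linalg import qr

def independent_feature_columns(Phi, tol=1e-8):
    """Phi: (n, q) matrix whose rows are phi(s_i). Returns indexes of independent columns."""
    _, s, _ = np.linalg.svd(Phi, full_matrices=False)
    rank = int(np.sum(s > tol * s[0]))            # numerical rank of the feature matrix
    _, _, piv = qr(Phi, mode="economic", pivoting=True)
    return sorted(piv[:rank].tolist())            # columns to keep; drop the others

Phi = np.column_stack([np.ones(100), np.arange(100.0), 2 * np.arange(100.0)])  # 3rd = 2 * 2nd
print(independent_feature_columns(Phi))          # e.g. [0, 1] (or [0, 2]): one copy is dropped
```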
When working with samples, however, we need to estimate the FIM, leading to the empirical FIM, in which the expectation over the states of Eq. (8) is replaced with the sample mean:
\(\widehat{{\mathcal {F}}}(\varvec{{\theta }}) = \frac{1}{n} \sum _{i=1}^{n} \mathop {{{\,{{\mathbb {E}}}\,}}}\limits _{a \sim \pi _{\varvec{{\theta }}}(\cdot |s_i)} \left[ \varvec{{{\overline{t}}}}(s_i,a,\varvec{{\theta }}) \varvec{{{\overline{t}}}}(s_i,a,\varvec{{\theta }})^T \right] ,\)
where \(\{s_i\}_{i=1}^n \sim \nu\). We denote with \({\widehat{\lambda }}_{\min } = \inf _{\varvec{{\theta }} \in \varTheta }\lambda _{\min }(\widehat{{\mathcal {F}}}(\varvec{{\theta }}))\) the minimum eigenvalue of the empirical FIM. In order to carry out the subsequent analysis, we need to require that this quantity is non-zero.
Assumption 5
(Positive Eigenvalues of Empirical FIM) The minimum eigenvalue of the empirical FIM \(\widehat{{\mathcal {F}}}(\varvec{{\theta }})\) is non-zero for all \(\varvec{{\theta }} \in \varTheta\), i.e., \({\widehat{\lambda }}_{\min } = \inf _{\varvec{{\theta }} \in \varTheta }\lambda _{\min }(\widehat{{\mathcal {F}}}(\varvec{{\theta }})) > 0\).
The condition of Assumption 5 can be enforced as long as the true FIM \({{\mathcal {F}}}(\varvec{{\theta }})\) has a positive minimum eigenvalue \(\lambda _{\min }\), i.e., under the identifiability assumption (Lemma 2) and given a sufficiently large number of samples. Proposition 4 of Appendix A.2.1 provides the minimum number of samples such that with high probability it holds that \({\widehat{\lambda }}_{\min } > 0\).
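The sketch below computes the empirical FIM and its minimum eigenvalue for a Boltzmann linear policy (with the same reference-action parametrization used in the earlier Boltzmann sketch); it is only meant to illustrate how \(\widehat{\lambda }_{\min }\) can be checked in practice, not the paper's implementation.

```python
import numpy as np

def empirical_fim_boltzmann(theta, states):
    """theta: (n_actions - 1, q) weights (last action as zero-weight reference); states: (n, q)."""
    n, q = states.shape
    logits = np.hstack([states @ theta.T, np.zeros((n, 1))])
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    fim = np.zeros((theta.size, theta.size))
    for s, p in zip(states, probs[:, :-1]):
        # covariance of the sufficient statistic t(s, a) = e_a (x) phi(s) under pi_theta(.|s):
        # (diag(p) - p p^T) (x) phi(s) phi(s)^T, averaged over the sampled states
        fim += np.kron(np.diag(p) - np.outer(p, p), np.outer(s, s))
    return fim / n

rng = np.random.default_rng(0)
states, theta = rng.normal(size=(200, 3)), rng.normal(size=(1, 3))   # 2 actions, q = 3 features
lam_min_hat = np.linalg.eigvalsh(empirical_fim_boltzmann(theta, states)).min()
print(f"estimated minimum eigenvalue of the empirical FIM: {lam_min_hat:.4f}")   # > 0 here
```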
We are now ready to present a concentration result, of independent interest, for the parameters and the negative log–likelihood that represents the central tool of our analysis.
Theorem 2
Under Assumptions 1, 2, 3, 4, and 5, let \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\) be a dataset of \(n>0\) independent samples, where \(s_i \sim \nu\) and \(a_i \sim \pi _{\varvec{{\theta }}^{\text {Ag}}}(\cdot |s_i)\). Let \(\widehat{\varvec{{\theta }}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } {\widehat{\ell }}(\varvec{{\theta }})\) and \(\varvec{{\theta }}^{\text {Ag}}= \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } {\ell }(\varvec{{\theta }})\). Then, for any \(\delta \in [0,1]\), with probability at least \(1-\delta\) it holds that:
Furthermore, with probability at least \(1-\delta\), it holds that individually:
Proof sketch
The idea of the proof is to first obtain a probabilistic bound on the parameter difference in norm \(\left\| \widehat{ \varvec{{\theta }}} - \varvec{{\theta }}^\text {Ag} \right\| _2\). This result is given in Theorem 6. Then, we use the latter result together with Taylor expansion to bound the differences \(\ell (\widehat{\varvec{{\theta }}}) - \ell ({\varvec{{\theta }}}^\text {Ag})\) and \({\widehat{\ell }}({\varvec{{\theta }}}^\text {Ag}) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}})\), as in Corollary 1. The full derivation can be found in Appendix A.2.3.
The theorem shows that the \(L_2\)–norm of the difference between the maximum likelihood parameter \(\widehat{\varvec{{\theta }}}\) and the true parameter \(\varvec{{\theta }}^{\text {Ag}}\) concentrates with rate \({\mathcal {O}}(n^{-1/2})\) while the likelihood \({\widehat{\ell }}\) and its expectation \(\ell\) concentrate with faster rate \({\mathcal {O}}(n^{-1})\).
Identification Rule Analysis We are now ready to start the analysis of Identification Rule 2. The goal of the analysis is, informally, to bound the probability of an identification error as a function of the number of samples n and the threshold function \(c_1\). For this purpose, we define the following quantities.
Definition 3
Consider an identification rule producing \({\widehat{I}}\) as the approximate parameter index set. We define the significance \(\alpha\) and the power \(1-\beta\) of the identification rule as:
\(\alpha = \Pr \left( \exists i \notin I^{\text {Ag}}: i \in {\widehat{I}} \right) , \qquad \beta = \Pr \left( \exists i \in I^{\text {Ag}}: i \notin {\widehat{I}} \right) .\)
Thus, \(\alpha\) represents the probability that the identification rule selects a parameter that the agent does not control, whereas \(\beta\) is the probability that the identification rule does not select a parameter that the agent does control.Footnote 9
By employing the results we derived for the exponential family (Theorem 2) we can now bound \(\alpha\) and \(\beta\), under a slightly more demanding assumption on \({\widehat{\lambda }}_{\min }\).
Theorem 3
Let \({\widehat{I}}_{c}\) be the set of parameter indexes selected by Identification Rule 2, obtained using \(n>0\) i.i.d. samples collected with \(\pi _{\varvec{{\theta }}^{\text {Ag}}}\), with \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\). Then, under Assumptions 1, 2, 3, 4, and 5, let \({\varvec{{\theta }}}_i^{\text {Ag}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _i} \ell (\varvec{{\theta }})\) for all \(i \in \{1,...,d\}\) and \(\xi = \min \left\{ 1, \frac{\lambda _{\min }}{\sigma ^2} \right\}\). If \({\widehat{\lambda }}_{\min } \ge \frac{\lambda _{\min }}{2\sqrt{2}}\) and \(\ell ({\varvec{{\theta }}}_i^{\text {Ag}}) - \ell (\varvec{{\theta }}^{\text {Ag}}) \ge c_1\), it holds that:
Proof sketch
Concerning \(\alpha = \Pr \left( \exists i \notin I^{\text {Ag}}: i \in {\widehat{I}}_c \right)\), we employ a technique similar to that of Lemma 2 in (Garivier and Kaufmann, 2019) to remove the existential quantification. Instead, for \(\beta = \Pr \left( \exists i \in I^{\text {Ag}}: i \notin {\widehat{I}}_c \right)\) we first perform a union bound over \(i \in I^{\text {Ag}}\) and then we bound the individual \(\Pr \left( i \notin {\widehat{I}}_c \right)\). The full derivation can be found in Appendix A.3. \(\square\)
In principle, we could employ Theorem 3 to derive a proper value of \(c_1\) and n, given a required value of \(\alpha\) and \(\beta\). Unfortunately, their expressions depend on \({\lambda }_{\min }\), which is unknown in practice. As already mentioned in the previous sections, we recommend employing Wilks’ asymptotic approximation to set the threshold function as \(c_1= \chi ^2_{1,1-{\delta }/{d}}\). This choice allows an asymptotic control of the significance of the identification rule.
Theorem 4
Let \({\widehat{I}}_{c}\) be the set of parameter indexes selected by Identification Rule 2, obtained using \(n>0\) i.i.d. samples collected with \(\pi _{\varvec{{\theta }}^{\text {Ag}}}\), with \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\). Then, under suitable regularity conditions (see Casella and Berger, (2002) Section 10.6.2), if \(c_1 = \chi ^2_{1,1-{\delta }/{d}}\) it holds that \(\alpha \le \delta\) when \(n\rightarrow +\infty\).
Proof
Starting from the definition of \(\alpha\), we first perform a union bound over \(i \notin I^{\text {Ag}}\) to remove the existential quantification.
Now, we bound each \(\Pr \left( i \in {\widehat{I}}_c \right)\) individually, recalling that \(\lambda _i\) is distributed asymptotically as a \(\chi ^2\) distribution with 1 degree of freedom and that \(c_1 = \chi^2 _{1,1-\delta /d}\):
Thus, we have that when \(n \rightarrow +\infty\):
\(\square\)
5 Policy space identification in a configurable environment
The identification rules presented so far are unable to distinguish between a parameter set to zero because the agent cannot control it or because zero is its optimal value. To overcome this issue, we employ the Conf–MDP properties to select a configuration in which the parameters we want to examine have an optimal value other than zero. Intuitively, if we want to test whether the agent can control parameter \(\theta _i\), we should place the agent in an environment \(\varvec{{\omega }}_i \in \varOmega\) where \(\theta _i\) is “maximally important” for the optimal policy. This intuition is justified by Theorem 3, since to maximize the power of the test (\(1-\beta\)), all other things being equal, we should maximize the log–likelihood gap \(\ell ({\varvec{{\theta }}_i^\text {Ag}}) - \ell ({\varvec{{\theta }}^\text {Ag}})\), i.e., parameter \(\theta _i\) should be essential to justify the agent’s behavior. Let \(I \subseteq \{1,...,d\}\) be a set of parameter indexes we want to test; our ideal goal is to find the environment \(\varvec{{\omega }}_I\) such that:
where \({\varvec{{\theta }}^\text {Ag}}(\varvec{{\omega }}) \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } J_{{\mathcal {M}}_{\varvec{{\omega }}}}(\varvec{{\theta }})\) and \({\varvec{{\theta }}}_I^\text {Ag}(\varvec{{\omega }}) \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _I} J_{{\mathcal {M}}_{\varvec{{\omega }}}}(\varvec{{\theta }})\) are the parameters of the optimal policies in the environment \({\mathcal {M}}_{\varvec{{\omega }}}\) considering \(\varPi _{\varTheta }\) and \(\varPi _{\varTheta _I}\) as policy spaces respectively. Clearly, given the samples \({\mathcal {D}}\) collected with a single optimal policy \(\pi ^\text {Ag}(\varvec{{\omega }}_0)\) in a single environment \({\mathcal {M}}_{\varvec{{\omega }}_0}\), solving problem (10) is hard as it requires performing an off–distribution optimization both on the space of policy parameters and configurations. For these reasons, we consider a surrogate objective that assumes that the optimal parameter in the new configuration can be reached by performing a single gradient step.Footnote 10
Theorem 5
Let \(I \subseteq \{1,...,d\}\) and \({\overline{I}} =\{1,...,d\}\setminus I\). For a vector \(\varvec{{v}} \in {\mathbb {R}}^d\), we denote with \(\varvec{{v}} \vert _I\) the vector obtained by setting to zero the components in I. Let \(\varvec{{\theta }}_0 = \varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0) \in \varTheta\) be the initial parameter. Let \(\alpha \ge 0\) be a learning rate, \(\varvec{{\theta }}_I^\text {Ag} (\varvec{{\omega }}) = \varvec{{\theta }}_0 + \alpha \nabla _{\varvec{{\theta }}} J_{{\mathcal {M}}_{\varvec{{\omega }}}} (\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0)) \vert _I\) and \(\varvec{{\theta }}^\text {Ag} (\varvec{{\omega }}) = \varvec{{\theta }}_0 + \alpha \nabla _{\varvec{{\theta }}} J_{{\mathcal {M}}_{\varvec{{\omega }}}} (\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0))\). Then, under Assumption 2, we have:
Proof
By second-order Taylor expansion of \(\ell\) and recalling that \(\nabla _{\varvec{{\theta }}} {\ell }({\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }})}) = \varvec{{0}}\), we have:
\(\square\)
Thus, we maximize the \(L_2\)–norm of the gradient components that correspond to the parameters we want to test. Since we have at our disposal only samples \({\mathcal {D}}\) collected with the current policy \(\pi _{\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0)}\) and in the current environment \(\varvec{{\omega }}_0\), we have to perform an off–distribution optimization over \(\varvec{{\omega }}\). To this end, we employ an approach analogous to that of (Metelli et al., 2018b, 2020), where we optimize the empirical version of the objective with a penalization that accounts for the distance between the distributions over trajectories:

where \(\zeta \ge 0\) is a regularization parameter. We assume to have access to a dataset of trajectories \({\mathcal {D}} = \{\tau _i\}_{i=1}^n\) independently collected using policy \(\pi _{\varvec{{\theta }}}\) in the environment \({\mathcal {M}}_{\varvec{{\omega }}_0}\). Each trajectory is a sequence of triples \(\{(s_{i,t},a_{i,t},r_{i,t})\}_{t=1}^T\), where T is the trajectory horizon. The expression of the gradient estimator is given by:

The expression is obtained starting from the well–known G(PO)MDP gradient estimator and adapting it for off–distribution estimation by introducing the importance weight (Metelli et al., 2018b). The dissimilarity penalization term corresponds to the estimated 2–Rényi divergence (Rényi, 1961) and is obtained from the following expression, which represents the empirical second moment of the importance weight:
Refer to (Metelli et al., 2018b) for the theoretical background behind the choice of this objective function. For conciseness, we report the pseudocode of the identification procedure in a configurable environment for Identification Rule 2 only (Algorithm 3), while the pseudocode for Identification Rule 1 can be found in Appendix B.
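To ground the procedure, here is a schematic sketch of the surrogate criterion: an importance-weighted G(PO)MDP estimate of the gradient of \(J_{{\mathcal {M}}_{\varvec{{\omega }}}}\) at the current policy, restricted to the tested components I, penalized through the empirical second moment of the importance weights. The callables `log_mu`, `log_p`, and `grad_log_pi` (log-densities of the configurable initial-state and transition models and the policy score) are hypothetical, the reward baseline is omitted, and the exact penalty of Eq. (11) is abstracted into `zeta`; this is one simple way to combine the two terms, not the paper's implementation.

```python
import numpy as np

def surrogate_objective(trajs, I, omega, omega0, log_mu, log_p, grad_log_pi, gamma, zeta):
    """trajs: list of trajectories, each a list of (s, a, r, s_next) tuples collected under omega0."""
    grads, weights = [], []
    for traj in trajs:
        log_w = log_mu(traj[0][0], omega) - log_mu(traj[0][0], omega0)   # initial-state ratio
        score_sum, g = 0.0, 0.0
        for t, (s, a, r, s_next) in enumerate(traj):
            log_w += log_p(s_next, s, a, omega) - log_p(s_next, s, a, omega0)  # transition ratio
            score_sum = score_sum + grad_log_pi(s, a)       # cumulative policy score (G(PO)MDP)
            g = g + (gamma ** t) * r * score_sum
        weights.append(np.exp(log_w))
        grads.append(g)
    weights, grads = np.asarray(weights), np.asarray(grads)
    grad_hat = (weights[:, None] * grads).mean(axis=0)      # off-distribution gradient estimate
    second_moment = np.mean(weights ** 2)                   # plug-in quantity tied to the 2-Renyi div.
    return np.linalg.norm(grad_hat[list(I)]) - zeta * np.sqrt(second_moment / len(trajs))
```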

6 Experimental results
In this section, we present the experimental results, focusing on three aspects of policy space identification.
- In Sect. 6.1, we provide experiments to assess the quality of our identification rules in terms of their ability to correctly identify the parameters controlled by the agent.
- In Sect. 6.2, we focus on the application of policy space identification to Imitation Learning, comparing our identification rules with commonly employed regularization techniques.
- In Sect. 6.3, we consider the Conf-MDP framework and we show how properly identifying the parameters controlled by the agent allows learning better (more specific) environment configurations.
Additional experiments together with the hyperparameter values are reported in Appendix C.
6.1 Identification rules experiments
In this section, we provide two experiments to test the ability of our identification rules in properly selecting the parameters the agent controls in different settings. We start with an experiment on a discrete grid world (Sect. 6.1.1) to highlight the beneficial effects of environment configuration in parameter identification. Then, we provide an experiment on a simulated car driving domain (Sect. 6.1.2) in which we compare the combinatorial and the simplified identification rules.
6.1.1 Discrete grid world
The grid world environment is a simple representation of a two-dimensional world (5\(\times\)5 cells) in which an agent has to reach a target position by moving in the four directions. Whenever an action is performed, there is a small probability of failure (0.1) triggering a random action. The initial position of the agent and the target position are drawn at the beginning of each episode from a Boltzmann distribution \(\mu _{\varvec{{\omega }}}\). The agent plays a Boltzmann linear policy \(\pi _{\varvec{{\theta }}}\) with binary features \(\varvec{{\phi }}\) indicating its current row and column and the row and column of the goal.Footnote 11 For each run, the agent can control a subset \(I^{\text {Ag}}\) of the parameters \(\varvec{{\theta }}_{I^{\text {Ag}}}\) associated with those features, which is randomly selected. Furthermore, the supervisor can configure the environment by changing the parameters \(\varvec{{\omega }}\) of the initial state distribution \(\mu _{\varvec{{\omega }}}\). Thus, the supervisor can induce the agent to explore certain regions of the grid world and, consequently, change the relevance of the corresponding parameters in the optimal policy.
The goal of this set of experiments is to show the advantages of configuring the environment when performing the policy space identification using rule 2. Figure 2 shows the empirical \({\widehat{\alpha }}\) and \({\widehat{\beta }}\), i.e., the fraction of parameters that the agent does not control that are wrongly selected and the fraction of those the agent controls that are not selected respectively, as a function of the number m of episodes used to perform the identification. We compare two cases: conf where the identification is carried out by also configuring the environment, i.e., optimizing Eq. (11), and no-conf in which the identification is performed in the original environment only. In both cases, we can see that \({\widehat{\alpha }}\) is almost independent of the number of samples, as it is directly controlled by the threshold function \(c_1\). Differently, \({\widehat{\beta }}\) decreases as the number of samples increases, i.e., the power of the test \(1-{\widehat{\beta }}\) increases with m. Remarkably, we observe that configuring the environment gives a significant advantage in understanding the parameters controlled by the agent w.r.t. using a fixed environment, as \({\widehat{\beta }}\) decreases faster in the conf case. This phenomenon also empirically justifies our choice of objective (Eq. (11)) for selecting the new environment. Hyperparameters, further experimental results, together with experiments on a continuous version of the grid world, are reported in Appendix C.1.1–C.1.2.
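For reference, the empirical error rates plotted here can be computed as in the following small sketch, where `I_hat` is the estimated index set and `I_ag` the set of parameters the agent truly controls (the normalization is our reading of the definition above).

```python
def empirical_alpha_beta(I_hat, I_ag, d):
    not_controlled = set(range(d)) - set(I_ag)
    alpha_hat = len(set(I_hat) & not_controlled) / max(len(not_controlled), 1)  # wrongly selected
    beta_hat = len(set(I_ag) - set(I_hat)) / max(len(I_ag), 1)                  # wrongly discarded
    return alpha_hat, beta_hat

print(empirical_alpha_beta(I_hat={0, 2, 5}, I_ag={0, 2, 3}, d=8))   # -> (0.2, 0.333...)
```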
6.1.2 Simulated car driving
We consider a simple version of a car driving simulator, in which the agent has to reach the end of a road in the minimum amount of time, avoiding running off-road. The agent perceives its own speed and four sensors placed at different angles that provide the distance from the edge of the road, and it can act on acceleration and steering.
The purpose of this experiment is to show a case in which the identifiability assumption (Assumption 2) may not be satisfied. The policy \(\pi _{\varvec{{\theta }}}\) is modeled as a Gaussian policy whose mean is computed via a single-hidden-layer neural network with 8 neurons. Some of the sensors are not available to the agent; our goal is to identify which ones the agent can perceive.
In Fig. 3, we compare the performance of the Identification Rules 1 (Combinatorial) and 2 (Simplified), showing the fraction of runs that correctly identify the policy space. We note that, while the simplified rule seems to outperform the combinatorial one for a small number of samples, when the number of samples increases, the combinatorial rule displays remarkable stability, approaching the correct identification in all the runs. This is explained by the fact that, when multiple representations of the same policy are possible (as in this case, with a neural network policy), considering one parameter at a time might induce the simplified rule to select a wrong set of parameters. Hyperparameters are reported in Appendix C.1.3.
Fig. 4 Discrete Grid World: Norm of the difference between the expert’s parameter \(\varvec{{\theta }}^{\text {Ag}}\) and the estimated parameter \(\widehat{\varvec{{\theta }}}\) (left) and expected KL-divergence between the expert’s policy \(\pi _{\varvec{{\theta }}^{\text {Ag}}}\) and the estimated policy \(\pi _{\widehat{\varvec{{\theta }}}}\) (right) as a function of the number of collected episodes m. 25 runs, 95% c.i.
6.2 Application to imitation learning
IL aims at recovering a policy replicating the behavior of an expert agent. Selecting the parameters that an agent can control can be interpreted as applying a form of regularization to the IL problem (Osa et al., 2018). In the IL literature, a widely used technique is based on entropy regularization (Neu et al., 2017), which was employed in several successful algorithms, such as Maximum Causal Entropy IRL methods (MCE, Ziebart et al., 2008, 2010), and Generative Adversarial IL (Ho and Ermon, 2016). Alternatively, other approaches aim at enforcing a sparsity constraint on the recovered policy parameters (e.g., Lee et al., 2018; Reddy et al., 2019; Brantley et al., 2020).
The goal of this experiment consists in showing that if we have appropriately identified the expert’s policy space, we can mitigate overfitting/underfitting phenomena, with a general benefit on the process of learning the imitating policy. This experiment is conducted in the grid world domain, introduced in Sect. 6.1.1, using the same setting. In each run, the expert agent plays a (near) optimal Boltzmann policy \(\pi _{\varvec{{\theta }}^{\text {Ag}}}\) that makes use of a subset of the available parameters and provides a dataset \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\) of n samples coming from m episodes.
In the IL framework knowing the policy space of the expert agent means properly tailoring the hypothesis space in which we search for the imitation policy. For this reason, we propose a comparison with common regularization techniques applied to maximum likelihood estimation. Figure 4 shows on the left the norm of the parameter difference \(\left\| \widehat{\varvec{{\theta }}} - \varvec{{\theta }}^{\text {Ag}} \right\| _{2}\) between the parameter recovered by the different IL methods \(\widehat{\varvec{{\theta }}}\) and the true parameter employed by the expert \(\varvec{{\theta }}^{\text {Ag}}\), whereas on the right we plot the estimated expected KL-divergence between the imitation policy and the expert’s policy computed as:
The lines Conf and No-conf refer to the results of ML estimation obtained by restricting the policy space to the parameters identified by our simplified rule with and without employing environment configurability, respectively (precisely as in Sect. 6.1.1). ML, Ridge, and Lasso correspond to maximum likelihood estimation in the full parameter space. Specifically, they are obtained by minimizing the objective:
\({\widehat{\ell }}(\varvec{{\theta }}) + \lambda ^{\text {R}} \left\| \varvec{{\theta }} \right\| _2^2 + \lambda ^{\text {L}} \left\| \varvec{{\theta }} \right\| _1 ,\)

For ML we perform no regularization (\(\lambda ^{\text {R}}=\lambda ^{\text {L}}=0\)), for Ridge we set \(\lambda ^{\text {R}}=0.001\) and \(\lambda ^{\text {L}}=0\), and for Lasso we have \(\lambda ^{\text {R}}=0\) and \(\lambda ^{\text {L}}=0.001\).
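These baselines amount to the following sketch, which takes a generic `nll` callable on a flat parameter vector (e.g., the Boltzmann negative log-likelihood sketched in Sect. 4, suitably wrapped); the derivative-free optimizer is our own choice to cope with the non-smooth \(L_1\) term, not the paper's setup.

```python
import numpy as np
from scipy.optimize import minimize

def regularized_ml(nll, theta0, lam_ridge=0.0, lam_lasso=0.0):
    objective = lambda th: nll(th) + lam_ridge * np.sum(th ** 2) + lam_lasso * np.sum(np.abs(th))
    return minimize(objective, theta0, method="Powell").x

# theta_ml    = regularized_ml(nll, theta0)                     # plain ML
# theta_ridge = regularized_ml(nll, theta0, lam_ridge=1e-3)     # Ridge, lambda_R = 0.001
# theta_lasso = regularized_ml(nll, theta0, lam_lasso=1e-3)     # Lasso, lambda_L = 0.001
```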
We observe that Conf, i.e., the usage of our identification rule, together with environment configuration, outperforms the other methods. This is more evident in the expected KL-divergence plot (right), which is a more robust index compared to the norm of the parameter difference (left). Ridge and Lasso regularizations display good behavior, better than both the identification rule without configuration (No-Conf) and the plain maximum likelihood without regularization (ML). This illustrates two important points. First, it confirms the benefits of configuring the environment for policy space identification. Second, it shows that a proper selection of the parameters controlled by the agent allows improving over standard ML, which tends to overfit.Footnote 12 We tested additional values of the regularization hyperparameters \(\lambda ^{\text {R}}\) and \(\lambda ^{\text {L}}\) and other regularization techniques (Shannon and Tsallis entropy). The complete results are reported in Appendix C.2.
It is worth noting that the specific IL setting we consider, i.e., the availability of an initial dataset \({\mathcal {D}}\) of expert’s demonstrations with no further interaction allowed,Footnote 13 rules out from the comparison a large body of the literature that requires the possibility to interact with the expert or with the environment (e.g., Ho and Ermon, 2016; Lee et al., 2018). Nevertheless, these IL algorithms could be, in principle, adapted to this challenging no-interaction setting at the cost of resorting to off-policy estimation techniques (Owen, 2013), which, however, might inject further uncertainty in the learning process (see Appendix C.2 for details).
6.3 Application to configurable MDPs
The knowledge of the agent’s policy space could be relevant when the learning process involves the presence of an external supervisor, as in the case of Configurable Markov Decision Process (Metelli et al., 2018a, 2019). In a Conf-MDP, the supervisor is in charge of selecting the best configuration for the agent, i.e., the one that allows the agent to achieve the highest performance possible. As intuition suggests, the best environment configuration is closely related to the agent’s capabilities. Agents with different perception and actuation possibilities might benefit from different configurations. Thus, the external supervisor should be aware of the agent’s policy space to select the most appropriate configuration for the specific agent.
In the Minigolf environment (Lazaric et al., 2007), an agent hits a ball using a putter with the goal of reaching the hole in the minimum number of attempts. Surpassing the hole causes the termination of the episode and a large penalization. The agent selects the force applied to the putter by playing a Gaussian policy linear in some polynomial features (complying with Lemma 2) of the distance from the hole (x) and the friction of the green (f). When an action is performed, a Gaussian noise is added whose magnitude depends on the green friction and on the action itself.
The goal of this experiment is to highlight that knowing the policy space is beneficial when learning in a Conf–MDP. We consider two agents with different perception capabilities: \({\mathscr {A}}_1\) has access to both x and f, whereas \({\mathscr {A}}_2\) knows only x. Thus, we expect that \({\mathscr {A}}_1\) learns a policy that allows reaching the hole in a smaller number of hits, compared to \({\mathscr {A}}_2\), as it can calibrate the force according to the friction, whereas \({\mathscr {A}}_2\) has to be more conservative, being unaware of f. There is also a supervisor in charge of selecting, for the two agents, the best putter length \(\omega\), i.e., the configurable parameter of the environment.
Figure 5-left shows the performance of the optimal policy as a function of the putter length \(\omega\). We can see that for agent \({\mathscr {A}}_1\) the optimal putter length is \(\omega ^\text {Ag}_{{\mathscr {A}}_1}=5\), while for agent \({\mathscr {A}}_2\) it is \(\omega ^\text {Ag}_{{\mathscr {A}}_2}=11.5\). Figure 5-right compares the performance of the optimal policy of agent \({\mathscr {A}}_2\) when the putter length \(\omega\) is chosen by the supervisor using four different strategies. In (i) the configuration is sampled uniformly in the interval [1, 15]. In (ii) the supervisor employs the optimal configuration for agent \({\mathscr {A}}_1\) (\(\omega =5\)), i.e., assuming the agent is aware of the friction. (iii) is obtained by selecting the optimal configuration for the policy space produced by our Identification Rule 2. Finally, (iv) is derived by employing an oracle that knows the true agent’s policy space (\(\omega =11.5\)). We can see that the performance of the identification procedure (iii) is comparable with that of the oracle (iv) and notably higher than the performance obtained when employing an incorrect policy space (ii). Hyperparameters and additional experiments are reported in Appendix C.3.
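As a rough illustration of strategy (iii), the supervisor’s choice can be thought of as a simple search over candidate putter lengths, evaluating the agent’s best policy restricted to the identified parameters; the helper routines below are hypothetical placeholders, not the procedure actually used in the experiments.

```python
# Hypothetical sketch of strategy (iii): choose the putter length that maximizes
# the return achievable within the identified policy space.
import numpy as np

def select_configuration(env, identified_params, omegas, n_episodes,
                         learn_restricted_policy, evaluate_return):
    # learn_restricted_policy(env, params) and evaluate_return(env, policy, n)
    # stand in for the learning and evaluation routines of the experiment.
    returns = []
    for omega in omegas:
        env.set_putter_length(omega)                       # configure the environment
        policy = learn_restricted_policy(env, identified_params)
        returns.append(evaluate_return(env, policy, n_episodes))
    return omegas[int(np.argmax(returns))]

# e.g., omegas = np.linspace(1.0, 15.0, 29) mirrors the interval [1, 15]
```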
7 Conclusions
In this paper, we addressed the problem of identifying the policy space available to an agent in a learning process by simply observing its behavior when playing the optimal policy within such a space. We introduced two identification rules, both based on the GLR test, which can be applied to select the parameters controlled by the agent. Additionally, we have shown how to use the configurability property of the environment to improve the effectiveness of the identification rules. The experimental evaluation highlights some essential points. First, the identification of the policy space brings advantages to the learning process in a Conf–MDP, helping the supervisor to choose the most suitable environment configuration. Second, we have shown that configuring the environment is beneficial for speeding up the identification process. Additionally, we have verified that policy space identification can improve imitation learning. Future research might investigate the usage of Bayesian statistical tests and the application of policy space identification to multi-agent RL (Busoniu et al., 2008). We believe that an agent in a multi-agent system might benefit from the knowledge of the policy space of its adversaries to understand what their action possibilities are and make decisions accordingly.
Notes
Although we assume to act in a Conf–MDP, we stress that our primary goal is to identify the policy space of the agent, rather than learning a profitable configuration in the Conf–MDP.
By “controllable” parameter we mean a parameter whose value can be changed by the agent, while the “uncontrollable” parameters are those which are permanently set to zero. This is a way of modeling the limitations of the policy space.
We stress that, since we restrict the search to the policy space \(\varPi _\varTheta\), \(\pi ^{\text {Ag}}\) might be suboptimal compared to the optimal policy in the space of Markovian stationary policies.
For exposition simplicity, we limit the presentation to the case of i.i.d. samples (Sutton et al., 2008). Nevertheless, by means of the blocking technique (Yu, 1994), it is possible to generalize the concentration results to \(\beta\)-mixing strictly stationary processes, provided that the mixing rate is exponential (e.g., Antos et al., 2008; Lazaric et al., 2012; Dai et al., 2018).
The extension of the identification rules to (known) fixed values different from zero is straightforward.
We can also look at \(\pi ^{\text {Ag}}_{\epsilon }\) as the optimal policy within \(\varPi _{\varTheta }\) for a different MDP \({\mathcal {M}}_\epsilon\), that is an approximation of the original MDP \({\mathcal {M}}\).
This setting is equivalent to a particular case of the combinatorial rule in which \({\mathcal {H}}_{\star,i} \equiv {\mathcal {H}}_{\star,\{1,...,d\}\setminus \{i\} }\), with \(\star \in \{0,1\}\) and, consequently, \(\lambda _i \equiv \lambda _{\{1,...,d\}\setminus \{i\}}\) and \(\varTheta _i = \varTheta _{\{1,...,d\}\setminus \{i\}}\).
We limit our analysis to Identification Rule 2 since we will show that, in the case of linear policies belonging to the exponential family, the identifiability property can be easily enforced.
We use the symbols \(\alpha\) and \(\beta\) to highlight the analogy between these probabilities and the type I and type II error probabilities of a statistical test.
This idea shares some analogies with the adapted parameter in the meta-learning setting (Finn et al., 2017).
The features are selected to fulfill Lemma 2.
It is worth noting that the classical regularization techniques, like ridge and lasso, require choosing the regularization hyperparameter \(\lambda ^{\star }\) with \(\star \in \{R,L\}\). In our experiments, we searched for the best parameter in \(\{0.0001, 0.001, 0.01, 0.1, 1\}\).
This setting was recently defined “truly batch model-free” (Ramponi et al., 2020).
Notice that we are considering a set of \(k+1\) actions, but the matrix \(\widetilde{\varvec{{\theta }}}\) has only k rows. This allows enforcing the identifiability property; otherwise, if we had a row for each of the \(k+1\) actions, we would have multiple representations of the same policy (obtained by rescaling the rows by the same amount).
References
Antos, A., Szepesvári, C., & Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1), 89–129. https://doi.org/10.1007/s10994-007-5038-2.
Barnard, G. A. (1959). Control charts and stochastic processes. Journal of the Royal Statistical Society: Series B (Methodological)
Ben-Israel, A., Greville, T.N. (2003). Generalized inverses: theory and applications, vol 15. Berlin: Springer Science & Business Media
Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities—A nonasymptotic theory of independence. Oxford: Oxford University Press.
Brantley, K., Sun, W., Henaff, M. (2020). Disagreement-regularized imitation learning. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia. April 26-30, 2020. OpenReview.net
Brown, L.D. (1986). Fundamentals of statistical exponential families: with applications in statistical decision theory. Hayward: Institute of Mathematical Statistics.
Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2), 156–172.
Casella, G., & Berger, R. L. (2002). Statistical inference (Vol. 2). Duxbury: Pacific Grove.
Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., Song, L. (2018). SBEED: convergent reinforcement learning with nonlinear function approximation. In: Dy JG, Krause A (eds) Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden. July 10-15, 2018, PMLR, Proceedings of Machine Learning Research, vol. 80, pp. 1133–1142
Deisenroth, M.P., Rasmussen, C.E. (2011). PILCO: A model-based and data-efficient approach to policy search. In: Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28–July 2, 2011, pp. 465–472
Deisenroth, M.P., Neumann, G., Peters, J. (2013). A survey on policy search for robotics. Foundations and Trends in Robotics
Finn, C., Abbeel, P., Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning, PMLR, Proceedings of Machine Learning Research, vol. 70, pp. 1126–1135
Garivier, A., Kaufmann, E. (2019). Non-asymptotic sequential tests for overlapping hypotheses and application to near optimal arm identification in bandit models. arXiv preprint arXiv:190503495
Goodman, S. N. (1999). Toward evidence-based medical statistics. 1: The p value fallacy. Annals of internal medicine, 130(12), 995–1004.
Ho, J., Ermon, S. (2016). Generative adversarial imitation learning. In: Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R (eds) Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 4565–4573
Hsu, D., Kakade, S., Zhang, T., et al. (2012). A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17
Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Mathematical Proceedings of the Cambridge Philosophical Society, 31, 203–222.
Jolliffe, I.T. (2011). Principal component analysis. In: Lovric M (ed) International Encyclopedia of Statistical Science. Springer, pp. 1094–1096. https://doi.org/10.1007/978-3-642-04898-2_455
Lazaric, A., Restelli, M., Bonarini, A. (2007). Reinforcement learning in continuous action spaces through sequential monte carlo methods. In: Platt JC, Koller D, Singer Y, Roweis ST (eds) Advances in Neural Information Processing Systems 20. In: Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, Curran Associates, Inc., pp. 833–840
Lazaric, A., Ghavamzadeh, M., & Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13, 3041–3074.
Lee, K., Choi, S., Oh, S. (2018). Maximum causal Tsallis entropy imitation learning. In: Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3–8, 2018, Montréal, Canada, pp. 4408–4418
Levine, S., Koltun, V. (2013). Guided policy search. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, JMLR.org, JMLR Workshop and Conference Proceedings, vol. 28, pp. 1–9
Li, L., Lu, Y., Zhou, D. (2017). Provably optimal algorithms for generalized linear contextual bandits. In: Proceedings of the 34th International Conference on Machine Learning, PMLR, Proceedings of Machine Learning Research, vol. 70, pp. 2071–2080
Little, M. P., Heidenreich, W. F., & Li, G. (2010). Parameter identifiability and redundancy: theoretical considerations. PloS ONE, 5(1), e8915.
Metelli, A.M., Mutti, M., Restelli, M. (2018a). Configurable Markov decision processes. In: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, PMLR, Proceedings of Machine Learning Research, vol. 80, pp. 3488–3497
Metelli, A. M., Papini, M., Faccio, F., & Restelli, M. (2018b). Policy optimization via importance sampling. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3–8, 2018, Montréal, Canada (pp. 5447–5459).
Metelli, A.M., Ghelfi, E., Restelli, M. (2019). Reinforcement learning in configurable continuous environments. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, PMLR, Proceedings of Machine Learning Research, vol. 97, pp. 4546–4555
Metelli, A. M., Papini, M., Montali, N., & Restelli, M. (2020). Importance sampling techniques for policy optimization. Journal of Machine Learning Research, 21, 141:1-141:75.
Morey, R. D., Romeijn, J. W., & Rouder, J. N. (2016). The philosophy of Bayes factors and the quantification of statistical evidence. Journal of Mathematical Psychology, 72, 6–18.
Neu, G., Jonsson, A., Gómez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:170507798
Osa, T., Pajarinen, J., Neumann, G., Bagnell, J.A., Abbeel, P., Peters, J. (2018). An algorithmic perspective on imitation learning. Foundations and Trends in Robotics
Owen, A. B. (2013). Monte Carlo theory, methods and examples.
Peters, J., & Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 682–697.
Petersen, K. B., Pedersen, M. S., et al. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.
Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. London: John Wiley & Sons.
Rajeswaran, A., Lowrey, K., Todorov, E., & Kakade, S. M. (2017). Towards generalization and simplicity in continuous control. In: Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R (eds) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, pp. 6550–6561.
Ramponi, G., Likmeta, A., Metelli, A.M., Tirinzoni, A., Restelli, M. (2020). Truly batch model-free inverse reinforcement learning about multiple intentions. In: Chiappa S, Calandra R (eds) Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR, Online, Proceedings of Machine Learning Research, vol. 108, pp. 2359–2369
Reddy, S., Dragan, A.D., Levine, S. (2019). Sqil: Imitation learning via regularized behavioral cloning. arXiv preprint arXiv:190511108
Rényi, A. (1961). On measures of entropy and information. Technical report, Hungarian Academy of Sciences, Budapest, Hungary.
Rothenberg, T. J., et al. (1971). Identification in parametric models. Econometrica, 39(3), 577–591.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Adaptive computation and machine learning. Cambridge: MIT Press.
Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, & K. Müller (Eds.), Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999] (pp. 1057–1063). Cambridge: The MIT Press.
Sutton, R.S., Szepesvári, C., Geramifard, A., Bowling, M.H. (2008). Dyna-style planning with linear function approximation and prioritized sweeping. In: McAllester DA, Myllymäki P (eds) UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland, July 9–12, 2008, AUAI Press, pp. 528–536
Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In: Compressed Sensing, Cambridge University Press, pp. 210–268
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1), 60–62.
Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pp. 94–116
Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K. (2008). Maximum entropy inverse reinforcement learning. In: Fox D, Gomes CP (eds) Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13–17, 2008, AAAI Press, pp. 1433–1438
Ziebart, B.D., Bagnell, J.A., Dey, A.K. (2010). Modeling interaction via the principle of maximum causal entropy. In: Fürnkranz J, Joachims T (eds) Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21–24, 2010, Haifa, Israel, Omnipress, pp. 1255–1262
Funding
Open access funding provided by Politecnico di Milano within the CRUI-CARE Agreement.
Appendices
A Proofs and derivations
In this appendix, we report the proofs and derivations of the results presented in the main paper.
A.1 Gaussian and Boltzmann linear policies as exponential family distributions
In this appendix, we show how a multivariate Gaussian with fixed covariance and a Boltzmann policy, both linear in the state features \(\varvec{{\phi }}(s)\), can be cast into Assumption 3. We are going to make use of the following identities regarding the Kronecker product (Petersen et al., 2008):
where \(\mathrm {vec} (\varvec{{X}})\) is the vectorization of matrix \(\varvec{{X}}\) obtained by stacking the columns of \(\varvec{{X}}\) into a single column vector.
A.1.1 Multivariate linear Gaussian policy with fixed covariance
The typical representation of a multivariate linear Gaussian policy is given by the following probability density function:
where \(\widetilde{\varvec{{\theta }}} \in {\mathbb {R}}^{k \times q}\) is a properly sized matrix. Recalling Assumption 3, we rephrase the previous equation as:
Recalling the identities at Eqs. (12) and (13) and observing that \(\varvec{{\phi }}(s)^T \widetilde{\varvec{{\theta }}}^T \varvec{{\varSigma }}^{-1} \varvec{{a}}\) and \(\varvec{{\phi }}(s)^T \widetilde{\varvec{{\theta }}}^T \varvec{{\varSigma }}^{-1} \widetilde{\varvec{{\theta }}} \varvec{{\phi }}(s)\) are scalar, we can rewrite:
Now, by redefining the parameter of the exponential family distribution \(\varvec{{\theta }} = \mathrm {vec}\left( \widetilde{\varvec{{\theta }}}^T\right)\) we state the following definitions to comply with Assumption 3:
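The rewriting above hinges on the vectorization identity \(\varvec{{\phi }}(s)^T \widetilde{\varvec{{\theta }}}^T \varvec{{v}} = \mathrm {vec}(\widetilde{\varvec{{\theta }}}^T)^T (\varvec{{v}} \otimes \varvec{{\phi }}(s))\). The following NumPy snippet is a minimal numerical check of this identity for the Gaussian exponent (under our reading, with the sufficient statistic proportional to \((\varvec{{\varSigma }}^{-1}\varvec{{a}}) \otimes \varvec{{\phi }}(s)\)); sizes and values are arbitrary and purely illustrative.

```python
# Numerical check of the vectorization/Kronecker identity behind the rewriting:
# phi^T theta_tilde^T Sigma^{-1} a == vec(theta_tilde^T)^T ((Sigma^{-1} a) kron phi).
# Sizes and values are arbitrary; this is only an illustration.
import numpy as np

rng = np.random.default_rng(0)
k, q = 3, 4                                      # action and feature dimensions
theta_tilde = rng.normal(size=(k, q))            # matrix parameter
phi = rng.normal(size=q)                         # feature vector phi(s)
a = rng.normal(size=k)                           # action
L = rng.normal(size=(k, k))
Sigma_inv = np.linalg.inv(L @ L.T + np.eye(k))   # inverse of a positive-definite covariance

lhs = phi @ theta_tilde.T @ Sigma_inv @ a        # bilinear form in the Gaussian exponent
theta = theta_tilde.ravel()                      # row-major ravel equals vec(theta_tilde^T)
rhs = theta @ np.kron(Sigma_inv @ a, phi)        # exponential-family form <theta, t(s,a)>

assert np.allclose(lhs, rhs)
```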
A.1.2 Boltzmann linear policy
The Boltzmann policy on a finite set of actions \(\{a_1,...,a_{k+1}\}\) is typically represented by means of a matrix of parameters \(\widetilde{\varvec{{\theta }}} \in {\mathbb {R}}^{k \times q}\) (see footnote 14):
where with \(\widetilde{\varvec{{\theta }}}_i\) we denote the i-th row of the matrix \(\widetilde{\varvec{{\theta }}}\). In order to comply with Assumption 3, we rewrite the density function in the following form:
By introducing the vector \(\varvec{{e}}_i\) as the i–th vector of the canonical basis of \({\mathbb {R}}^k\), i.e., the vector having 1 in the i–th component and 0 elsewhere, and recalling the definition of Kronecker product, we can derive the following identity for \(i \le k\):
In the case \(i=k+1\), it is sufficient to replace the previous term with the zero vector \(\varvec{{0}}\). Therefore, by renaming \(\varvec{{\theta }} = \mathrm {vec}\left( \widetilde{\varvec{{\theta }}}^T\right)\), we can make the following assignments in order to get the relevant quantities in Assumption 3:
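As a sanity check of these assignments, the centered sufficient statistic \((\widetilde{\varvec{{e}}}_i - \varvec{{\pi }}) \otimes \varvec{{\phi }}(s)\) derived in Appendix A.2.1 should coincide with the gradient of the log-policy with respect to \(\varvec{{\theta }} = \mathrm {vec}(\widetilde{\varvec{{\theta }}}^T)\); the finite-difference verification below is a small illustration with arbitrary sizes, not part of the experimental code.

```python
# Finite-difference check that grad_theta log pi_theta(a_i|s) equals
# (e_tilde_i - pi) kron phi(s) for the Boltzmann linear policy (cf. Appendix A.2.1).
# Sizes and values are arbitrary; this is only an illustration.
import numpy as np

rng = np.random.default_rng(1)
k, q, eps = 3, 2, 1e-6
theta_tilde = rng.normal(size=(k, q))
phi = rng.normal(size=q)

def log_pi(theta_flat, i):
    logits = np.append(theta_flat.reshape(k, q) @ phi, 0.0)   # (k+1)-th row is zero
    return logits[i] - np.log(np.sum(np.exp(logits)))

theta = theta_tilde.ravel()                                   # theta = vec(theta_tilde^T)
logits = np.append(theta_tilde @ phi, 0.0)
pi = np.exp(logits) / np.sum(np.exp(logits))                  # action probabilities

for i in range(k + 1):
    e_tilde = np.zeros(k)
    if i < k:
        e_tilde[i] = 1.0                                      # zero vector when i = k+1
    analytic = np.kron(e_tilde - pi[:k], phi)
    numeric = np.array([(log_pi(theta + eps * d, i) - log_pi(theta - eps * d, i)) / (2 * eps)
                        for d in np.eye(k * q)])
    assert np.allclose(analytic, numeric, atol=1e-5)
```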
A.2 Results on exponential family
In this appendix, we derive several results that are used in Section 4, concerning policies belonging to the exponential family, as in Assumption 3.
A.2.1 Fisher information matrix
We start by providing an expression of the Fisher Information matrix (FIM) for the specific case of the exponential family, which we are going to use extensively in the derivations. We first define the FIM for a fixed state and then provide its expectation under the state distribution \(\nu\). For any state \(s \in {\mathcal {S}}\), we define the FIM induced by \(\pi _{\varvec{{\theta }}}(\cdot |s)\) as:
We can derive the following immediate result.
Lemma 3
For a policy \(\pi _{\varvec{{\theta }}}\) belonging to the exponential family, as in Assumption 3, the FIM for state \(s \in {\mathcal {S}}\) is given by the covariance matrix of the sufficient statistic:
Proof
Let us first compute the gradient log-policy for the exponential family:
Now, we just need to apply the definition given in Eq. (14) and to recall the definition of covariance matrix:
\(\square\)
We now define the expected FIM \({\mathcal {F}}(\varvec{{\theta }})\) and its corresponding estimator \(\widehat{{\mathcal {F}}}(\varvec{{\theta }})\) under the \(\gamma\)–discounted stationary distribution induced by the agent’s policy \(\pi ^\text {Ag}\):
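As an illustration of the estimator (a sample covariance of the centered sufficient statistic, in line with Lemma 3), the following sketch builds \(\widehat{{\mathcal {F}}}(\varvec{{\theta }})\) for a Boltzmann linear policy on synthetic data; the state distribution, the sizes, and the sampling routine are arbitrary placeholders, not the experimental setup.

```python
# Sketch: empirical FIM of a Boltzmann linear policy as the sample covariance of
# the centered sufficient statistic t_bar(s, a, theta). Synthetic data, for
# illustration only.
import numpy as np

rng = np.random.default_rng(0)
k, q, n = 3, 4, 5000                           # #actions - 1, feature dim, #samples
theta_tilde = rng.normal(size=(k, q))

def probs(phi):
    logits = np.append(theta_tilde @ phi, 0.0)
    e = np.exp(logits - logits.max())
    return e / e.sum()

F_hat = np.zeros((k * q, k * q))
for _ in range(n):
    phi = rng.normal(size=q)                   # stands in for phi(s), s drawn from the data
    p = probs(phi)
    a = rng.choice(k + 1, p=p)                 # a ~ pi_theta(.|s)
    e = np.zeros(k)
    if a < k:
        e[a] = 1.0                             # zero vector for the (k+1)-th action
    t_bar = np.kron(e - p[:k], phi)            # centered sufficient statistic
    F_hat += np.outer(t_bar, t_bar) / n        # sample covariance = empirical FIM
```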
Finally, we provide a sufficient condition ensuring that the FIM \({{\mathcal {F}}}(\varvec{{\theta }})\) is non-singular in the case of Gaussian and Boltzmann linear policies.
Proposition 1
If the second moment matrix of the feature vector \(\mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{s \sim \nu } \left[ \varvec{{\phi }}(s)\varvec{{\phi }}(s)^T \right]\) is non–singular, the identifiability condition of Lemma 2 is fulfilled by the Gaussian and Boltzmann linear policies for all \(\varvec{{\theta }} \in \varTheta\) , provided that each action is played with non–zero probability for the Boltzmann policy.
Proof
Let us start with the Boltzmann policy and consider the expression of \(\varvec{{{\overline{t}}}}(s,a_i)\) with \(i\in \{1,...,k\}\):
where \(\varvec{{\pi }}\) is a vector defined as \(\varvec{{\pi }} = \left( \pi _{\varvec{{\theta }}}(a_1|s),...,\pi _{\varvec{{\theta }}}(a_k|s) \right) ^T\) and we exploited the distributivity of the Kronecker product. While for \(i=k+1\), we have \(\left( \varvec{{0}} - \varvec{{\pi }} \right) \otimes \varvec{{\phi }}(s)\). For the sake of the proof, let us define \(\widetilde{\varvec{{e}}}_i = \varvec{{e}}_i\) if \(i \le k\) and \(\widetilde{\varvec{{e}}}_{k+1} = \varvec{{0}}\). Let us compute the FIM:
where we exploited the distributivity of the Kronecker product, observed that \(\mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{a \sim \pi _{\varvec{{\theta }}} (\cdot |s)} \left[ \widetilde{\varvec{{e}}_i}\right] = \varvec{{\pi }}\) and \(\mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{a \sim \pi _{\varvec{{\theta }}} (\cdot |s)} \left[ \widetilde{\varvec{{e}}_i}\widetilde{\varvec{{e}}_i}^T \right] = \mathrm {diag} (\varvec{{\pi }})\). Let us now consider the matrix:
Consider a generic row \(i \in \{1,...,k\}\). The element on the diagonal is \(\pi _{\varvec{{\theta }}}(a_i|s) - \pi _{\varvec{{\theta }}}(a_i|s)^2 = \pi _{\varvec{{\theta }}}(a_i|s) \left( 1- \pi _{\varvec{{\theta }}}(a_i|s)\right)\), while the sum of the absolute values of the off-diagonal elements is:
Therefore, if all actions are played with non–zero probability, i.e., \(\pi _{\varvec{{\theta }}}(a_i|s) > 0\) for all \(i \in \{1,...,k+1\}\), it follows that the matrix is strictly diagonally dominant by rows and thus positive definite. If also \(\mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{s \sim \nu } \left[ \varvec{{\phi }}(s)\varvec{{\phi }}(s)^T \right]\) is positive definite, by the properties of the Kronecker product, the FIM is positive definite.
Let us now focus on the Gaussian policy. Let \(\varvec{{a}} \in {\mathbb {R}}^d\) and denote \(\varvec{{\mu }}(s) = \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{\varvec{{a}} \sim \pi _{\varvec{{\theta }}}(\cdot |s)} \left[ \varvec{{a}} \right]\):
Let us compute the FIM:
If \(\varvec{{\varSigma }}\) has finite values, then \(\varvec{{\varSigma }}^{-1}\) will be positive definite and additionally, considering that \(\mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{s \sim \nu } \left[ \varvec{{\phi }}(s)\varvec{{\phi }}(s)^T \right]\) is positive definite, we have that the FIM is positive definite. \(\square\)
A.2.2 Subgaussianity assumption
From Assumption 4, we can prove the following result that upper bounds the maximum eigenvalue \(\lambda _{\max }\) of the Fisher information matrix with the subgaussianity parameter \(\sigma\).
Lemma 4
Under Assumption 4, for any \(\varvec{{\theta }} \in \varTheta\) and for any \(s \in {\mathcal {S}}\) the maximum eigenvalue of the Fisher Information matrix \({\mathcal {F}}(\varvec{{\theta }},s)\) is upper bounded by \(d\sigma ^2\).
Proof
Recall that the maximum eigenvalue of a matrix \(\varvec{{A}}\) can be computed as \(\sup _{\varvec{{x}} : \left\| \varvec{{x}} \right\| _2 \le 1} \varvec{{x}}^T\varvec{{Ax}}\) and the norm of a vector \(\varvec{{y}}\) can be computed as \(\sup _{\varvec{{x}} : \left\| \varvec{{x}} \right\| _2 \le 1} \varvec{{x}}^T \varvec{{y}}\). Consider now the derivation for a generic \(\varvec{{x}} \in {\mathbb {R}}^d\) such that \(\left\| \varvec{{x}} \right\| _2 \le 1\):
where we employed Lemma 3 and upper bounded the right hand side. By taking the supremum over \(\varvec{{x}} \in {\mathbb {R}}^d\) such that \(\left\| \varvec{{x}} \right\| _2 \le 1\) we get:
By applying the first inequality in Remark 2.2 of Hsu et al., (2012) and setting \(\varvec{{A}} = \varvec{{I}}\) we get that \(\mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{a \sim \pi _{\varvec{{\theta }}} (\cdot |s)} \left[ \left\| \varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }}) \right\| _2^2 \right] \le d\sigma ^2\). \(\square\)
We now show that the subgaussianity assumption is satisfied by the Boltzmann and Gaussian policies, as defined in Table 1, under the following assumption.
Assumption 6
(Boundedness of Features) For any \(s \in {\mathcal {S}}\) the feature function is bounded in \(L_2\)-norm, i.e., there exists \(\varPhi _{\max } < \infty\) such that \(\left\| \varvec{{\phi }}(s) \right\| _2 \le \varPhi _{\max }\).
Proposition 2
Under Assumption 6, Assumption 4 is fulfilled by the Boltzmann linear policy with parameter \(\sigma = 2 \varPhi _{\max }\) and by the Gaussian linear policy with parameter \(\sigma = \frac{\varPhi _{\max }}{\sqrt{ \lambda _{\min } \left( \varvec{{\varSigma }} \right) }}\).
Proof
Let us start with the Boltzmann policy. From the definition of subgaussianity given in Assumption 4, requiring that the random vector \(\varvec{{{\overline{t}}}}(s,a_i, \varvec{{\theta }})\) is subgaussian with parameter \(\sigma\) is equivalent to requiring that the random (scalar) variable \(\frac{1}{\Vert \varvec{{\alpha }}\Vert _2} \varvec{{\alpha }}^T\varvec{{{\overline{t}}}}(s,a_i, \varvec{{\theta }})\) is subgaussian with parameter \(\sigma\) for any \(\varvec{{\alpha }} \in {\mathbb {R}}^d\). Thus, we now bound the term:
where we used the Cauchy–Schwarz inequality, the identity \(\left\| \varvec{{x}} \otimes \varvec{{y}} \right\| _2^2 = \left( \varvec{{x}} \otimes \varvec{{y}}\right) ^T \left( \varvec{{x}} \otimes \varvec{{y}}\right) = \left( \varvec{{x}}^T \varvec{{x}} \right) \otimes \left( \varvec{{y}}^T \varvec{{y}} \right) = \left\| \varvec{{x}} \right\| _2^2 \left\| \varvec{{y}} \right\| _2^2\), and the inequality \(\left\| \widetilde{\varvec{{e}}}_i - \varvec{{\pi }} \right\| _2^2 \le 2\). Therefore, the random variable \(\frac{1}{\Vert \varvec{{\alpha }}\Vert _2} \varvec{{\alpha }}^T\varvec{{{\overline{t}}}}(s,a_i, \varvec{{\theta }})\) is bounded by \(2 \varPhi _{\max }\). Thanks to Hoeffding’s lemma, we have that the subgaussianity parameter is \(\sigma = 2 \varPhi _{\max }\).
Let us now consider the Gaussian policy. Let \(\varvec{{a}} \in {\mathbb {R}}^d\) and denote with \(\varvec{{\mu }}(s) = \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{\varvec{{a}} \sim \pi _{\varvec{{\theta }}}(\cdot |s)} \left[ \varvec{{a}} \right]:\)
Let us first observe that we can rewrite:
where \(\beta _i = \sum _{j} \alpha _{ij}\phi (s)_j\) for \(i \in \{1,...,k\}\). We now proceed with explicit computations:
Now we complete the square:
Thus, we have:
Now, we observe that:
having derived from the Cauchy–Schwarz inequality:
We get the result by setting \(\sigma = \varPhi _{\max }\sqrt{ \left\| \varvec{{\varSigma }}^{-1} \right\| _2} = \frac{\varPhi _{\max }}{\sqrt{ \lambda _{\min } \left( \varvec{{\varSigma }} \right) }}\). \(\square\)
Furthermore, we report for completeness the standard Hoeffding concentration inequality for subgaussian random vectors.
Proposition 3
Let \(\varvec{{X}}_1,\varvec{{X}}_2, ..., \varvec{{X}}_n\) be n i.i.d. zero–mean subgaussian d–dimensional random vectors with parameter \(\sigma \ge 0\). Then, for any \(\varvec{{\alpha }} \in {\mathbb {R}}^d\) and \(\epsilon > 0\), it holds that:
Proof
The proof is analogous to that of the Hoeffding inequality for bounded random variables. Let \(s \ge 0\):
where we employed Markov inequality, exploited the subgaussianity assumption and the independence. We minimize the last expression over s, getting the optimal \(s = \frac{\epsilon n}{ \left\| \varvec{{\alpha }}\right\| _2^2 \sigma ^2}\), from which we get the result:
\(\square\)
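Since the chain of inequalities is not reproduced above, we note that the computation behind this proof is the standard Chernoff-bound argument (consistent with the optimal \(s\) stated in the proof): for \(s \ge 0\),

$$\Pr \left( \frac{1}{n}\sum _{i=1}^n \varvec{{\alpha }}^T \varvec{{X}}_i \ge \epsilon \right) \le e^{-s\epsilon } \prod _{i=1}^n \mathop {{{\,{{\mathbb { E}}}\,}}}\left[ \exp \left( \frac{s}{n} \varvec{{\alpha }}^T \varvec{{X}}_i \right) \right] \le \exp \left( -s\epsilon + \frac{s^2 \left\| \varvec{{\alpha }}\right\| _2^2 \sigma ^2}{2n} \right),$$

and the exponent is minimized at \(s = \frac{\epsilon n}{ \left\| \varvec{{\alpha }}\right\| _2^2 \sigma ^2}\), which gives the bound \(\exp \left( - \frac{n \epsilon ^2}{2 \left\| \varvec{{\alpha }}\right\| _2^2 \sigma ^2} \right)\).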
Under Assumption 4, we provide the following concentration inequality for the minimum eigenvalue of the empirical FIM.
Proposition 4
Let \({\mathcal {F}}(\varvec{{\theta }})\) and \(\widehat{{\mathcal {F}}}(\varvec{{\theta }})\) be the FIM and its estimate obtained with \(n>0\) independent samples. Then, under Assumption 4, for any \(\epsilon > 0\) it holds that:
where \(\psi _\sigma > 0\) is a constant depending only on the subgaussianity parameter \(\sigma\). In particular, under the following condition on n we have that, for any \(\delta \in [0,1]\) and \(\alpha \in [0,1)\) it holds that \({\lambda }_{\min }( \widehat{{\mathcal {F}}}(\varvec{{\theta }}) ) > \alpha {\lambda }_{\min }( {{\mathcal {F}}}(\varvec{{\theta }}) )\) with probability at least \(1-\delta\):
Proof
Let us recall that \(\widehat{{\mathcal {F}}}(\varvec{{\theta }})\) and \({\mathcal {F}}(\varvec{{\theta }})\) are both symmetric positive semidefinite matrices, thus their eigenvalues \(\lambda _j\) correspond to their singular values \(\sigma _j\). Let us consider the following sequence of inequalities:
where the last inequality follows from Ben-Israel and Greville (2003). Therefore, all it takes is to bound the norm of the difference. For this purpose, we employ Corollary 5.50 and Remark 5.51 of Vershynin (2012), having observed that the FIM is indeed a covariance matrix and its estimate is a sample covariance matrix. We obtain that with probability at least \(1-\delta\):
where \(\psi _\sigma \ge 0\) is a constant depending on the subgaussianity parameter \(\sigma\). Recalling, from Lemma 4, that \(\left\| {\mathcal {F}}(\varvec{{\theta }}) \right\| = \lambda _{\max } \left( {\mathcal {F}}(\varvec{{\theta }}) \right) \le d \sigma ^2\), we can rewrite the previous inequality as:
By setting the right hand side equal to \(\epsilon\) and solving for \(\delta\), we get the first result. The value of n can be obtained by setting the right hand side equal to \((1-\alpha )\lambda _{\min }({\mathcal {F}}(\varvec{{\theta }}))\). \(\square\)
A.2.3 Concentration result
We are now ready to provide the main result of this section, which consists in a concentration result on the negative log–likelihood. Our final goal is to provide a probabilistic bound on the differences \(\ell (\widehat{\varvec{{\theta }}}) - \ell ({\varvec{{\theta }}}^\text {Ag})\) and \({\widehat{\ell }}({\varvec{{\theta }}}^\text {Ag}) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}})\). To this purpose, we start with a technical lemma (Lemma 5), which provides a concentration result involving a quantity that will be used later, under Assumption 4. Then, we use this result to obtain the concentration of the parameters, i.e., bounding the distance \(\left\| \widehat{ \varvec{{\theta }}} - \varvec{{\theta }}^\text {Ag} \right\| _2\) (Theorem 6), under suitable well–conditioning properties of the involved quantities. Finally, we employ the latter result to prove the concentration of the negative log–likelihood (Corollary 1). Some parts of the derivation are inspired by Li et al. (2017).
Lemma 5
Under Assumptions 2 and 4, let \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\) be a dataset of \(n>0\) independent samples, where \(s_i \sim d_{\mu }^{\pi _{\varvec{{\theta }}^\text {Ag}}}\) and \(a_i \sim \pi _{\varvec{{\theta }}^\text {Ag}}(\cdot |s_i)\). For any \(\varvec{{\theta }} \in \varTheta\), let \(\varvec{{g}}(\varvec{{\theta }})\) be defined as:
Let \(\widehat{\varvec{{\theta }}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } {\widehat{\ell }}(\varvec{{\theta }})\), where \({\widehat{\ell }}(\varvec{{\theta }}) = -\frac{1}{n} \sum _{i=1}^n \log \pi _{\varvec{{\theta }}}(a_i|s_i)\) is the empirical negative log–likelihood. Then, under Assumption 4, for any \(\delta \in [0,1]\), with probability at least \(1-\delta\), it holds that:
Proof
The negative log–likelihood of a policy complying with Assumption 3 is \({\mathcal {C}}^2 ({\mathbb {R}}^d)\). Thus, since \(\widehat{\varvec{{\theta }}}\) is a minimizer of the negative log–likelihood function \({\widehat{\ell }}(\varvec{{\theta }})\), it must fulfill the following first–order condition:
As a consequence, we can rewrite the expression of \(\varvec{{g}}(\widehat{\varvec{{\theta }}})\) exploiting this condition:
By recalling that \(a_i \sim \pi _{\varvec{{\theta }}^\text {Ag}} (\cdot |s_i)\) it immediately follows that \(\varvec{{g}}(\widehat{\varvec{{\theta }}})\) is a zero-mean random vector, i.e., \(\mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{\begin{array}{c} s_i \sim \nu \\ a_i \sim \pi _{\varvec{{\theta }}^\text {Ag}}(\cdot |s_i) \end{array} } \left[ \varvec{{g}}(\widehat{\varvec{{\theta }}}) \right] = \varvec{{0}}\). Moreover, under Assumption 4, \(\varvec{{g}}(\widehat{\varvec{{\theta }}})\) is the sample mean of subgaussian random vectors. Our goal is to bound the probability \(\Pr \left( \left\| \varvec{{g}}(\widehat{\varvec{{\theta }}}) \right\| _2 > \epsilon \right)\); to this purpose we consider the following derivation:
where we exploited in line (P.9) the fact that, for a d-dimensional vector \(\varvec{{x}}\), if \(\left\| \varvec{{x}} \right\| _2 > \epsilon\) then at least one component \(j=1,...,d\) must satisfy \(x_j^2 > \frac{\epsilon ^2}{d}\), and we used a union bound over the d dimensions to get line (P.10). Since for each \(j=1,...,d\) we have that \(g_j(\widehat{\varvec{{\theta }}})\) is a zero-mean subgaussian random variable, we can bound the deviation using standard results (Boucheron et al., 2013):
Putting all together we get:
By setting \(\delta = 2 d \exp \left\{ - \frac{\epsilon ^2 n}{2 d \sigma ^2 } \right\}\) and solving for \(\epsilon\) we get the result. \(\square\)
We can now use the previous result to derive the concentration of the parameters, i.e., bounding the deviation \(\left\| \widehat{ \varvec{{\theta }}} - \varvec{{\theta }}^\text {Ag} \right\| _2\).
Theorem 6
(Parameter concentration) Under Assumptions 2 and 4, let \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\) be a dataset of \(n>0\) independent samples, where \(s_i \sim \nu\) and \(a_i \sim \pi _{\varvec{{\theta }}^\text {Ag}}(\cdot |s_i)\). Let \(\widehat{\varvec{{\theta }}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } {\widehat{\ell }}(\varvec{{\theta }})\). If the empirical FIM \(\widehat{{\mathcal {F}}}(\varvec{{\theta }})\) has a positive minimum eigenvalue \({\widehat{\lambda }}_{\min } > 0\) for all \(\varvec{{\theta }} \in \varTheta\), then for any \(\delta \in [0,1]\), with probability at least \(1-\delta\), it holds that:
Proof
Recalling that \(\varvec{{g}}({\varvec{{\theta }}}^\text {Ag}) =\varvec{{0}}\), we employ the mean value theorem to rewrite \(\varvec{{g}}(\widehat{\varvec{{\theta }}})\) centered in \({\varvec{{\theta }}}^\text {Ag}\):
where \(\overline{\varvec{{\theta }}} = t \widehat{\varvec{{\theta }}} + (1-t) {\varvec{{\theta }}}^\text {Ag}\) for some \(t \in [0,1]\) and \(\widehat{{\mathcal {F}}}(\overline{\varvec{{\theta }}})\) is defined as:
where we exploited the expression of \(\nabla _{\varvec{{\theta }}} \log \pi _{\varvec{{\theta }}}(a|s)\) and the definition of Fisher information matrix given in Eq. (14). Under the hypothesis of the statement, we can derive the following lower bound:
By solving for \(\left\| \widehat{\varvec{{\theta }}} -{\varvec{{\theta }}}^\text {Ag} \right\| _2\) and applying Lemma 5 we get the result. \(\square\)
Finally, we can get the concentration result for the negative log–likelihood.
Corollary 1
(Negative log–likelihood concentration) Under Assumptions 2 and 4, let \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\) be a dataset of \(n>0\) independent samples, where \(s_i \sim \nu\) and \(a_i \sim \pi _{\varvec{{\theta }}^\text {Ag}}(\cdot |s_i)\). Let \(\widehat{\varvec{{\theta }}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } {\widehat{\ell }}(\varvec{{\theta }})\). If \({\lambda }_{\min }(\widehat{{\mathcal {F}}}({\varvec{{\theta }}})) = {\widehat{\lambda }}_{\min } > 0\) for all \(\varvec{{\theta }} \in \varTheta\), then for any \(\delta \in [0,1]\), with probability at least \(1-\delta\), it holds that:
and also:
Proof
Let us start with \(\ell (\widehat{\varvec{{\theta }}}) - \ell ({\varvec{{\theta }}}^\text {Ag})\). We consider the first order Taylor expansion of the negative log–likelihood centered in \({\varvec{{\theta }}}^\text {Ag}\):
where \(\overline{\varvec{{\theta }}} = t \widehat{\varvec{{\theta }}} + (1-t) {\varvec{{\theta }}}^\text {Ag}\) for some \(t \in [0,1]\). We first observe that \(\nabla _{\varvec{{\theta }}} \ell ({\varvec{{\theta }}}^\text {Ag}) = \varvec{{0}}\), since \({\varvec{{\theta }}}^\text {Ag}\) is the true parameter, and we expand \({\mathcal {H}}_{\varvec{{\theta }}} \ell (\overline{\varvec{{\theta }}})\):
By using Lemma 4 to bound the maximum eigenvalue of \({\mathcal {F}}(\overline{\varvec{{\theta }}},s)\), we can state the inequality:
Using the concentration result of Theorem 6, we get the result. Concerning \({\widehat{\ell }}({\varvec{{\theta }}}^\text {Ag}) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}})\), the derivation is analogous with the only difference that the Taylor expansion has to be centered in \(\widehat{\varvec{{\theta }}}\) instead of \({\varvec{{\theta }}}^\text {Ag}\). \(\square\)
To conclude this appendix, we present the following technical result.
Theorem 7
Under Assumptions 2 and 4, let \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\) be a dataset of \(n>0\) independent samples, where \(s_i \sim \nu\) and \(a_i \sim \pi _{\varvec{{\theta }}^\text {Ag}}(\cdot |s_i)\). Let \(\varvec{{\theta }}, \varvec{{\theta }}' \in \varTheta\); then, for any \(\epsilon > 0\), it holds that:
Proof
We write the involved expression explicitly, using Assumption 3, and perform some algebraic manipulations:
Essentially, we are comparing the mean and the sample mean of the random variable \(\left( \varvec{{\theta }} - \varvec{{\theta }}'\right) ^T \varvec{{t}}(s,a) - \left( A(\varvec{{\theta }},s) - A(\varvec{{\theta }}',s) \right)\). Let us now focus on \(A(\varvec{{\theta }},s) - A(\varvec{{\theta }}',s)\). From the mean value theorem we know that, for some \(t \in [0,1]\) and \(\overline{\varvec{{\theta }}} = t\varvec{{\theta }} + (1-t)\varvec{{\theta }}'\), we have:
From Eq. (P.4), we know that \(\nabla _{\varvec{{\theta }}} A(\overline{\varvec{{\theta }}},s) = \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{{\overline{a}} \sim \pi _{\overline{\varvec{{\theta }}}}(\cdot |s)} \left[ \varvec{{t}}(s,{\overline{a}}) \right]\). The random variable \(\varvec{{{\overline{t}}}}(s,a,\overline{\varvec{{\theta }}}) = \varvec{{t}}(s,a) - \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{{\overline{a}} \sim \pi _{\overline{\varvec{{\theta }}}}(\cdot |s)} \left[ \varvec{{t}}(s,{\overline{a}}) \right]\) is a subgaussian random variable for any \(\overline{\varvec{{\theta }}} \in \varTheta\). Thus, under Assumption 4 we have:
If we apply Proposition 3, we get the result. \(\square\)
A.3 Results on significance and power of the tests
Theorem 3 Let \({\widehat{I}}_{c}\) be the set of parameter indexes selected by the Identification Rule 2 obtained using \(n>0\) i.i.d. samples collected with \(\pi _{\varvec{{\theta }}^{\text {Ag}}}\), with \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\). Then, under Assumptions 1, 2, 3, 4, and 5, let \({\varvec{{\theta }}}_i^{\text {Ag}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _i} \ell (\varvec{{\theta }})\) for all \(i \in \{1,...,d\}\) and \(\xi = \min \left\{ 1, \frac{\lambda _{\min }}{\sigma ^2} \right\}\). If \({\widehat{\lambda }}_{\min } \ge \frac{\lambda _{\min }}{2\sqrt{2}}\) and \(\ell ({\varvec{{\theta }}}_i^{\text {Ag}}) - \ell (\varvec{{\theta }}^{\text {Ag}}) \ge c_1\), it holds that:
Proof
We start by considering \(\alpha = \Pr \left( \exists i \notin I^{\text {Ag}}: i \in {\widehat{I}}_{c} \right)\). We employ an argument analogous to that of Garivier and Kaufmann (2019):
where we observed that \({\widehat{\ell }}(\varvec{{\theta }}^\text {Ag}) \ge {\widehat{\ell }}(\widehat{\varvec{{\theta }}_i})\), as \(\varvec{{\theta }}^\text {Ag} \in \varTheta _i\) under \({\mathcal {H}}_0\), and we applied Corollary 1 in the last line, recalling that \({\widehat{\lambda }}_{\min } \ge \frac{\lambda _{\min }}{2\sqrt{2}}\). For the second inequality, the derivation is slightly more involved. Concerning \(\beta = \Pr \left( \exists i \in I^{\text {Ag}}: i \notin {\widehat{I}}_{c} \right)\), we first perform a union bound:
Let us now focus on the single terms \(\Pr \left( i \notin {\widehat{I}}_{c} \right)\). We now perform the following manipulations:
where line (P.18) is obtained by observing that \({\widehat{\ell }}({\varvec{{\theta }}^\text {Ag}}) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}}) \ge 0\). Thus, we have:
where line (P.20) derives from the inequality \(\Pr (X+Y \ge c) \le \Pr (X \ge a) + \Pr (Y \ge b)\) with \(c=a+b\), line (P.21) is obtained by the following second order Taylor expansion, recalling that \(\nabla _{\varvec{{\theta }}} {\ell }({\varvec{{\theta }}}^\text {Ag}) = \varvec{{0}}\):
where \(\overline{\varvec{{\theta }}} = t \varvec{{\theta }}^\text {Ag} + (1-t) \varvec{{\theta }}^\text {Ag}_i\) for some \(t \in [0,1]\). Line (P.22) is obtained by applying Corollary 1, recalling that \({\widehat{\lambda }}_{\min } \ge \frac{\lambda _{\min }}{2\sqrt{2}}\) and Theorem 7. Finally, line (P.23) derives by introducing the term \(\xi = \min \left\{ 1, \frac{\lambda _{\min }}{\sigma ^2} \right\}\) and observing that:
Clearly, this result is meaningful as long as \({\ell }({\varvec{{\theta }}_i}^\text {Ag}) - {\ell }({\varvec{{\theta }}}^\text {Ag}) - c_1 \ge 0\). \(\square\)
B Details on identification rules with configurable environment
In the following, we report the pseudocode for the environment configuration procedure in the case of application of Identification Rule 1 (Algorithm 4) which was omitted in the main text.
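Since the pseudocode does not render here, the following Python-style sketch only illustrates one possible organization of the configure-then-test loop, following the description given in Appendix C.1 (configure the environment to emphasize one not-yet-rejected feature at a time, let the agent re-optimize, then run the GLR tests). It is not the paper’s Algorithm 4: all helper routines are hypothetical placeholders and the internals of the combinatorial GLR tests are omitted.

```python
# Rough sketch of the environment-configuration loop for the combinatorial rule
# (a reading of the procedure described in Appendix C.1, not the paper's
# Algorithm 4). All helpers passed as arguments are hypothetical placeholders.

def configure_and_identify(env, agent, features, n_conf, n_samples, delta,
                           find_configuration, collect_dataset, run_glr_tests):
    identified = set()                 # parameters whose null hypothesis was rejected,
                                       # i.e., parameters we believe the agent controls
    for i, feature in enumerate(features):
        for _ in range(n_conf):        # at most N_conf configuration attempts per feature
            if i in identified:
                break                  # feature already rejected, move to the next one
            omega = find_configuration(env, agent, feature)   # emphasize this feature
            env.set_configuration(omega)
            agent.learn_optimal_policy(env)                   # the agent re-optimizes
            data = collect_dataset(env, agent, n_samples)
            identified |= run_glr_tests(data, delta)          # combinatorial GLR tests
    return identified
```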

C Experimental details
In this appendix, we report the full experimental results, along with the hyperparameters employed.
C.1 Experimental details for Section 6.1
C.1.1 Discrete grid world
Hyperparameters. In the following, we report the hyperparameters used for the experiments on the discrete grid world:

- Horizon (T): 50
- Discount factor (\(\gamma\)): 0.98
- Learning steps with G(PO)MDP: 200
- Batch size: 250
- Max-likelihood maximum update steps: 1000
- Max-likelihood learning rate (using Adam): 0.03
- Number of configuration attempts per feature (\(N_{\mathrm {conf}}\)): 3
- Environment configuration update steps: 150
- Regularization parameter of the Rényi divergence (\(\zeta\)): 0.125
- Significance of the likelihood-ratio tests (\(\delta\)): 0.01
Example of configuration and identification in the discrete grid world. In Fig. 6, we show a graphical representation of a single experiment with the grid world environment using its configurability to better identify the policy space. The colors inside the squares indicate the probability mass function associated with the initial state distribution, consisting of the agent’s position (blue) and the goal position (red), where sharper colors mean higher probabilities. The colored lines represent the features the agent has access to: they are binary features indicating whether the agent is on a certain row or column (blue lines) and whether the goal is on a certain row or column (red lines). Note that, to avoid redundancy of representation (and thus enforce identifiability), the last row and column are not explicitly encoded, as they can be represented by the absence of the other rows and columns. When a line is no longer shown, it means that it has been rejected, i.e., we think the agent has access to that feature. The agent has access to every feature except for the goal columns, i.e., only its own position and the goal row are known.
Environment configuration updates and identification steps alternate across the images. The environment is configured in order to maximize the influence on the gradient of the first not-yet-rejected feature, considering the blue features first and then the red ones. If, after the model has been configured three times for a feature, the feature has still not been rejected, the model is configured for the next one.
We can see that the general trend of this configuration is to change the parameters in order to spread the initial probability mass across a greater number of grid cells. This is an expected behavior since, with the initial model configuration, an episode very often starts with the agent in the bottom-left of the grid and the goal in the bottom-right, causing the policy to depend mostly on the position of the agent. In fact, only blue column features are rejected at the first iteration, as we can see in the third image. Instead, distributing the probabilities across the whole grid lets an episode start with the two positions drawn almost uniformly. Eventually, the correct policy space is identified. It is interesting to observe that such a result can hardly be obtained without the configuration of the environment, given the initial state distribution shown in the first image.
C.1.2 Continuous grid world
In this appendix, we report the experiments performed on the continuous version of the grid world. In this environment, the agent has to reach a goal region, delimited by a circle, starting from an initial position. Both the initial position and the center of the goal are sampled at the beginning of the episode from a Gaussian distribution \(\mu _{\varvec{{\omega }}}\) with fixed covariance. The supervisor is allowed to change, via the parameters \(\varvec{{\omega }}\), the mean of this distribution. The agent specifies, at each time step, the speed in the vertical and horizontal directions, by means of a bivariate Gaussian policy with fixed covariance, linear in a set of radial basis functions (RBF) representing both the current position of the agent and the position of the goal (5\(\times\)5 features for each); a possible implementation is sketched below. The features, and consequently the parameters, that the agent can control are randomly selected at the beginning. In Fig. 7, we show the results of an experiment analogous to that of the discrete grid world, comparing \({\widehat{\alpha }}\) and \({\widehat{\beta }}\) for the case in which we do not perform environment configuration (no-conf) and the case in which the configuration is performed (conf). Once again, we confirm our finding that configuring the environment allows speeding up the identification process by inducing the agent to change its policy and, as a consequence, to reveal which parameters it can actually control.
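The sketch below shows one way the 5\(\times\)5 radial basis features could be computed; the grid of centers and the bandwidth are illustrative choices, not the values used in the experiments.

```python
# Sketch of the 5x5 RBF features for the agent position and the goal position
# (centers and bandwidth are illustrative, not the experimental values).
import numpy as np

def rbf_features(agent_pos, goal_pos, grid_size=5, bandwidth=0.2):
    # centers on a regular grid_size x grid_size grid over the unit square
    ticks = np.linspace(0.0, 1.0, grid_size)
    centers = np.array([[x, y] for x in ticks for y in ticks])

    def rbf(p):
        return np.exp(-np.sum((centers - np.asarray(p)) ** 2, axis=1)
                      / (2.0 * bandwidth ** 2))

    # 25 features for the agent position and 25 for the goal position
    return np.concatenate([rbf(agent_pos), rbf(goal_pos)])

phi = rbf_features([0.1, 0.2], [0.8, 0.9])    # 50-dimensional feature vector
```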
Hyperparameters. In the following, we report the hyperparameters used for the experiments on the continuous grid world:

- Horizon (T): 50
- Discount factor (\(\gamma\)): 0.98
- Policy covariance (\(\varvec{{\varSigma }}\)): \(0.02^2 \varvec{{I}}\)
- Learning steps with G(PO)MDP: 100
- Batch size: 100
- Max-likelihood maximum update steps: closed form
- Number of configuration attempts per feature (\(N_{\mathrm {conf}}\)): 3
- Environment configuration update steps: 100
- Regularization parameter of the Rényi divergence (\(\zeta\)): \(10^{-6}\)
- Significance of the likelihood-ratio tests (\(\delta\)): 0.01
Example of configuration and identification in the continuous grid world. In Fig. 8, we show an example of model configuration in the continuous grid world environment. The two filled circles are a graphical representation of the normal distributions from which the initial position of the agent (light blue) and the position of the goal (pink) are sampled at the beginning of each episode. The circumferences correspond to the candidate set of features (RBF), among which we want to discover the ones actually accessible by the agent. Since the policy space is composed of Gaussian policies whose mean is specified by a linear combination of these features, each feature is associated with a parameter. If a circumference is no longer shown at an iteration step, it means that the hypothesis associated with that feature was rejected, i.e., we believe that the agent has access to that feature.
The group of images is an alternating sequence of new environment configurations and parameter identifications. In the first image, we can see the initial model with no rejected features. The identification with the initial model yields the rejection of a certain set of features, which can be seen in the second image. The third image shows the new configuration of the model, in which the means of the two initial state distributions are moved in order to investigate the remaining features. Then a new test is performed, the result of which is shown in the fourth image, and so on. In this experiment, the environment was configured in order to maximize the influence of one feature at a time, starting from the blue ones, from bottom-left to top-right in row order, and then the red ones in the same order. Each feature is used to configure the model for a maximum of three times, after which the next feature is considered.
The only features that were not actually in the agent’s set are the red ones on the two top rows. We can see that the mean of the initial position of the agent (a configurable parameter of the environment) always tracked the first available feature yet to be tested, as expected in this experiment. In fact, when the initial position is close enough to those features, the agent often moves around those blue circumferences to reach the goal, making them more important in the definition of the optimal policy. Eventually, the tests reject all the features that are actually accessible by the agent, and only them, yielding a correct identification of the policy space. The rest of the configurations are not shown, since no more features were rejected. In this experiment, similarly to the discrete grid world case, the use of Conf–MDPs was crucial to obtain this result.
C.1.3 Simulated car driving
In this environment, an agent has to drive a car to reach the end of the track without running off the road. The control directives are the acceleration and the steering, expressed through a two-dimensional bounded action space. The car has four sensors oriented in different directions: \(-\frac{\pi }{4}\), \(-\frac{\pi }{6}\), \(\frac{\pi }{6}\), \(\frac{\pi }{4}\) w.r.t. the axis pointing toward the front of the car. The values of these sensors are the normalized distances from the car to the nearest road margin along the direction of the sensor, or the maximum value if the margin is outside the range of the sensor. The complete set of state features is made up of the normalized car speed and the values of the four sensors. In the experiments, the agent has access to the speed and to the sensors at angles \(\frac{\pi }{6}\) and \(\frac{\pi }{4}\). The track consists of a single road segment with a fixed curvature. The reward is proportional to the speed of the car, i.e., greater speeds yield higher rewards. The episode ends when the car goes off the road (in which case a negative reward is given), when the track is completed, or when a maximum number of time steps has elapsed.
Hyperparameters. In the following, we report the hyperparameters used for the experiments on the simulated car driving:

- Horizon (T): 250
- Discount factor (\(\gamma\)): 0.996
- Policy covariance (\(\varvec{{\varSigma }}\)): \(0.1 \varvec{{I}}\)
- Learning steps with G(PO)MDP: 100
- Batch size: 50
- Max-likelihood maximum update steps: 200
- Max-likelihood learning rate (using Adam): 0.1
- Significance of the likelihood-ratio tests (\(\delta\)): 0.1, rescaled to 0.1/5 for the simplified identification rule and to 0.1/32 for the combinatorial identification rule
C.2 Experimental details of Section 6.2
Hyperparameters. In the following, we report the hyperparameters used for the experiments on the discrete grid world:

- Horizon (T): 50
- Discount factor (\(\gamma\)): 0.98
- Learning steps with G(PO)MDP: 200
- Batch size: 250
- Max-likelihood maximum update steps: 1000
- Max-likelihood learning rate (using Adam): 0.03
- Number of configuration attempts per feature (\(N_{\mathrm {conf}}\)): 3
- Environment configuration update steps: 150
- Regularization parameter of the Rényi divergence (\(\zeta\)): 0.125
- Significance of the likelihood-ratio tests (\(\delta\)): 0.01
Additional results. In the following, we report the complete results of the imitation learning experiments. These results extend the ones presented in the main paper, providing additional algorithms and additional metrics for comparison.
Concerning the additional algorithms, we include two other regularization techniques for the maximum likelihood estimation: Shannon and Tsallis entropy. Given a policy \(\pi\), the Shannon \({\mathbb {H}}(\pi )\) and Tsallis \({\mathbb {W}}(\pi )\) entropies are defined as follows (Ho and Ermon, 2016; Lee et al., 2018):
It is worth noting that, differently from the other regularizers (like ridge and lasso), the Shannon and Tsallis entropies require computing an expectation w.r.t. the policy \(\pi\) we are optimizing. Since samples are collected with a policy that is, in general, different and unknown (the expert’s policy), those expectations are approximated, in our experiments, with self-normalized importance weighting (Owen, 2013). Thus, the complete loss function that is optimized, ignoring the ridge and lasso regularizers for brevity, is the following:

where \({\widetilde{\omega }}_i(\varvec{{\theta }}) = \frac{n \pi _{\varvec{{\theta }}}(a_i|s_i)}{\sum _{j=1}^n \pi _{\varvec{{\theta }}}(a_j|s_j)}\) is the self-normalized importance weight.
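A minimal sketch of how such a self-normalized estimate can enter the regularized objective is given below, for the Shannon entropy case; the sign convention of the entropy bonus and the absence of an optimization loop are assumptions of this illustration, not the exact loss of the experiments.

```python
# Sketch: negative log-likelihood with a self-normalized importance-weighted
# Shannon entropy bonus (assumed form; the exact loss of the experiments is not
# reproduced here). log_pi_theta: array of log pi_theta(a_i|s_i) on the dataset.
import numpy as np

def snis_entropy_loss(log_pi_theta, alpha_shannon):
    n = log_pi_theta.shape[0]
    nll = -np.mean(log_pi_theta)                  # negative log-likelihood term
    pi = np.exp(log_pi_theta)
    w = n * pi / np.sum(pi)                       # self-normalized weights omega_tilde_i
    entropy_hat = -np.mean(w * log_pi_theta)      # importance-weighted entropy estimate
    return nll - alpha_shannon * entropy_hat      # entropy acts as a bonus (assumed sign)
```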
Furthermore, we have tested other IL methods that natively require interaction with the environment. In our truly batch model-free setting, we again replaced the interaction with the environment with off-policy estimation. These algorithms are based on the notion of feature expectation, i.e., the expectation of a feature function \(\varvec{\phi }(s,a)\) under the \(\gamma\)-discounted stationary distribution induced by a policy \(\pi\):
The goal consists in finding a policy \(\pi _{\widehat{\varvec{{\theta }}}}\) that matches the feature expectations induced by the expert’s policy \(\pi _{{\varvec{{\theta }}}^{\text {Ag}}}\), i.e., \(\varvec{\phi }(\pi _{\widehat{\varvec{{\theta }}}}) \simeq \varvec{\phi }(\pi _{{\varvec{{\theta }}}^{\text {Ag}}})\), while applying a regularization on \(\pi _{\widehat{\varvec{{\theta }}}}\). If the regularization is the Shannon entropy, we obtain Maximum Causal Entropy Inverse Reinforcement Learning (MCE, Ziebart et al., 2010):
where \(\alpha ^S\) is a scale parameter. Instead, if we employ Tsallis entropy we obtain the Maximum Tsallis Entropy Imitation Learning (MTE, Lee et al., 2018):
where \(\alpha ^W\) is a scale parameter.
In both cases, similarly to the regularizers presented above, the computation of the objective requires performing an off-policy estimation via importance sampling. In these cases, we have the additional complexity that the constraint itself, i.e., matching the feature expectations, requires off-policy estimation for its left-hand side.
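To illustrate, one plausible per-sample estimator of that left-hand side, consistent with the self-normalized weights defined above but ignoring the discounting details of the actual implementation, is sketched below.

```python
# Sketch of a self-normalized importance-weighted estimate of the feature
# expectation of pi_theta from the expert's samples (assumed form; discounting
# and the actual constraint handling are omitted).
import numpy as np

def snis_feature_expectation(Phi_sa, log_pi_theta_sa):
    # Phi_sa: (n, q) features phi(s_i, a_i); log_pi_theta_sa: (n,) log pi_theta(a_i|s_i)
    w = np.exp(log_pi_theta_sa)
    w = w / np.sum(w)                       # self-normalized weights (omega_tilde_i / n)
    return Phi_sa.T @ w                     # weighted average of the features
```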
The tables reported in the following pages present the complete results. As comparison metrics, we employed the norm of the parameter difference (Table 2), the estimated expected KL-divergence (Table 3), as defined in Section 6.2, and the norm of the estimated difference in the feature expectations (Table 4). In each table we report, as an oracle baseline, the results of ML assuming knowledge of the parameters actually controlled by the agent (True). FE is a feature-matching baseline, obtained by looking for the policy that best explains the feature expectations induced by the expert’s data:
where \({\widetilde{\omega }}_i(\varvec{{\theta }})\) is the self-normalized importance weight, as defined before. Finally, MCE and MTE are Maximum Causal Entropy and Maximum Tsallis Entropy, adapted with importance sampling. As a general trend, we can see that all the algorithms that employ importance weighting do not perform well. This can be explained by the fact that the expert’s policy, which is likely (near) optimal, does not provide good coverage of the state-action space. As a consequence, the importance weighting procedure injects a large uncertainty (Owen, 2013; Metelli et al., 2018b). This also highlights how the no-interaction setting makes the IL problem challenging.
C.3 Experimental details of Section 6.3
In the minigolf experiment, the polynomial features obtained from the distance from the hole x and the friction f are the following:
While agent \({\mathscr {A}}_1\) perceives all the features, agent \({\mathscr {A}}_2\) has access to \(\left( 1,\, x,\, \sqrt{x} \right) ^T\) only.
Hyperparameters. In the following, we report the hyperparameters used for the experiments on the minigolf:

- Horizon (T): 20
- Discount factor (\(\gamma\)): 0.99
- Policy covariance (\(\varvec{{\varSigma }}\)): 0.01
- Learning steps with G(PO)MDP: 100
- Batch size: 100
- Max-likelihood maximum update steps: closed form
- Number of configuration attempts per feature (\(N_{\mathrm {conf}}\)): 10
- Environment configuration update steps: 100
- Regularization parameter of the Rényi divergence (\(\zeta\)): 0.25
- Significance of the likelihood-ratio tests (\(\delta\)): 0.01
C.3.1 Experiment with randomly chosen features
In the following, we report an additional experiment in the minigolf domain in which the features that the agent can perceive are randomly selected at the beginning, comparing the case in which we do not configure the environment with the case in which environment configuration is performed, for different numbers of collected episodes. Although less visible than in the grid world case, we can see that for some features (e.g., \(\sqrt{x}\) and \(\sqrt{xf}\)) the environment configurability is beneficial.
Keywords
- Reinforcement learning
- Configurable Markov decision processes
- Likelihood ratio test
- Policy space identification