1 Introduction

Reinforcement Learning (RL, Sutton and Barto, 2018) deals with sequential decision–making problems in which an artificial agent interacts with an environment by sensing perceptions and performing actions. The agent’s goal is to find an optimal policy, i.e.,  a prescription of actions that maximizes the (possibly discounted) cumulative reward collected during its interaction with the environment. The performance of an agent in an environment is constrained by its perception and its actuation possibilities, along with the ability to map observations to actions. These three elements define the policy space available to the agent in the learning process. Agents having access to different policy spaces may exhibit different optimal behaviors, even in the same environment. Therefore, the notion of optimality is necessarily connected to the space of policies the agent can access, which we will call the agent’s policy space in the following. While in tabular RL we typically assume access to the complete space of Markovian stationary policies, in continuous control, the policy space needs to be limited. In policy search methods (Deisenroth et al., 2013), the policies are explicitly modeled considering a parametric functional space (Sutton et al., 1999; Peters and Schaal, 2008) or a kernel space (Deisenroth and Rasmussen, 2011; Levine and Koltun, 2013); but even in value–based RL, a function approximator induces a set of representable (greedy) policies. It is important to point out that the notion of policy space is not just an algorithmic convenience. Indeed, the need to limit the policy space naturally emerges in many industrial applications, where some behaviors have to be avoided for safety reasons.

Fig. 1 An example of policy space modeled as a 1-layer neural network showing a limitation in the (a) perception, (b) mapping, and (c) actuation

The knowledge of the agent’s policy space might be useful in some subfields of RL. Recently, the framework of Configurable Markov Decision Process (Conf-MDP, Metelli et al., 2018a) has been introduced to account for the scenarios in which it is possible to configure some environmental parameters. Intuitively, the best environment configuration is intimately related to the agent’s possibilities in terms of policy space. When the configuration activity is performed by an external supervisor, it might be helpful to know which parameters the agent can control in order to select the most appropriate configuration. Furthermore, in the field of Imitation Learning (IL, Osa et al., 2018), figuring out the policy space of the expert’s agent can aid the learning process of the imitating policy, mitigating overfitting/underfitting phenomena.

In this paper, motivated by the examples presented above, we study the problem of identifying the agent’s policy space in a Conf–MDP,\(^{1}\) by observing the agent’s behavior and, possibly, exploiting the configuration opportunities of the environment. We consider the case where the agent’s policy space is a subset of a known super–policy space \(\varPi _{\varTheta }\) induced by a parameter space \(\varTheta \subseteq {\mathbb {R}}^d\). Thus, any policy \(\pi _{\varvec{{\theta }}}\) is determined by a \(d\)–dimensional parameter vector \(\varvec{{\theta }} \in \varTheta\). However, the agent has control over a smaller number \(d^{\text {Ag}}< d\) of parameters (which are unknown), while the remaining ones have a fixed value, namely zero.\(^{2}\) The choice of zero as a fixed value might appear arbitrary, but it is rather a common case in practice. Indeed, the formulation based on the identification of the parameters effectively covers the limitations of the policy space related to perception, actuation, and mapping. For instance, in a linear policy, the fact that the agent does not observe a state feature is equivalent to setting the corresponding parameters to zero. Similarly, in a neural network, removing a neuron is equivalent to neglecting all of its connections, which in turn can be realized by setting the corresponding weights to zero. Figure 1 shows three examples of policy space limitations in the case of a 1–hidden layer neural network policy, which can be realized by setting the appropriate weights to zero.

Our goal is to identify the parameters that the agent can control (and possibly change) by observing some demonstrations of the optimal policy \(\pi ^{\text {Ag}}\) in the policy space \(\varPi _\varTheta\).\(^{3}\) To this end, we formulate the problem as deciding whether each parameter \(\theta _i\) for \(i \in \{1,...,d\}\) is zero, and we address it by means of a frequentist statistical test. In other words, we check whether there is a statistically significant difference between the likelihood of the agent’s behavior with the full set of parameters and the one in which \(\theta _i\) is set to zero. If there is, we conclude that \(\theta _i\) is not zero and, consequently, the agent can control it. Otherwise, either the agent cannot control the parameter, or zero is the value consciously chosen by the agent.

Indeed, there could be parameters that, given the peculiarities of the environment, are useless for achieving the optimal behavior or whose optimal value is actually zero, while they could prove essential in a different environment. For instance, in a grid world where the goal is to reach the right edge, the vertical position of the agent is useless, while if the goal is to reach the upper right corner, both horizontal and vertical positions become relevant. In this spirit, configuring the environment can help the supervisor in identifying whether a parameter set to zero is actually uncontrollable by the agent or just useless in the current environment. Thus, the supervisor can change the environment configuration \(\varvec{{\omega }} \in \varOmega\), so that the agent will adjust its policy, possibly by changing the parameter value and revealing whether it can control such a parameter. Consequently, the new configuration should induce an optimal policy in which the considered parameters have a value significantly different from zero. We formalize this notion as the problem of finding the new environment configuration that maximizes the power of the statistical test and we propose a surrogate objective for this purpose.

The paper is organized as follows. In Sect. 2, we introduce the necessary background. The identification rules (combinatorial and simplified) to perform parameter identification in a fixed environment are presented in Sect. 3 and the simplified one is analyzed in Sect. 4. Sect. 5 shows how to improve them by exploiting the environment configurability. The experimental evaluation, on discrete and continuous domains, is provided in Sect. 6. Besides studying the ability of our identification rules in identifying the agent’s policy space, we apply them to the IL and Conf-MDP frameworks. The proofs not reported in the main paper can be found in Appendix A.

2 Preliminaries

In this section, we report the essential background that will be used in the subsequent sections. For a given set \({\mathcal {X}}\), we denote with \({\mathscr {P}}({\mathcal {X}})\) the set of probability distributions over \({\mathcal {X}}\).

(Configurable) Markov Decision Processes A discrete–time Markov Decision Process (MDP, Puterman, 2014) is defined by the tuple \({\mathcal {M}} = \left( {\mathcal {S}}, {\mathcal {A}}, p, \mu, r, \gamma \right)\), where \({\mathcal {S}}\) and \({\mathcal {A}}\) are the state space and the action space respectively, \(p: {\mathcal {S}} \times {\mathcal {A}} \rightarrow {\mathscr {P}}({\mathcal {S}})\) is the transition model that provides, for every state-action pair \((s,a) \in \mathcal {S} \times \mathcal {A}\), a probability distribution over the next state \(p(\cdot |s,a)\), \(\mu \in {\mathscr {P}}({\mathcal {S}})\) is the distribution of the initial state, \(r: {\mathcal {S}} \times {\mathcal {A}} \rightarrow {\mathbb {R}}\) is the reward model, defining the reward \(r(s,a)\) collected by the agent when performing action \(a\in {\mathcal {A}}\) in state \(s\in {\mathcal {S}}\), and \(\gamma \in [0,1]\) is the discount factor. The behavior of an agent is defined by means of a policy \(\pi : {\mathcal {S}} \rightarrow {\mathscr {P}}({\mathcal {A}})\) that provides a probability distribution over the actions \(\pi (\cdot |s)\) for every state \(s \in {\mathcal {S}}\). We limit the scope to parametric policy spaces \(\varPi _\varTheta = \left\{ \pi _{\varvec{{\theta }}} : \varvec{{\theta }} \in \varTheta \right\}\), where \(\varTheta \subseteq {\mathbb {R}}^d\) is the parameter space. The goal of the agent is to find an optimal policy within \(\varPi _\varTheta\), i.e.,  any policy parametrization that maximizes the expected return:

$$\begin{aligned} \varvec{{\theta }}^\text {Ag} \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } J_{{\mathcal {M}}}(\varvec{{\theta }}) = \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{\begin{array}{c} s_0 \sim \mu \\ a_t \sim \pi _{\varvec{{\theta }}}(\cdot |s_t) \\ s_{t+1} \sim p(\cdot |s_t,a_t) \end{array}} \left[ \sum _{t=0}^{+\infty } \gamma ^t r(s_t,a_t) \right] . \end{aligned}$$
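As an illustration of the objective above, the expected return can be approximated by averaging discounted returns over sampled trajectories. The following is a minimal sketch under stated assumptions: the horizon is truncated to the length of each sampled reward sequence, and `sample_trajectory` is a hypothetical user-supplied callable (not part of the paper) that rolls out \(\pi _{\varvec{{\theta }}}\) in \({\mathcal {M}}\) and returns the list of rewards.

```python
import random


def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one finite (truncated) trajectory."""
    total = 0.0
    # Horner-style accumulation from the last reward backwards.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_expected_return(sample_trajectory, gamma, n_episodes=1000, seed=0):
    """Monte Carlo estimate of J(theta).

    sample_trajectory: hypothetical callable taking a random.Random instance
    and returning the list of rewards of one rollout of the policy.
    """
    rng = random.Random(seed)
    returns = [discounted_return(sample_trajectory(rng), gamma)
               for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```

The truncation introduces a bias of order \(\gamma ^T\) for horizon \(T\), so the sketch is only meaningful for \(\gamma < 1\) or episodic tasks.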

In this paper, we consider a slightly modified version of the Conf–MDPs (Metelli et al., 2018a).

Definition 1

A Configurable Markov Decision Process (Conf–MDP) induced by the configuration space \(\varOmega \subseteq {\mathbb {R}}^p\) is defined as the set of MDPs:

$$\begin{aligned} {\mathcal {C}}_{\varOmega } = \left\{ {\mathcal {M}}_{\varvec{{\omega }}} = \left( {\mathcal {S}}, {\mathcal {A}}, p_{\varvec{{\omega }}}, \mu _{\varvec{{\omega }}}, r, \gamma \right) \,:\, \varvec{{\omega }} \in \varOmega \right\} . \end{aligned}$$

The main differences w.r.t.  the original definition are: i) we allow the configuration of the initial state distribution \(\mu _{\varvec{{\omega }}}\), in addition to the transition model \(p_{\varvec{{\omega }}}\); ii) we restrict to the case of parametric configuration spaces \(\varOmega\); iii) we do not consider the policy space \(\varPi _\varTheta\) as a part of the Conf–MDP.

Generalized Likelihood Ratio Test The Generalized Likelihood Ratio test (GLR, Barnard, 1959; Casella and Berger, 2002) aims at testing the goodness of fit of two statistical models. Given a parametric model having density function \(p(\cdot |{\varvec{{\theta }}})\) with \(\varvec{{\theta }} \in \varTheta\), we aim at testing the null hypothesis \({\mathcal {H}}_0 : \varvec{{\theta }}^\text {Ag} \in \varTheta _0\), where \(\varTheta _0 \subset \varTheta\) is a subset of the parametric space, against the alternative \({\mathcal {H}}_1 : \varvec{{\theta }}^\text {Ag} \in \varTheta \setminus \varTheta _0\). Given a dataset \({\mathcal {D}} = \left\{ X_i \right\} _{i=1}^n\) sampled independently from \(p(\cdot |{\varvec{{\theta }}^\text {Ag}})\), where \(\varvec{{\theta }}^\text {Ag}\) is the true parameter, the GLR statistic is:

$$\begin{aligned} \varLambda = \frac{\sup _{\varvec{{\theta }} \in \varTheta _0} p({\mathcal {D}}|\varvec{{\theta }}) }{\sup _{\varvec{{\theta }} \in \varTheta } p({\mathcal {D}}|\varvec{{\theta }})} = \frac{\sup _{\varvec{{\theta }} \in \varTheta _0} \widehat{{\mathcal {L}}}(\varvec{{\theta }}) }{\sup _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})}, \end{aligned}$$

where \(p({\mathcal {D}}|\varvec{{\theta }}) = \widehat{{\mathcal {L}}}(\varvec{{\theta }}) = \prod _{i=1}^n p(X_i|{\varvec{{\theta }}})\) is the likelihood function. We denote with \({\widehat{\ell }}(\varvec{{\theta }}) = -\log \widehat{{\mathcal {L}}}(\varvec{{\theta }})\) the negative log–likelihood function, \(\widehat{\varvec{{\theta }}} \in \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})\) and \(\widehat{\varvec{{\theta }}}_0 \in \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _0} \widehat{{\mathcal {L}}}(\varvec{{\theta }})\), i.e.,  the maximum likelihood solutions in \(\varTheta\) and \(\varTheta _0\) respectively. Moreover, we define the expectation of the negative log–likelihood under the true parameter: \(\ell (\varvec{{\theta }}) = \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{X_i \sim p(\cdot |{\varvec{{\theta }}^\text {Ag}})} [{\widehat{\ell }}(\varvec{{\theta }})]\). As the maximization is carried out employing the same dataset \({\mathcal {D}}\) and recalling that \(\varTheta _0 \subset \varTheta\), we have that \(\varLambda \in [0,1]\). It is usually convenient to consider the logarithm of the GLR statistic: \(\lambda = -2 \log \varLambda = 2 ({\widehat{\ell }}(\widehat{\varvec{{\theta }}}_0) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}}) )\). Therefore, \({\mathcal {H}}_0\) is rejected for large values of \(\lambda\), i.e.,  when the maximum likelihood parameter searched in the restricted set \(\varTheta _0\) significantly underfits the data \({\mathcal {D}}\), compared to \(\varTheta\). Wilks’ theorem provides the asymptotic distribution of \(\lambda\) when \({\mathcal {H}}_0\) is true (Wilks, 1938; Casella and Berger, 2002).

Theorem 1

(Casella and Berger (2002), Theorem 10.3.3) Let \(d = \mathrm {dim}(\varTheta )\) and \(d_0 = \mathrm {dim}(\varTheta _0) < d\). Under suitable regularity conditions (see Casella and Berger (2002), Section 10.6.2), if \({\mathcal {H}}_0\) is true, then when \(n \rightarrow +\infty\), the distribution of \(\lambda\) tends to a \(\chi ^2\) distribution with \(d-d_0\) degrees of freedom.
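To make the GLR statistic and its asymptotic threshold concrete, the sketch below works through a deliberately simple case that is an assumption of this example and not taken from the paper: i.i.d. samples from \({\mathcal {N}}(\theta, 1)\) with \(\varTheta = {\mathbb {R}}\) and \(\varTheta _0 = \{0\}\), so that \(d - d_0 = 1\). In this case the statistic reduces to \(\lambda = n {\bar{x}}^2\), and the \(\chi ^2\) quantile with one degree of freedom can be obtained from the standard normal quantile.

```python
import math
from statistics import NormalDist


def neg_log_likelihood(theta, data):
    """Negative log-likelihood of i.i.d. N(theta, 1) samples."""
    n = len(data)
    return 0.5 * sum((x - theta) ** 2 for x in data) + 0.5 * n * math.log(2 * math.pi)


def glr_statistic(data):
    """lambda = 2 * (nll(theta_hat_0) - nll(theta_hat)) for H0: theta = 0."""
    theta_hat = sum(data) / len(data)  # unrestricted maximum likelihood estimate
    theta_hat_0 = 0.0                  # the only point of Theta_0
    return 2.0 * (neg_log_likelihood(theta_hat_0, data)
                  - neg_log_likelihood(theta_hat, data))


def chi2_quantile_1dof(p):
    """p-quantile of a chi-square with 1 degree of freedom.

    If Z is standard normal, Z^2 is chi-square(1), hence the quantile is the
    square of the (1 + p) / 2 quantile of Z.
    """
    return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2


def reject_h0(data, alpha=0.05):
    """Reject H0 when lambda exceeds the asymptotic (Wilks) threshold."""
    return glr_statistic(data) > chi2_quantile_1dof(1.0 - alpha)
```

For more than one degree of freedom a general \(\chi ^2\) quantile routine would be needed; the one-dimensional shortcut above is used only to keep the example dependency-free.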

The significance of a test \(\alpha \in [0,1]\), or type I error probability, is the probability to reject \({\mathcal {H}}_0\) when \({\mathcal {H}}_0\) is true, while the power of a test \(1-\beta \in [0,1]\) is the probability to reject \({\mathcal {H}}_0\) when \({\mathcal {H}}_0\) is false, where \(\beta\) is the type II error probability.

3 Policy space identification in a fixed environment

As we introduced in Sect. 1, we aim at identifying the agent’s policy space by observing a set of demonstrations coming from the optimal policy of the agent. We assume that the agent is playing a policy \(\pi ^{\text {Ag}}\) belonging to a parametric policy space \(\varPi _{\varTheta }\).

Assumption 1

(Parametric Agent’s Policy) The agent’s policy \(\pi ^{\text {Ag}}\) belongs to a known parametric policy space \(\varPi _{\varTheta }\), i.e.,  there exists a (maybe not unique) \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\) such that \(\pi _{\varvec{{\theta }}^{\text {Ag}}}(\cdot |s) = \pi ^{\text {Ag}}(\cdot |s)\) almost surely for all \(s \in {\mathcal {S}}\).

It is important to stress that \(\pi ^{\text {Ag}}\) is one of the possibly many optimal policies within the policy space \(\varPi _{\varTheta }\), which, in turn, might be unable to represent the optimal Markovian stationary policy. Furthermore, we do not explicitly report the dependence on the agent’s parameter \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\) as, in the general case, there might exist multiple parameters yielding the same policy \(\pi ^{\text {Ag}}\).

We have access to a dataset \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\) where \(s_i \sim \nu\) and \(a_i \sim \pi ^{\text {Ag}}(\cdot |s_i)\) are sampled independently,\(^{4}\) and \(\nu\) is a sampling distribution over the states. Although we will present the method for a generic \(\nu \in {\mathscr {P}}({\mathcal {S}})\), in practice, we employ as \(\nu\) the \(\gamma\)–discounted stationary distribution induced by \(\pi ^{\text {Ag}}\), i.e.,  \(d_{\mu }^{\pi ^{\text {Ag}}}(s) = (1-\gamma ) \sum _{t=0}^{+\infty } \gamma ^t \Pr (s_t = s | {\mathcal {M}}, \pi ^{\text {Ag}})\) (Sutton et al., 1999). We assume that the agent has control over a limited number of parameters \(d^{\text {Ag}}< d\) whose value can be changed during learning, while the remaining \(d-d^{\text {Ag}}\) are kept fixed to zero.\(^{5}\) Given a set of indexes \(I \subseteq \{1,...,d\}\) we define the subset of the parameter space: \(\varTheta _I = \left\{ \varvec{{\theta }} \in \varTheta : \theta _i = 0,\, \forall i \in \{1,...,d\}\setminus I \right\}\). Thus, the set \(I\) represents the indexes of the parameters that can be changed if the agent’s parameter space were \(\varTheta _I\). Our goal is to find a set of parameter indexes \(I^{\text {Ag}}\) that are sufficient to explain the agent’s policy, i.e.,  \(\pi ^{\text {Ag}}\in \varPi _{\varTheta _{I^{\text {Ag}}}}\) but also necessary, in the sense that when removing any \(i \in I^{\text {Ag}}\) the remaining ones are insufficient to explain the agent’s policy, i.e.,  \(\pi ^{\text {Ag}}\notin \varPi _{\varTheta _{I^{\text {Ag}}\setminus \{i\}}}\). We formalize these notions in the following definition.

Definition 2

(Correctness) Let \(\pi ^{\text {Ag}}\in \varPi _{\varTheta }\). A set of parameter indexes \(I^{\text {Ag}}\subseteq \{1,...,d\}\) is correct w.r.t.  \(\pi ^{\text {Ag}}\) if:

$$\begin{aligned} \pi ^{\text {Ag}}\in \varPi _{\varTheta _{I^{\text {Ag}}}} \, \wedge \,\forall i \in {I^{\text {Ag}}} : \pi ^{\text {Ag}}\notin \varPi _{\varTheta _{I^{\text {Ag}}\setminus \{i\}}}. \end{aligned}$$

We denote with \({\mathcal {I}}^{\text {Ag}}\) the set of all correct sets of parameter indexes \(I^{\text {Ag}}\).

Thus, there exist multiple \(I^{\text {Ag}}\) when multiple parametric representations of the agent’s policy \(\pi ^{\text {Ag}}\) are possible. The uniqueness of \(I^{\text {Ag}}\) is guaranteed under the assumption that each policy admits a unique representation in \(\varPi _\varTheta\), i.e.,  under the identifiability assumption.

Assumption 2

(Identifiability) The policy space \(\varPi _{\varTheta }\) is identifiable, i.e.,  for all \(\varvec{{\theta }},\varvec{{\theta }}' \in \varTheta\), we have that if \(\pi _{\varvec{{\theta }}}(\cdot |s) = \pi _{\varvec{{\theta }}'}(\cdot |s) \; \text {almost surely}\) for all \(s \in {\mathcal {S}}\) then \(\varvec{{\theta }} = \varvec{{\theta }}'\).

The identifiability property allows rephrasing Definition 2 in terms of the policy parameters only, leading to the following result.

Lemma 1

(Correctness under Identifiability) Under Assumption 2, let \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\) be the unique parameter such that \(\pi _{\varvec{{\theta }}^{\text {Ag}}}(\cdot |s) = \pi ^{\text {Ag}}(\cdot |s)\) almost surely for all \(s \in {\mathcal {S}}\). Then, there exists a unique set of parameter indexes \(I^{\text {Ag}}\subseteq \{1,...,d\}\) that is correct w.r.t.  \(\pi ^{\text {Ag}}\) defined as:

$$\begin{aligned} I^{\text {Ag}}= \left\{ i \in \{1,...,d\}\,:\, \theta _i^\text {Ag} \ne 0 \right\} . \end{aligned}$$

Consequently, \({\mathcal {I}}^{\text {Ag}}= \{ I^{\text {Ag}}\}\).


Proof

The uniqueness of \(I^\text {Ag}\) is ensured by Assumption 2. Let us rewrite the condition of Definition 2 under Assumption 2:

$$\begin{aligned} \pi ^\text {Ag} \in \varPi _{\varTheta _{I^\text {Ag}}}& \, \wedge \,\forall i \in {I^\text {Ag}} : \pi ^\text {Ag} \notin \varPi _{\varTheta _{I^\text {Ag} \setminus \{i\}}} \nonumber \\&\iff \varvec{{\theta }}^\text {Ag} \in \varTheta _{I^\text {Ag}} \, \wedge \,\forall i \in {I^\text {Ag}} : \varvec{{\theta }}^\text {Ag} \notin \varTheta _{I^\text {Ag} \setminus \{i\}} \end{aligned}$$
$$\begin{aligned}&\iff \forall i \in I^\text {Ag}: \theta ^\text {Ag}_i \ne 0 \, \wedge \,\forall i \in \{1,...,d\}\setminus I^\text {Ag} : \theta _i^\text {Ag} =0 \nonumber \\&\iff I^\text {Ag} = \left\{ i \in \{1,...,d\}\,:\, \theta _i^\text {Ag} \ne 0 \right\}, \end{aligned}$$

where line (P.1) follows since there is a unique representation for \(\pi ^\text {Ag}\) determined by parameter \(\varvec{{\theta }}^\text {Ag}\) and line (P.2) is obtained from the definition of \(\varTheta _I\). \(\square\)

Remark 1

(About the Optimality of \(\pi ^\text {Ag}\)) We started this section stating that \(\pi ^{\text {Ag}}\) is an optimal policy within the policy space \(\varPi _{\varTheta }\). This is motivated by the fact that typically we start with an overparametrized policy space \(\varPi _{\varTheta }\) and we seek the minimal set of parameters that allows the agent to reach an optimal policy within \(\varPi _{\varTheta }\). However, in practice, we usually have access to an \(\epsilon\)-optimal policy \(\pi ^{\text {Ag}}_{\epsilon }\), meaning that the performance of \(\pi ^{\text {Ag}}_{\epsilon }\) is \(\epsilon\)-close to the optimal performance.\(^{6}\) Nevertheless, the notion of correctness (Definition 2) makes no assumptions on the optimality of \(\pi ^\text {Ag}\). If we replace \(\pi ^\text {Ag}\) with \(\pi ^{\text {Ag}}_{\epsilon }\) we will recover a set of parameter indexes \(I^{\text {Ag}}_{\epsilon }\) that is, in general, different from \(I^{\text {Ag}}\), but we can still provide some guarantees. If \(I^{\text {Ag}} \subseteq I^{\text {Ag}}_{\epsilon }\), then \(I^{\text {Ag}}_{\epsilon }\) is sufficient to explain the optimal policy \(\pi ^{\text {Ag}}\), but not necessary in general (it might contain useless parameters for \(\pi ^{\text {Ag}}\)). Instead, if \(I^{\text {Ag}} \not \subseteq I^{\text {Ag}}_{\epsilon }\), then \(I^{\text {Ag}}_{\epsilon }\) is not sufficient to explain the optimal policy \(\pi ^{\text {Ag}}\). In any case, \(I^{\text {Ag}}_{\epsilon }\) is necessary and sufficient to represent, at least, an \(\epsilon\)-optimal policy.

The following two subsections are devoted to the presentation of the identification rules based on the application of Definition 2 (Sect. 3.1) and Lemma 1 (Sect. 3.2) when we only have access to a dataset of samples \({\mathcal {D}}\). The goal of an identification rule consists in producing a set \(\widehat{{\mathcal {I}}}\), approximating \({\mathcal {I}}^{\text {Ag}}\). The idea at the basis of our identification rules consists in employing the GLR test to assess the correctness (Definition 2 or Lemma 1) of a candidate set of indexes.

3.1 Combinatorial identification rule

In principle, using \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\), we could compute the maximum likelihood parameter \(\widehat{\varvec{{\theta }}}\in \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})\) and employ it with Definition 2. However, this approach has, at least, two drawbacks. First, when Assumption 2 is not fulfilled, it would produce a single approximate parameter, while multiple choices might be viable. Second, because of the estimation errors, we would hardly get a zero value for the parameters the agent might not control. For these reasons, we employ a GLR test to assess whether a specific set of parameters is zero. Specifically, for all \(I \subseteq \{1,...,d\}\) we consider the pair of hypotheses \({\mathcal {H}}_{0,I} \,:\, \pi ^{\text {Ag}}\in \varPi _{\varTheta _I}\) against \({\mathcal {H}}_{1,I} \,:\, \pi ^{\text {Ag}}\in \varPi _{\varTheta \setminus \varTheta _I}\) and the GLR statistic:

$$\begin{aligned} \lambda _I = -2 \log \frac{\sup _{\varvec{{\theta }} \in \varTheta _I} \widehat{{\mathcal {L}}}(\varvec{{\theta }}) }{\sup _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }}) } = 2 \left( {\widehat{\ell }}(\widehat{\varvec{{\theta }}}_I) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}}) \right), \end{aligned}$$

where the likelihood is defined as \(\widehat{{\mathcal {L}}}(\varvec{{\theta }}) = \prod _{i=1}^n \pi _{\varvec{{\theta }}}(a_i|s_i)\), \(\widehat{\varvec{{\theta }}}_I \in \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _I} \widehat{{\mathcal {L}}}(\varvec{{\theta }})\) and \(\widehat{\varvec{{\theta }}} \in \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})\). We are now ready to state the identification rule derived from Definition 2.

Identification Rule 1

The combinatorial identification rule with threshold function \(c_l\) selects \(\widehat{{\mathcal {I}}}_c\) containing all and only the sets of parameter indexes \({I} \subseteq \{1,...,d\}\) such that:

$$\begin{aligned} \lambda _{{I}} \le c_{|{I}|} \wedge \,\forall i \in {I} : \lambda _{{I} \setminus \{i\}}> c_{|{I}|-1}. \end{aligned}$$

Thus, \(I\) is defined in such a way that the null hypothesis \({\mathcal {H}}_{0,{I}}\) is not rejected, i.e.,  \(I\) contains parameters that are sufficient to explain the data \({\mathcal {D}}\), and necessary since for all \(i \in {I}\) the set \({I} \setminus \{i\}\) is no longer sufficient, as \({\mathcal {H}}_{0,{I} \setminus \{i\}}\) is rejected. The threshold function \(c_l\), which depends on the cardinality \(l\) of the tested set of indexes, controls the behavior of the tests. In practice, we recommend setting it by exploiting Wilks’ asymptotic approximation (Theorem 1) to enforce (asymptotic) guarantees on the type I error. Given a significance level \(\delta \in [0,1]\), since for Identification Rule 1 we perform \(2^d\) statistical tests by using the same dataset \({\mathcal {D}}\), we partition \(\delta\) using Bonferroni correction and set \(c_l = \chi ^2_{l,1-{\delta }/{2^d}}\), where \(\chi ^2_{l,\bullet }\) is the \(\bullet\)–quantile of a chi-square distribution with \(l\) degrees of freedom. Refer to Algorithm 1 for the pseudocode of the identification procedure.

Algorithm 1 (figure a): pseudocode of the combinatorial identification rule
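The condition of Identification Rule 1 can be sketched in Python as a plain enumeration of all subsets. This is an illustrative sketch, not the paper's Algorithm 1: `glr` and `c` are hypothetical callables supplied by the user (the first returns \(\lambda _I\) for a set of indexes, the second the threshold for a given cardinality), indexes are 0-based, and `toy_glr` is a made-up statistic used only to exercise the code.

```python
from itertools import combinations


def combinatorial_rule(d, glr, c):
    """Return every index set I with glr(I) <= c(|I|) and, for all i in I,
    glr(I without i) > c(|I| - 1).

    glr: hypothetical callable mapping a frozenset of indexes to lambda_I.
    c:   hypothetical threshold function of the cardinality l.
    """
    selected = []
    for size in range(d + 1):
        for subset in combinations(range(d), size):
            index_set = frozenset(subset)
            # Sufficiency: H_{0,I} is not rejected.
            sufficient = glr(index_set) <= c(len(index_set))
            # Necessity: removing any single index leads to rejection.
            necessary = all(glr(index_set - {i}) > c(len(index_set) - 1)
                            for i in index_set)
            if sufficient and necessary:
                selected.append(index_set)
    return selected


def toy_glr(index_set):
    """Made-up statistic: large whenever index 0 or index 2 is forced to zero."""
    needed = {0, 2}
    return 100.0 * len(needed - index_set)


print(combinatorial_rule(3, toy_glr, lambda l: 5.0))
```

The enumeration makes the exponential cost explicit: the outer loops visit all \(2^d\) subsets, and each visit evaluates up to \(|I| + 1\) statistics unless these are cached.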

3.2 Simplified identification rule

Identification Rule 1 is hard to employ in practice, as it requires performing \({\mathcal {O}}( 2^d )\) statistical tests. However, under Assumption 2, to retrieve \(I^{\text {Ag}}\) we do not need to test all subsets, but we can just examine one parameter at a time (see Lemma 1). Thus, for all \(i \in \{1,...,d\}\) we consider the pair of hypotheses \({\mathcal {H}}_{0,i} \,:\, {\theta }^{\text {Ag}}_i = 0\) against \({\mathcal {H}}_{1,i} \,:\, {\theta }^{\text {Ag}}_i \ne 0\) and define \(\varTheta _i = \{ \varvec{{\theta }} \in \varTheta \,:\, \theta _i = 0\}\). The GLR test can be performed straightforwardly, using the statistic:

$$\begin{aligned} \lambda _i = -2 \log \frac{\sup _{\varvec{{\theta }} \in \varTheta _i} \widehat{{\mathcal {L}}}(\varvec{{\theta }}) }{\sup _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }}) } = 2 \left( {\widehat{\ell }}(\widehat{\varvec{{\theta }}}_i) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}}) \right), \end{aligned}$$

where the likelihood is defined as \(\widehat{{\mathcal {L}}}(\varvec{{\theta }}) = \prod _{i=1}^n \pi _{\varvec{{\theta }}}(a_i|s_i)\), \(\widehat{\varvec{{\theta }}}_i = \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _i} \widehat{{\mathcal {L}}}(\varvec{{\theta }})\) and \(\widehat{\varvec{{\theta }}} = \mathop {{{\,\mathrm{{arg\,sup}}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } \widehat{{\mathcal {L}}}(\varvec{{\theta }})\).\(^{7}\) In the spirit of Lemma 1, we define the following identification rule.

Identification Rule 2

The simplified identification rule with threshold function \(c_1\) selects \(\widehat{{\mathcal {I}}}_c\) containing the unique set of parameter indexes \({\widehat{I}}_{c}\) such that:

$$\begin{aligned} {\widehat{I}}_{c} = \left\{ i \in \{1,...,d\}: \lambda _i > c_1 \right\} . \end{aligned}$$

Therefore, the identification rule constructs \({\widehat{I}}_c\) by taking all the indexes \(i \in \{1,...,d\}\) such that the corresponding null hypothesis \({\mathcal {H}}_{0,i} \,:\, {\theta }^{\text {Ag}}_i = 0\) is rejected, i.e.,  those for which there is statistical evidence that their value is not zero. Similarly to the combinatorial identification rule, we recommend setting the threshold function \(c_1\) based on Wilks’ approximation. Given a significance level \(\delta \in [0,1]\), since we perform \(d\) statistical tests, we employ Bonferroni correction and we set \(c_1 = \chi ^2_{1,1-{\delta }/{d}}\). Refer to Algorithm 2 for the pseudocode of the identification rule.

Algorithm 2 (figure b): pseudocode of the simplified identification rule
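Identification Rule 2 with the Bonferroni-corrected threshold \(c_1 = \chi ^2_{1,1-{\delta }/{d}}\) can be sketched as follows. This is an illustrative sketch rather than the paper's Algorithm 2: it assumes the statistics \(\lambda _i\) have already been computed and are passed in as a list, it uses 0-based indexes, and it obtains the one-degree-of-freedom \(\chi ^2\) quantile from the standard normal quantile.

```python
from statistics import NormalDist


def bonferroni_threshold(d, delta):
    """c_1: chi-square (1 degree of freedom) quantile at level 1 - delta / d."""
    p = 1.0 - delta / d
    # If Z is standard normal, Z^2 is chi-square(1).
    return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2


def simplified_rule(lambdas, delta=0.05):
    """Return the indexes whose null hypothesis theta_i = 0 is rejected.

    lambdas: precomputed GLR statistics lambda_i, one per parameter.
    """
    c1 = bonferroni_threshold(len(lambdas), delta)
    return {i for i, lam in enumerate(lambdas) if lam > c1}


# Made-up statistics for d = 4 parameters.
print(simplified_rule([25.0, 0.3, 9.1, 1.2]))
```

With \(d = 4\) and \(\delta = 0.05\) the threshold is roughly 6.24, so only the first and third parameters are selected in this made-up example.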

This second procedure requires a test for every parameter, i.e.,  \({\mathcal {O}}(d)\) instead of \({\mathcal {O}}(2^d)\) tests. However, the correctness of Identification Rule 2, in the sense of Definition 2, comes with the cost of assuming the identifiability property (Assumption 2). What happens if we employ this second procedure in a case where the assumption does not hold? Consider, for instance, the case in which two parameters \(\theta _1\) and \(\theta _2\) are exchangeable: we will include neither of them in \({\widehat{I}}_{c}\) as, individually, they are not necessary to explain the agent’s policy, while the pair \((\theta _1,\theta _2)^T\) is indeed necessary. We will discuss how to enforce Assumption 2, for the case of policies belonging to the exponential family, in the following section.
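The failure mode with exchangeable parameters can be reproduced numerically. The sketch below assumes a hypothetical one-dimensional model, not taken from the paper, in which the action is \(a \sim {\mathcal {N}}((\theta _1 + \theta _2) x, 1)\), so the two parameters multiply the same feature and only their sum matters. Forcing either parameter alone to zero leaves the maximum likelihood unchanged, while forcing both to zero does not.

```python
def fitted_nll(xs, ys, slope):
    """Negative log-likelihood (up to an additive constant) of y ~ N(slope * x, 1)."""
    return 0.5 * sum((y - slope * x) ** 2 for x, y in zip(xs, ys))


def glr_for_free_set(xs, ys, free):
    """lambda_I for the model y ~ N((theta_1 + theta_2) * x, 1).

    free: set of parameter indexes (1 and/or 2) that are NOT forced to zero.
    """
    # The unrestricted optimum only pins down the sum theta_1 + theta_2.
    best_slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    # With at least one free parameter the sum can still reach best_slope.
    restricted_slope = best_slope if free else 0.0
    return 2.0 * (fitted_nll(xs, ys, restricted_slope)
                  - fitted_nll(xs, ys, best_slope))


# Made-up data with a clear linear trend.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
print(glr_for_free_set(xs, ys, {2}))    # theta_1 forced to zero
print(glr_for_free_set(xs, ys, {1}))    # theta_2 forced to zero
print(glr_for_free_set(xs, ys, set()))  # both forced to zero
```

The two single-parameter statistics are exactly zero, so neither null hypothesis is rejected, whereas the statistic for removing both parameters is large.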

Remark 2

(On Frequentist and Bayesian Statistical Tests) In this paper, we restrict our attention to frequentist statistical tests, but, in principle, the same approaches can be extended to the Bayesian setting (Jeffreys, 1935). Indeed, the GLR test admits a Bayesian counterpart, known as the Bayes Factor (BF, Goodman, 1999; Morey et al., 2016). We consider the same setting presented in Sect. 2 in which we aim at testing the null hypothesis \({\mathcal {H}}_0 : \varvec{{\theta }}^\text {Ag} \in \varTheta _0\), against the alternative \({\mathcal {H}}_1 : \varvec{{\theta }}^\text {Ag} \in \varTheta \setminus \varTheta _0\). We take the Bayesian perspective, looking at each \(\varvec{{\theta }}\) not as an unknown fixed quantity but as a realization of prior distributions on the parameters defined in terms of the hypothesis: \(p(\varvec{{\theta }} | {\mathcal {H}}_{\star })\) for \(\star \in \{0,1\}\). Thus, given a dataset \({\mathcal {D}} = \left\{ X_i \right\} _{i=1}^n\), we can compute the likelihood of \({\mathcal {D}}\) given a parameter \(\varvec{{\theta }}\) as usual: \(p({\mathcal {D}}|\varvec{{\theta }}) = \prod _{i=1}^n p(X_i|\varvec{{\theta }})\). Combining the likelihood and the prior, we define the Bayes Factor as:
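In the notation above, and assuming the convention that places the alternative hypothesis in the numerator (the reverse ratio is equally common in the literature), the Bayes Factor takes the standard form of a ratio of marginal likelihoods:

$$\begin{aligned} \text {BF} = \frac{p({\mathcal {D}}|{\mathcal {H}}_1)}{p({\mathcal {D}}|{\mathcal {H}}_0)} = \frac{\int _{\varTheta } p({\mathcal {D}}|\varvec{{\theta }}) \, p(\varvec{{\theta }} | {\mathcal {H}}_1) \, \mathrm {d} \varvec{{\theta }}}{\int _{\varTheta } p({\mathcal {D}}|\varvec{{\theta }}) \, p(\varvec{{\theta }} | {\mathcal {H}}_0) \, \mathrm {d} \varvec{{\theta }}}. \end{aligned}$$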

The Bayesian approach has the clear advantage of incorporating additional domain knowledge by means of the prior. Furthermore, if also a prior on the hypothesis is available \(p({\mathcal {H}}_{\star })\) for \(\star \in \{0,1\}\) it is possible to compute the ratio of the posterior probability of each hypothesis:
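By Bayes’ rule, and assuming the alternative hypothesis is placed in the numerator, the posterior ratio factorizes into the ratio of marginal likelihoods (the Bayes Factor) times the prior ratio:

$$\begin{aligned} \frac{p({\mathcal {H}}_1 | {\mathcal {D}})}{p({\mathcal {H}}_0 | {\mathcal {D}})} = \frac{p({\mathcal {D}} | {\mathcal {H}}_1)}{p({\mathcal {D}} | {\mathcal {H}}_0)} \cdot \frac{p({\mathcal {H}}_1)}{p({\mathcal {H}}_0)}. \end{aligned}$$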

Compared to the GLR test, the Bayes factor provides richer information, since we can compute the likelihood of each hypothesis, given the data \({\mathcal {D}}\). However, like any Bayesian approach, the choice of the prior turns out to be of crucial importance. The computationally convenient prior (which might allow computing the integral in closed form) is typically not correct, leading to a biased test. In this sense, GLR replaces the integral with a single-point approximation centered in the maximum likelihood estimate. For these reasons, we leave the investigation of Bayesian approaches for policy space identification as future work.

4 Analysis for the exponential family

In this section, we provide an analysis of Identification Rule 2 for a policy \(\pi _{\varvec{{\theta }}}\) linear in some state features \(\varvec{{\phi }}\) that belongs to the exponential family.\(^{8}\) The section is organized as follows. We first introduce the exponential family, deriving a concentration result of independent interest (Theorem 2) and then we apply it for controlling the identification errors made by our identification rule (Theorem 3).

Exponential Family We refer to the definition of linear exponential family given by Brown (1986), which we state as an assumption.

Assumption 3

(Exponential Family of Linear Policies) Let \(\varvec{{\phi }}: {\mathcal {S}} \rightarrow {\mathbb {R}}^q\) be a feature function. The policy space \(\varPi _{\varTheta }\) is a space of linear policies, belonging to the exponential family, i.e.,  \(\varTheta = {\mathbb {R}}^d\) and all policies \(\pi _{\varvec{{\theta }}} \in \varPi _{\varTheta }\) have the form:

$$\begin{aligned} \pi _{\varvec{{\theta }}}(a|s) = h(a) \exp \left\{ \varvec{{\theta }}^T \varvec{{t}}\left( s,a \right) - A(\varvec{{\theta }},s) \right\}, \end{aligned}$$

where h is a positive function, \(\varvec{{t}}\left( s,a\right)\) is the sufficient statistic that depends on the state via the feature function \(\varvec{{\phi }}\) (i.e.,  \(\varvec{{t}}\left( s,a\right) =\varvec{{t}}(\varvec{{\phi }}(s),a)\)) and \(A(\varvec{{\theta }},s) = \log \int _{{\mathcal {A}}} h(a) \exp \{ \varvec{{\theta }}^T \varvec{{t}}(s,a) \}\mathrm {d} a\) is the log partition function. We denote with \(\varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }}) = \varvec{{t}}(s,a) - \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{{\overline{a}} \sim \pi _{\varvec{{\theta }}} (\cdot |s)} \left[ \varvec{{t}}(s,{\overline{a}}) \right]\) the centered sufficient statistic.

This definition allows modeling the linear policies that are a popular choice in linear time-invariant systems and a valid option for robotic control (Deisenroth et al., 2013), sometimes even competitive with complex neural network parametrizations (Rajeswaran et al., 2017). Table 1 shows how to map the Gaussian linear policy with fixed covariance, typically used in continuous action spaces, and the Boltzmann linear policy, suitable for finite action spaces, to Assumption 3 (details in Appendix A.1).
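As a minimal sketch of this mapping, consider a Gaussian linear policy over a scalar action with mean \(\varvec{{\theta }}^T \varvec{{\phi }}(s)\) and a fixed standard deviation. The function names, the scalar-action restriction, and the numeric values below are illustrative assumptions and are not taken from Table 1. Expanding the square in the Gaussian density yields a sufficient statistic proportional to \(a \, \varvec{{\phi }}(s)\) and a log partition function quadratic in the mean; the code checks numerically that the exponential-family form reproduces the usual density.

```python
import math


def gaussian_density(a, theta, phi, std):
    """Usual N(a; theta^T phi, std^2) density for a scalar action a."""
    mean = sum(t * f for t, f in zip(theta, phi))
    return math.exp(-0.5 * ((a - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def exp_family_density(a, theta, phi, std):
    """Same density written as h(a) * exp(theta^T t(s, a) - A(theta, s))."""
    h = math.exp(-0.5 * (a / std) ** 2) / (std * math.sqrt(2 * math.pi))
    t = [a * f / std ** 2 for f in phi]               # sufficient statistic t(s, a)
    mean = sum(th * f for th, f in zip(theta, phi))
    log_partition = mean ** 2 / (2 * std ** 2)         # A(theta, s)
    inner = sum(th * ti for th, ti in zip(theta, t))   # theta^T t(s, a)
    return h * math.exp(inner - log_partition)


# Hypothetical parameter, feature vector, and standard deviation.
theta, phi, std = [0.5, -1.0], [1.0, 2.0], 0.7
for a in (-2.0, 0.0, 1.3):
    assert math.isclose(gaussian_density(a, theta, phi, std),
                        exp_family_density(a, theta, phi, std), rel_tol=1e-9)
```

The two functions agree because the term quadratic in the action is absorbed into h, while the cross term and the squared mean give the sufficient statistic and the log partition function, respectively.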

Table 1 Action space \({\mathcal {A}}\), probability density function \(\pi _{\widetilde{\varvec{{\theta }}}}\), sufficient statistic \(\varvec{{t}}\), and function h for the Gaussian linear policy with fixed covariance and the Boltzmann linear policy

For the sake of the analysis, we enforce the following assumption concerning the tail behavior of the policy \(\pi _{\varvec{{\theta }}}\).

Assumption 4

(Subgaussianity) For any \(\varvec{{\theta }} \in \varTheta\) and for any \(s \in {\mathcal {S}}\) the centered sufficient statistic \(\varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }})\) is subgaussian with parameter \(\sigma \ge 0\), i.e.,  for any \(\varvec{{\alpha }} \in {\mathbb {R}}^d\):

$$\begin{aligned} \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{a \sim \pi _{\varvec{{\theta }}} (\cdot |s)} \left[ \exp \left\{ \varvec{{\alpha }}^T \varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }}) \right\} \right] \le \exp \left\{ \frac{1}{2} \left\| \varvec{{\alpha }} \right\| _2^2 \sigma ^2 \right\} . \end{aligned}$$

A sufficient condition to ensure that the Gaussian and Boltzmann linear policies are subgaussian is that the features \(\varvec{{\phi }}(s)\) are bounded in \(L_2\)-norm, uniformly over the state space \({\mathcal {S}}\) (Proposition 2). Furthermore, for the policies complying with Assumption 3, identifiability (Assumption 2) can be restated in terms of the Fisher Information matrix (Rothenberg et al., 1971; Little et al., 2010).

Lemma 2

(Rothenberg et al., (1971), Theorem 3) Let \(\varPi _\varTheta\) be a policy space, as in Assumption 3. Then, under suitable regularity conditions (see Rothenberg et al., (1971)), if the Fisher Information matrix (FIM) \({\mathcal {F}}(\varvec{{\theta }})\):

$$\begin{aligned} {\mathcal {F}}(\varvec{{\theta }}) = \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{\begin{array}{c} s \sim \nu \\ a \sim \pi _{\varvec{{\theta }}}(\cdot |s) \end{array}} \left[ \varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }})\varvec{{{\overline{t}}}}(s,a,\varvec{{\theta }})^T \right] \end{aligned}$$

is non–singular for all \(\varvec{{\theta }} \in \varTheta\), then \(\varPi _\varTheta\) is identifiable. In this case, we denote with \(\lambda _{\min } = \inf _{\varvec{{\theta }} \in \varTheta } \lambda _{\min } \left( {\mathcal {F}}(\varvec{{\theta }}) \right) > 0\).

Proposition 1 of Appendix A.2.1 shows that a sufficient condition for the identifiability in the case of Gaussian and Boltzmann linear policies is that the second moment matrix of the feature vector \(\mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{s \sim \nu } \left[ \varvec{{\phi }}(s)\varvec{{\phi }}(s)^T \right]\) is non–singular along with the fact that the policy \(\pi _{\varvec{{\theta }}}\) plays each action with positive probability for the Boltzmann policy.

Remark 3

(How to enforce identifiability?) Requiring that \(\mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{s \sim \nu } \left[ \varvec{{\phi }}(s)\varvec{{\phi }}(s)^T \right]\) is full rank is essentially equivalent to requiring that the features \(\phi _i\), for \(i \in \{1,...,d\}\), are linearly independent. This condition can be easily met with a preprocessing phase that removes the linearly dependent features, for instance, by employing Principal Component Analysis (PCA, Jolliffe, 2011). For this reason, in our experimental evaluation we will always consider the case of linearly independent features.

When working with samples, however, the FIM has to be estimated, leading to the empirical FIM, in which the expectation over the states of Eq. (8) is replaced with the sample mean:

$$\begin{aligned} \widehat{{\mathcal {F}}}(\varvec{{\theta }}) = \frac{1}{n} \sum _{i=1}^n \mathop {{{\,{{\mathbb { E}}}\,}}}\limits _{a \sim \pi _{\varvec{{\theta }}}(\cdot |s)} \left[ \varvec{{{\overline{t}}}}(s_i,a,\varvec{{\theta }})\varvec{{{\overline{t}}}}(s_i,a,\varvec{{\theta }})^T\right], \end{aligned}$$

where \(\{s_i\}_{i=1}^n \sim \nu\). We denote with \({\widehat{\lambda }}_{\min } = \inf _{\varvec{{\theta }} \in \varTheta }\lambda _{\min }(\widehat{{\mathcal {F}}}(\varvec{{\theta }}))\) the minimum eigenvalue of the empirical FIM. In order to carry out the subsequent analysis, we need to require that this quantity is non-zero.
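The empirical FIM and its minimum eigenvalue can be sketched for a Boltzmann linear policy as follows. This is an illustration under stated assumptions: the parametrization (one free weight vector per action except the last, whose logit is fixed to zero), the function names, and the random features are hypothetical, and the eigenvalue is evaluated at a single parameter value rather than as an infimum over \(\varTheta\). With sufficient statistic given by the action indicator times the features, the inner expectation over actions equals the Kronecker product of the covariance of the action indicator and \(\varvec{{\phi }}(s_i)\varvec{{\phi }}(s_i)^T\).

```python
import numpy as np


def boltzmann_probs(theta, phi):
    """Action probabilities; theta has shape (k-1, q), last action has logit 0 (assumed)."""
    logits = np.append(theta @ phi, 0.0)
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()


def empirical_fim(theta, features):
    """Sample mean over states of the exact per-state covariance of the sufficient statistic."""
    k1, q = theta.shape
    fim = np.zeros((k1 * q, k1 * q))
    for phi in features:
        p = boltzmann_probs(theta, phi)[:k1]
        cov_a = np.diag(p) - np.outer(p, p)     # covariance of the action indicator
        fim += np.kron(cov_a, np.outer(phi, phi))
    return fim / len(features)


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))            # hypothetical state features phi(s_i)
theta = rng.normal(size=(2, 3))                 # 3 actions, 2 free weight vectors
lam_min = np.linalg.eigvalsh(empirical_fim(theta, features)).min()
print("minimum eigenvalue at this theta:", lam_min)
```

Duplicating a feature column makes the second moment matrix of the features singular, and the minimum eigenvalue then collapses to (numerically) zero, which is the situation Assumption 5 excludes.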

Assumption 5

(Positive Eigenvalues of Empirical FIM) The minimum eigenvalue of the empirical FIM \(\widehat{{\mathcal {F}}}(\varvec{{\theta }})\) is non-zero for all \(\varvec{{\theta }} \in \varTheta\), i.e.,  \({\widehat{\lambda }}_{\min } = \inf _{\varvec{{\theta }} \in \varTheta }\lambda _{\min }(\widehat{{\mathcal {F}}}(\varvec{{\theta }})) > 0\).

The condition of Assumption 5 can be enforced as long as the true FIM \({{\mathcal {F}}}(\varvec{{\theta }})\) has a positive minimum eigenvalue \(\lambda _{\min }\), i.e.,  under identifiability assumption (Lemma 2) and given a sufficiently large number of samples. Proposition 4 of Appendix A.2.1 provides the minimum number of samples such that with high probability it holds that \({\widehat{\lambda }}_{\min } > 0\).

We are now ready to present a concentration result, of independent interest, for the parameters and the negative log–likelihood that represents the central tool of our analysis.

Theorem 2

Under Assumptions 1, 2, 3, 4, and 5, let \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\) be a dataset of \(n>0\) independent samples, where \(s_i \sim \nu\) and \(a_i \sim \pi _{\varvec{{\theta }}^{\text {Ag}}}(\cdot |s_i)\). Let \(\widehat{\varvec{{\theta }}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } {\widehat{\ell }}(\varvec{{\theta }})\) and \(\varvec{{\theta }}^{\text {Ag}}= \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } {\ell }(\varvec{{\theta }})\). Then, for any \(\delta \in [0,1]\), with probability at least \(1-\delta\) it holds that:

$$\begin{aligned} \left\| \widehat{ \varvec{{\theta }}} - \varvec{{\theta }}^{\text {Ag}}\right\| _2 \le \frac{\sigma }{{\widehat{\lambda }}_{\min }} \sqrt{\frac{2d}{n} \log \frac{2d}{\delta }}. \end{aligned}$$

Furthermore, each of the following inequalities holds individually with probability at least \(1-\delta\):

$$\begin{aligned} \ell (\widehat{\varvec{{\theta }}}) - \ell (\varvec{{\theta }}^{\text {Ag}}) \le \frac{d^2\sigma ^4}{{\widehat{\lambda }}_{\min }^2 n} \log \frac{2d}{\delta } \quad \text {and} \quad {\widehat{\ell }}(\varvec{{\theta }}^{\text {Ag}}) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}}) \le \frac{ d^2\sigma ^4}{{\widehat{\lambda }}_{\min }^2 n} \log \frac{2d}{\delta }. \end{aligned}$$

Proof sketch

The idea of the proof is to first obtain a probabilistic bound on the parameter difference in norm \(\left\| \widehat{ \varvec{{\theta }}} - \varvec{{\theta }}^\text {Ag} \right\| _2\). This result is given in Theorem 6. Then, we use the latter result together with Taylor expansion to bound the differences \(\ell (\widehat{\varvec{{\theta }}}) - \ell ({\varvec{{\theta }}}^\text {Ag})\) and \({\widehat{\ell }}({\varvec{{\theta }}}^\text {Ag}) - {\widehat{\ell }}(\widehat{\varvec{{\theta }}})\), as in Corollary 1. The full derivation can be found in Appendix A.2.3.

The theorem shows that the \(L_2\)–norm of the difference between the maximum likelihood parameter \(\widehat{\varvec{{\theta }}}\) and the true parameter \(\varvec{{\theta }}^{\text {Ag}}\) concentrates with rate \({\mathcal {O}}(n^{-1/2})\) while the likelihood \({\widehat{\ell }}\) and its expectation \(\ell\) concentrate with faster rate \({\mathcal {O}}(n^{-1})\).
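The two rates can be made concrete by evaluating the right-hand sides of Theorem 2 for increasing n. The sketch below only transcribes the bounds; the function names and the numeric values of d, \(\sigma\), \({\widehat{\lambda }}_{\min }\), and \(\delta\) are hypothetical.

```python
import math


def parameter_error_bound(n, d, sigma, lambda_min_hat, delta):
    """Right-hand side of the bound on the L2 distance between the two parameters."""
    return sigma / lambda_min_hat * math.sqrt(2 * d / n * math.log(2 * d / delta))


def likelihood_gap_bound(n, d, sigma, lambda_min_hat, delta):
    """Right-hand side shared by the two log-likelihood gap bounds."""
    return d ** 2 * sigma ** 4 / (lambda_min_hat ** 2 * n) * math.log(2 * d / delta)


# Hypothetical problem constants.
d, sigma, lambda_min_hat, delta = 4, 1.0, 0.5, 0.05
for n in (1_000, 4_000, 16_000):
    print(n,
          parameter_error_bound(n, d, sigma, lambda_min_hat, delta),
          likelihood_gap_bound(n, d, sigma, lambda_min_hat, delta))
```

Multiplying n by four halves the parameter bound and divides the likelihood bound by four, matching the \(n^{-1/2}\) and \(n^{-1}\) rates.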

Identification Rule Analysis We are now ready to start the analysis of Identification Rule 2. The goal of the analysis is, informally, to bound the probability of an identification error as a function of the number of samples n and the threshold function \(c_1\). For this purpose, we define the following quantities.

Definition 3

Consider an identification rule producing \({\widehat{I}}\) as approximate parameter index set. We define the significance \(\alpha\) and the power \(1-\beta\) of the identification rule as:

$$\begin{aligned} \alpha = \Pr \left( \exists i \notin I^{\text {Ag}}: i \in {\widehat{I}} \right), \quad \beta = \Pr \left( \exists i \in I^{\text {Ag}}: i \notin {\widehat{I}} \right) . \end{aligned}$$

Thus, \(\alpha\) represents the probability that the identification rule selects a parameter that the agent does not control, whereas \(\beta\) is the probability that the identification rule does not select a parameter that the agent does control.Footnote 9

By employing the results we derived for the exponential family (Theorem 2) we can now bound \(\alpha\) and \(\beta\), under a slightly more demanding assumption on \({\widehat{\lambda }}_{\min }\).

Theorem 3

Let \({\widehat{I}}_{c}\) be the set of parameter indexes selected by the Identification Rule 2 obtained using \(n>0\) i.i.d. samples collected with \(\pi _{\varvec{{\theta }}^{\text {Ag}}}\), with \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\). Then, under Assumptions 1, 2, 3, 4, and 5, let \({\varvec{{\theta }}}_i^{\text {Ag}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _i} \ell (\varvec{{\theta }})\) for all \(i \in \{1,...,d\}\) and \(\xi = \min \left\{ 1, \frac{\lambda _{\min }}{\sigma ^2} \right\}\). If \({\widehat{\lambda }}_{\min } \ge \frac{\lambda _{\min }}{2\sqrt{2}}\) and \(\ell ({\varvec{{\theta }}}_i^{\text {Ag}}) - \ell (\varvec{{\theta }}^{\text {Ag}}) \ge c_1\), it holds that:

$$\begin{aligned}&\alpha\, \le \, 2d \exp \left\{ -\frac{c_1 {\lambda }_{\min }^2 n}{16d^2 \sigma ^4} \right\} \\&\beta \, \le \, (2d-1) \sum _{i \in I^{\text {Ag}}} \exp \left\{ - \frac{ \left( \ell ({\varvec{{\theta }}}_i^{\text {Ag}}) - \ell ({\varvec{{\theta }}^{\text {Ag}}}) - c_1 \right) {\lambda }_{\min } \xi n}{16(d-1)^2 \sigma ^2 } \right\} . \end{aligned}$$

Proof sketch

Concerning \(\alpha = \Pr \left( \exists i \notin I^{\text {Ag}}: i \in {\widehat{I}}_c \right)\), we employ a technique similar to that of Lemma 2 in (Garivier and Kaufmann, 2019) to remove the existential quantification. Instead, for \(\beta = \Pr \left( \exists i \in I^{\text {Ag}}: i \notin {\widehat{I}}_c \right)\) we first perform a union bound over \(i \in I^{\text {Ag}}\) and then we bound the individual \(\Pr \left( i \notin {\widehat{I}}_c \right)\). The full derivation can be found in Appendix A.3. \(\square\)

In principle, we could employ Theorem 3 to derive proper values of \(c_1\) and n, given required values of \(\alpha\) and \(\beta\). Unfortunately, their expressions depend on \({\lambda }_{\min }\), which is unknown in practice. As already mentioned in the previous sections, we recommend employing Wilks’ asymptotic approximation to set the threshold function as \(c_1= \chi ^2_{1,1-{\delta }/{d}}\). This choice allows an asymptotic control of the significance of the identification rule.
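The threshold \(\chi ^2_{1,1-{\delta }/{d}}\) can be computed without a statistics library, since a \(\chi ^2\) variable with 1 degree of freedom is the square of a standard normal variable, so its p-quantile is the square of the normal quantile at \((1+p)/2\). The sketch below uses this identity; the function names and the example values of \(\delta\) and d are hypothetical.

```python
from statistics import NormalDist


def chi2_1_quantile(p):
    """p-quantile of a chi-square distribution with 1 degree of freedom."""
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)   # quantile of |Z| for Z ~ N(0, 1)
    return z * z


def wilks_threshold(delta, d):
    """Threshold chi^2_{1, 1 - delta/d}: level delta split evenly over d parameters."""
    return chi2_1_quantile(1.0 - delta / d)


# Hypothetical significance level and numbers of parameters.
for d in (1, 5, 20):
    print(d, wilks_threshold(0.05, d))
```

The threshold grows with d because each individual test is run at the smaller level \(\delta /d\).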

Theorem 4

Let \({\widehat{I}}_{c}\) be the set of parameter indexes selected by the Identification Rule 2 obtained using \(n>0\) i.i.d. samples collected with \(\pi _{\varvec{{\theta }}^{\text {Ag}}}\), with \(\varvec{{\theta }}^{\text {Ag}}\in \varTheta\). Then, under suitable regularity conditions (see Casella and Berger, (2002) Section 10.6.2), if \(c_1 = \chi ^2_{1,1-{\delta }/{d}}\) it holds that \(\alpha \le \delta\) when \(n\rightarrow +\infty\).


Proof

Starting from the definition of \(\alpha\), we first perform a union bound over \(i \notin I^{\text {Ag}}\) to remove the existential quantification:

$$\begin{aligned} \alpha = \Pr \left( \exists i \notin I^{\text {Ag}}: i \in {\widehat{I}}_c \right) = \Pr \left( \bigvee _{i \notin I^{\text {Ag}}} i \in {\widehat{I}}_c \right) \le \sum _{i \notin I^{\text {Ag}}} \Pr \left( i \in {\widehat{I}}_c \right) . \end{aligned}$$

Now, we bound each \(\Pr \left( i \in {\widehat{I}}_c \right)\) individually, recalling that \(\lambda _i\) is distributed asymptotically as a \(\chi ^2\) distribution with 1 degree of freedom and that \(c_1 = \chi^2 _{1,1-\delta /d}\):

$$\begin{aligned} \Pr \left( i \in {\widehat{I}}_c \right) = \Pr \left( \lambda _i > \chi^2 _{1,1-\delta /d} \right) \rightarrow \frac{\delta }{d}, \quad n \rightarrow \infty . \end{aligned}$$

Thus, we have that when \(n \rightarrow +\infty\):

$$\begin{aligned} \alpha \le \frac{d - d^{\text {Ag}}}{d} \delta \le \delta . \end{aligned}$$


5 Policy space identification in a configurable environment

The identification rules presented so far are unable to distinguish between a parameter set to zero because the agent cannot control it and one set to zero because zero is its optimal value. To overcome this issue, we employ the Conf–MDP properties to select a configuration in which the parameters we want to examine have an optimal value other than zero. Intuitively, if we want to test whether the agent can control parameter \(\theta _i\), we should place the agent in an environment \(\varvec{{\omega }}_i \in \varOmega\) where \(\theta _i\) is “maximally important” for the optimal policy. This intuition is justified by Theorem 3, since to maximize the power of the test (\(1-\beta\)), all other things being equal, we should maximize the log–likelihood gap \(\ell ({\varvec{{\theta }}_i^\text {Ag}}) - \ell ({\varvec{{\theta }}^\text {Ag}})\), i.e.,  parameter \(\theta _i\) should be essential to justify the agent’s behavior. Let \(I \subseteq \{1,...,d\}\) be a set of parameter indexes we want to test. Our ideal goal is to find the environment \(\varvec{{\omega }}_I\) such that:

$$\begin{aligned} \varvec{{\omega }}_I \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\varvec{{\omega }} \in \varOmega } \left\{ \ell ({\varvec{{\theta }}_I^\text {Ag}}(\varvec{{\omega }})) - \ell ({\varvec{{\theta }}^\text {Ag}}(\varvec{{\omega }})) \right\}, \end{aligned}$$

where \({\varvec{{\theta }}^\text {Ag}}(\varvec{{\omega }}) \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\varvec{{\theta }} \in \varTheta } J_{{\mathcal {M}}_{\varvec{{\omega }}}}(\varvec{{\theta }})\) and \({\varvec{{\theta }}}_I^\text {Ag}(\varvec{{\omega }}) \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\varvec{{\theta }} \in \varTheta _I} J_{{\mathcal {M}}_{\varvec{{\omega }}}}(\varvec{{\theta }})\) are the parameters of the optimal policies in the environment \({\mathcal {M}}_{\varvec{{\omega }}}\) considering \(\varPi _{\varTheta }\) and \(\varPi _{\varTheta _I}\) as policy spaces respectively. Clearly, given the samples \({\mathcal {D}}\) collected with a single optimal policy \(\pi ^\text {Ag}(\varvec{{\omega }}_0)\) in a single environment \({\mathcal {M}}_{\varvec{{\omega }}_0}\), solving problem (10) is hard as it requires performing an off–distribution optimization both on the space of policy parameters and configurations. For these reasons, we consider a surrogate objective that assumes that the optimal parameter in the new configuration can be reached by performing a single gradient step.Footnote 10

Theorem 5

Let \(I \subseteq \{1,...,d\}\) and \({\overline{I}} =\{1,...,d\}\setminus I\). For a vector \(\varvec{{v}} \in {\mathbb {R}}^d\), we denote with \(\varvec{{v}} \vert _I\) the vector obtained by setting to zero the components in I. Let \(\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0) \in \varTheta\) be the initial parameter. Let \(\alpha \ge 0\) be a learning rate, \(\varvec{{\theta }}_I^\text {Ag} (\varvec{{\omega }}) = \varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0) + \alpha \nabla _{\varvec{{\theta }}} J_{{\mathcal {M}}_{\varvec{{\omega }}}} (\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0)) \vert _I\) and \(\varvec{{\theta }}^\text {Ag} (\varvec{{\omega }}) = \varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0) + \alpha \nabla _{\varvec{{\theta }}} J_{{\mathcal {M}}_{\varvec{{\omega }}}} (\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0))\). Then, under Assumption 2, we have:

$$\begin{aligned} {\ell }({\varvec{{\theta }}_I^\text {Ag}}(\varvec{{\omega }})) - {\ell }({\varvec{{\theta }}^\text {Ag}}(\varvec{{\omega }})) \ge \frac{\lambda _{\min } \alpha ^2}{2} \left\| \nabla _{\varvec{{\theta }}} J_{{\mathcal {M}}_{\varvec{{\omega }}}} (\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0)) \vert _{{\overline{I}}} \right\| _2^2. \end{aligned}$$


Proof

By second-order Taylor expansion of \(\ell\) and recalling that \(\nabla _{\varvec{{\theta }}} {\ell }({\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }})}) = \varvec{{0}}\), we have:

$$\begin{aligned}&{\ell }({\varvec{{\theta }}_I^\text {Ag}}(\varvec{{\omega }})) - {\ell }({\varvec{{\theta }}^\text {Ag}}(\varvec{{\omega }})) \ge \frac{\lambda _{\min }}{2} \left\| {\varvec{{\theta }}_I^\text {Ag}}(\varvec{{\omega }}) - {\varvec{{\theta }}^\text {Ag}}(\varvec{{\omega }}) \right\| _2^2\\&\quad = \frac{\lambda _{\min }}{2} \left\| \varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0) + \alpha \nabla _{\varvec{{\theta }}} J_{{\mathcal {M}}_{\varvec{{\omega }}}} (\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0)) \vert _I - \varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0) - \alpha \nabla _{\varvec{{\theta }}} J_{{\mathcal {M}}_{\varvec{{\omega }}}} (\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0)) \right\| _2^2\\&\quad = \frac{\lambda _{\min }\alpha ^2}{2} \left\| \nabla _{\varvec{{\theta }}} J_{{\mathcal {M}}_{\varvec{{\omega }}}} (\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0)) \vert _{{\overline{I}}} \right\| _2^2. \end{aligned}$$


Thus, we maximize the \(L_2\)–norm of the gradient components that correspond to the parameters we want to test. Since we have at our disposal only samples \({\mathcal {D}}\) collected with the current policy \(\pi _{\varvec{{\theta }}^\text {Ag}(\varvec{{\omega }}_0)}\) and in the current environment \(\varvec{{\omega }}_0\), we have to perform an off–distribution optimization over \(\varvec{{\omega }}\). To this end, we employ an approach analogous to that of (Metelli et al., 2018b, 2020), where we optimize the empirical version of the objective with a penalization that accounts for the distance between the distributions over trajectories:


where \(\zeta \ge 0\) is a regularization parameter. We assume to have access to a dataset of trajectories \({\mathcal {D}} = \{\tau _i\}_{i=1}^n\) independently collected using policy \(\pi _{\varvec{{\theta }}}\) in the environment \({\mathcal {M}}_{\varvec{{\omega }}_0}\). Each trajectory is a sequence of triples \(\{(s_{i,t},a_{i,t},r_{i,t})\}_{t=1}^T\), where T is the trajectory horizon. The expression of the gradient estimator is given by:

The expression is obtained starting from the well–known G(PO)MDP gradient estimator and adapting it for off–distribution estimation by introducing the importance weight (Metelli et al., 2018b). The dissimilarity penalization term, which corresponds to the estimated 2–Rényi divergence (Rényi, 1961), is obtained from the following expression, which represents the empirical second moment of the importance weight:

$$\begin{aligned} {\widehat{d}}_2 (\varvec{{\omega }} \Vert \varvec{{\omega }}_0) = \frac{1}{n} \sum _{i=1}^n \left( \frac{\mu _{\varvec{{\omega }}}(s_{i,0})}{\mu _{\varvec{{\omega }}_0}(s_{i,0})} \prod _{t=1}^{T} \frac{p_{\varvec{{\omega }}}(s_{i,t+1}|s_{i,t},a_{i,t})}{p_{\varvec{{\omega }}_0}(s_{i,t+1}|s_{i,t},a_{i,t})} \right) ^2. \end{aligned}$$
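The empirical second moment above can be sketched as follows, working with log-densities for numerical stability. The function names and the way log-densities are passed in (one value for the initial state and one list of per-step transition values per trajectory) are illustrative assumptions. Only the initial-state and transition densities enter the weight, since the policy is the same in both environments and cancels in the ratio.

```python
import math


def trajectory_log_weight(log_mu_new, log_mu_old, log_p_new, log_p_old):
    """Log importance weight of one trajectory.

    log_mu_*: log-density of the initial state under the new / old configuration.
    log_p_*:  per-step log transition densities under the new / old configuration.
    """
    return (log_mu_new - log_mu_old) + sum(pn - po for pn, po in zip(log_p_new, log_p_old))


def empirical_second_moment(log_weights):
    """Average of the squared importance weights over the n trajectories."""
    return sum(math.exp(2.0 * lw) for lw in log_weights) / len(log_weights)


# Hypothetical log-weights of two trajectories (weights 2.0 and 0.5).
log_weights = [math.log(2.0), math.log(0.5)]
d2_hat = empirical_second_moment(log_weights)   # (4 + 0.25) / 2
```

When the new configuration coincides with the original one, every weight equals 1 and the estimate equals 1, its smallest meaningful value; larger values signal a larger mismatch between the two trajectory distributions.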

Refer to (Metelli et al., 2018b) for the theoretical background behind the choice of this objective function. For conciseness, we report the pseudocode of the identification procedure in a configurable environment for Identification Rule 2 only (Algorithm 3), while the pseudocode for Identification Rule 1 can be found in Appendix B.

figure c

6 Experimental results

In this section, we present the experimental results, focusing on three aspects of policy space identification.

  • In Sect. 6.1, we provide experiments to assess the quality of our identification rules in terms of the ability to correctly identify the parameters controlled by the agent.

  • In Sect. 6.2, we focus on the application of policy space identification to Imitation Learning, comparing our identification rules with commonly employed regularization techniques.

  • In Sect. 6.3, we consider the Conf-MDP framework and we show how properly identifying the parameters controlled by the agent allows learning better (more specific) environment configurations.

Additional experiments together with the hyperparameter values are reported in Appendix C.

6.1 Identification rules experiments

In this section, we provide two experiments to test the ability of our identification rules to properly select the parameters the agent controls in different settings. We start with an experiment on a discrete grid world (Sect. 6.1.1) to highlight the beneficial effects of environment configuration in parameter identification. Then, we provide an experiment on a simulated car driving domain (Sect. 6.1.2) in which we compare the combinatorial and the simplified identification rules.

6.1.1 Discrete grid world

The grid world environment is a simple representation of a two-dimensional world (5\(\times\)5 cells) in which an agent has to reach a target position by moving in the four directions. Whenever an action is performed, there is a small probability of failure (0.1) triggering a random action. The initial position of the agent and the target position are drawn at the beginning of each episode from a Boltzmann distribution \(\mu _{\varvec{{\omega }}}\). The agent plays a Boltzmann linear policy \(\pi _{\varvec{{\theta }}}\) with binary features \(\varvec{{\phi }}\) indicating its current row and column and the row and column of the goal.Footnote 11 For each run, the agent can control a subset \(I^{\text {Ag}}\) of the parameters \(\varvec{{\theta }}_{I^{\text {Ag}}}\) associated with those features, which is randomly selected. Furthermore, the supervisor can configure the environment by changing the parameters \(\varvec{{\omega }}\) of the initial state distribution \(\mu _{\varvec{{\omega }}}\). Thus, the supervisor can induce the agent to explore certain regions of the grid world and, consequently, change the relevance of the corresponding parameters in the optimal policy.

The goal of this set of experiments is to show the advantages of configuring the environment when performing the policy space identification using Identification Rule 2. Figure 2 shows the empirical \({\widehat{\alpha }}\) and \({\widehat{\beta }}\), i.e.,  the fraction of parameters that the agent does not control that are wrongly selected and the fraction of those the agent controls that are not selected, respectively, as a function of the number m of episodes used to perform the identification. We compare two cases: conf, where the identification is carried out by also configuring the environment, i.e.,  optimizing Eq. (11), and no-conf, in which the identification is performed in the original environment only. In both cases, we can see that \({\widehat{\alpha }}\) is almost independent of the number of samples, as it is directly controlled by the threshold function \(c_1\). In contrast, \({\widehat{\beta }}\) decreases as the number of samples increases, i.e.,  the power of the test \(1-{\widehat{\beta }}\) increases with m. Remarkably, we observe that configuring the environment gives a significant advantage in understanding the parameters controlled by the agent w.r.t.  using a fixed environment, as \({\widehat{\beta }}\) decreases faster in the conf case. This phenomenon also empirically justifies our choice of objective (Eq. (11)) for selecting the new environment. Hyperparameters, further experimental results, together with experiments on a continuous version of the grid world, are reported in Appendices C.1.1 and C.1.2.
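For a single run, the two empirical error fractions described above can be computed from the true and the selected index sets as in the following sketch. The function name, the zero-based indexing, and the convention of returning 0 when a fraction has an empty denominator are assumptions made for illustration.

```python
def empirical_errors(controlled, selected, d):
    """Return (alpha_hat, beta_hat) for one run.

    controlled: set of parameter indexes the agent actually controls.
    selected:   set of parameter indexes returned by the identification rule.
    d:          total number of parameters (indexes 0, ..., d-1).
    """
    not_controlled = set(range(d)) - controlled
    # Fraction of non-controlled parameters that were wrongly selected.
    alpha_hat = len(selected & not_controlled) / len(not_controlled) if not_controlled else 0.0
    # Fraction of controlled parameters that were missed.
    beta_hat = len(controlled - selected) / len(controlled) if controlled else 0.0
    return alpha_hat, beta_hat


# Hypothetical run: the agent controls {0, 1, 2}, the rule selects {0, 1, 4}.
alpha_hat, beta_hat = empirical_errors({0, 1, 2}, {0, 1, 4}, d=6)
```

In this toy run one of the three non-controlled parameters is selected and one of the three controlled parameters is missed; averaging such pairs over runs gives curves of the kind reported in Figure 2.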

Fig. 2
figure 2

Discrete Grid World: \({\widehat{\alpha }}\) and \({\widehat{\beta }}\) error for conf and no-conf cases varying the number of episodes. 25 runs, 95% c.i.

Fig. 3
figure 3

Simulated Car Driving: fraction of correct identifications varying the number of episodes. 100 runs, 95% c.i.

6.1.2 Simulated car driving

We consider a simple version of a car driving simulator, in which the agent has to reach the end of a road in the minimum amount of time, avoiding running off-road. The agent perceives its speed and the readings of four sensors, placed at different angles, that provide the distance from the edge of the road, and it can act on acceleration and steering.

The purpose of this experiment is to show a case in which the identifiability assumption (Assumption 2) may not be satisfied. The policy \(\pi _{\varvec{{\theta }}}\) is modeled as a Gaussian policy whose mean is computed via a single hidden layer neural network with 8 neurons. Some of the sensors are not available to the agent, and our goal is to identify which ones the agent can perceive.

In Fig. 3, we compare the performance of the Identification Rules 1 (Combinatorial) and 2 (Simplified), showing the fraction of runs that correctly identify the policy space. We note that, while for a small number of samples the simplified rule seems to outperform the combinatorial one, when the number of samples increases the combinatorial rule displays remarkable stability, approaching the correct identification in all the runs. This is explained by the fact that, when multiple representations for the same policy are possible (as in this case, where the policy is a neural network), considering one parameter at a time might induce the simplified rule to select a wrong set of parameters. Hyperparameters are reported in Appendix C.1.3.

Fig. 4
figure 4

Discrete Grid World: Norm of the difference between the expert’s parameter \(\varvec{{\theta }}^{\text {Ag}}\) and the estimated parameter \(\widehat{\varvec{{\theta }}}\) (left) and expected KL-divergence between the expert’s policy \(\pi _{\varvec{{\theta }}^{\text {Ag}}}\) and the estimated policy \(\pi _{\widehat{\varvec{{\theta }}}}\) (right) as a function of the number of collected episodes m. 25 runs, 95% c.i.

6.2 Application to imitation learning

IL aims at recovering a policy replicating the behavior of an expert’s agent. Selecting the parameters that an agent can control can be interpreted as applying a form of regularization to the IL problem (Osa et al., 2018). In the IL literature, a widely used technique is based on entropy regularization (Neu et al., 2017), which was employed in several successful algorithms, such as Maximum Causal Entropy IRL methods (MCE, Ziebart et al., 2008, 2010), and Generative Adversarial IL (Ho and Ermon, 2016). Alternatively, other approaches aim at enforcing a sparsity constraint on the recovered policy parameters (e.g.,  Lee et al., 2018; Reddy et al., 2019; Brantley et al., 2020).

The goal of this experiment consists in showing that if we have appropriately identified the expert’s policy space, we can mitigate overfitting/underfitting phenomena, with a general benefit on the process of learning the imitating policy. This experiment is conducted in the grid world domain, introduced in Sect. 6.1.1, using the same setting. In each run, the expert agent plays a (near) optimal Boltzmann policy \(\pi _{\varvec{{\theta }}^{\text {Ag}}}\) that makes use of a subset of the available parameters and provides a dataset \({\mathcal {D}} = \{(s_i,a_i)\}_{i=1}^n\) of n samples coming from m episodes.

In the IL framework, knowing the policy space of the expert agent means properly tailoring the hypothesis space in which we search for the imitation policy. For this reason, we propose a comparison with common regularization techniques applied to maximum likelihood estimation. Figure 4 shows on the left the norm of the parameter difference \(\left\| \widehat{\varvec{{\theta }}} - \varvec{{\theta }}^{\text {Ag}} \right\| _{2}\) between the parameter \(\widehat{\varvec{{\theta }}}\) recovered by the different IL methods and the true parameter \(\varvec{{\theta }}^{\text {Ag}}\) employed by the expert, whereas on the right we plot the estimated expected KL-divergence between the imitation policy and the expert’s policy computed as:

$$\begin{aligned} \widehat{{\mathbb {D}}}_{\text {KL}} \left( \pi _{\varvec{{\theta }}^{\text {Ag}}} \Vert \pi _{\widehat{\varvec{{\theta }}}} \right) = \frac{1}{n} \sum _{i=1}^n D_{\text {KL}} \left( \pi _{\varvec{{\theta }}^{\text {Ag}}}(\cdot |s_i) \Vert \pi _{\widehat{\varvec{{\theta }}}}(\cdot |s_i) \right) . \end{aligned}$$
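For a finite action space, as with the Boltzmann policies of the grid world, this estimate reduces to averaging categorical KL-divergences over the visited states. The sketch below assumes that each policy is available as a function mapping a state to a vector of action probabilities, with strictly positive probabilities for the imitation policy; the function names and the toy policies are hypothetical.

```python
import math


def kl_categorical(p, q):
    """KL(p || q) between two categorical distributions (q must be positive where p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def estimated_expected_kl(expert_policy, imitation_policy, states):
    """Sample mean over the visited states of KL(expert(.|s) || imitation(.|s))."""
    return sum(kl_categorical(expert_policy(s), imitation_policy(s)) for s in states) / len(states)


# Toy policies over two actions and two states (illustrative only).
def expert(s):
    return [0.5, 0.5]


def imitation(s):
    return [0.25, 0.75] if s == 0 else [0.5, 0.5]


kl_hat = estimated_expected_kl(expert, imitation, states=[0, 1])
```

The estimate is zero exactly when the two policies agree on every visited state, regardless of how their parameters differ, which is one way to read the difference between the two panels of Figure 4.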

The lines Conf and No-conf refer to the results of ML estimation obtained by restricting the policy space to the parameters identified by our simplified rule with and without employing environment configurability, respectively (precisely as in Sect. 6.1.1). ML, Ridge, and Lasso correspond to maximum likelihood estimation in the full parameter space. Specifically, they are obtained by minimizing the objective:
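A penalized negative log–likelihood consistent with the two regularization coefficients \(\lambda ^{\text {R}}\) and \(\lambda ^{\text {L}}\) can be written as follows. The \(1/n\) normalization and the use of the squared \(L_2\)–norm for the Ridge term are assumptions made here for illustration:

$$\begin{aligned} \min _{\varvec{{\theta }} \in \varTheta } \; -\frac{1}{n} \sum _{i=1}^n \log \pi _{\varvec{{\theta }}}(a_i|s_i) + \lambda ^{\text {R}} \left\| \varvec{{\theta }} \right\| _2^2 + \lambda ^{\text {L}} \left\| \varvec{{\theta }} \right\| _1. \end{aligned}$$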

For ML we perform no regularization (\(\lambda ^{\text {R}}=\lambda ^{\text {L}}=0\)), for Ridge we set \(\lambda ^{\text {R}}=0.001\) and \(\lambda ^{\text {L}}=0\), and for Lasso we have \(\lambda ^{\text {R}}=0\) and \(\lambda ^{\text {L}}=0.001\).

We observe that Conf, i.e.,  the usage of our identification rule together with environment configuration, outperforms the other methods. This is more evident in the expected KL-divergence plot (right), which is a more robust index compared to the norm of the parameter difference (left). Ridge and Lasso regularizations display good behavior, better than both the identification rule without configuration (No-conf) and the plain maximum likelihood without regularization (ML). This illustrates two important points. First, it confirms the benefits of configuring the environment for policy space identification. Second, it shows that a proper selection of the parameters controlled by the agent allows improving over standard ML, which tends to overfit.Footnote 12 We tested additional values of the regularization hyperparameters \(\lambda ^{\text {R}}\) and \(\lambda ^{\text {L}}\) and other regularization techniques (Shannon and Tsallis entropy). The complete results are reported in Appendix C.2.

It is worth noting that the specific IL setting we consider, i.e.,  the availability of an initial dataset \({\mathcal {D}}\) of expert’s demonstrations with no further interaction allowed,Footnote 13 rules out from the comparison a large body of the literature that requires the possibility to interact with the expert or with the environment (e.g.,  Ho and Ermon, 2016; Lee et al., 2018). Nevertheless, these IL algorithms could be, in principle, adapted to this challenging no-interaction setting at the cost of resorting to off-policy estimation techniques (Owen, 2013), which, however, might inject further uncertainty into the learning process (see Appendix C.2 for details).

6.3 Application to configurable MDPs

The knowledge of the agent’s policy space could be relevant when the learning process involves the presence of an external supervisor, as in the case of Configurable Markov Decision Processes (Metelli et al., 2018a, 2019). In a Conf-MDP, the supervisor is in charge of selecting the best configuration for the agent, i.e.,  the one that allows the agent to achieve the highest performance possible. As intuition suggests, the best environment configuration is closely related to the agent’s capabilities. Agents with different perception and actuation possibilities might benefit from different configurations. Thus, the external supervisor should be aware of the agent’s policy space to select the most appropriate configuration for the specific agent.

In the Minigolf environment (Lazaric et al., 2007), an agent hits a ball using a putter with the goal of reaching the hole in the minimum number of attempts. Overshooting the hole causes the termination of the episode and a large penalty. The agent selects the force applied to the putter by playing a Gaussian policy linear in some polynomial features (complying with Lemma 2) of the distance from the hole (x) and the friction of the green (f). When an action is performed, Gaussian noise is added whose magnitude depends on the green friction and on the action itself.

The goal of this experiment is to highlight that knowing the policy space is beneficial when learning in a Conf–MDP. We consider two agents with different perception capabilities: \({\mathscr {A}}_1\) has access to both x and f, whereas \({\mathscr {A}}_2\) knows only x. Thus, we expect that \({\mathscr {A}}_1\) learns a policy that allows reaching the hole in a smaller number of hits, compared to \({\mathscr {A}}_2\), as it can calibrate the force according to the friction, whereas \({\mathscr {A}}_2\) has to be more conservative, being unaware of f. There is also a supervisor in charge of selecting, for the two agents, the best putter length \(\omega\), i.e., the configurable parameter of the environment.
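The difference between the two policy spaces can be illustrated with a minimal sketch. The feature map below (`features`), the parameter masks, and all numerical values are hypothetical choices made only for illustration; they are not the polynomial features or parameters used in the experiment. The sketch assumes that an agent unable to perceive f is modeled by fixing to zero the parameters of every feature that involves f.

```python
import numpy as np

def features(x, f):
    """Hypothetical polynomial features of distance x and friction f."""
    return np.array([1.0, x, f, x * f])

# Masks over the parameter vector: 1 = controlled by the agent, 0 = fixed to zero.
# A_1 perceives both x and f; A_2 perceives only x (assumed encoding).
MASK_A1 = np.array([1.0, 1.0, 1.0, 1.0])
MASK_A2 = np.array([1.0, 1.0, 0.0, 0.0])

def policy_mean(theta, mask, x, f):
    """Mean of the Gaussian policy, linear in the features."""
    return float((theta * mask) @ features(x, f))

def sample_force(theta, mask, x, f, sigma, rng):
    """Draw the force applied to the putter from the Gaussian policy."""
    return rng.normal(policy_mean(theta, mask, x, f), sigma)

theta = np.array([0.5, 1.0, -2.0, 0.3])  # arbitrary illustrative parameters
rng = np.random.default_rng(0)
force = sample_force(theta, MASK_A2, x=3.0, f=0.1, sigma=0.2, rng=rng)
```

With `MASK_A2`, the policy mean does not change when f varies, which is the sense in which \({\mathscr {A}}_2\) cannot calibrate the force according to the friction.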

Figure 5-left shows the performance of the optimal policy as a function of the putter length \(\omega\). We can see that for agent \({\mathscr {A}}_1\) the optimal putter length is \(\omega ^\text {Ag}_{{\mathscr {A}}_1}=5\), while for agent \({\mathscr {A}}_2\) it is \(\omega ^\text {Ag}_{{\mathscr {A}}_2}=11.5\). Figure 5-right compares the performance of the optimal policy of agent \({\mathscr {A}}_2\) when the putter length \(\omega\) is chosen by the supervisor using four different strategies. In (i), the configuration is sampled uniformly in the interval [1, 15]. In (ii), the supervisor employs the optimal configuration for agent \({\mathscr {A}}_1\) (\(\omega =5\)), i.e., it assumes that the agent is aware of the friction. In (iii), the supervisor selects the optimal configuration for the policy space produced by our Identification Rule 2. Finally, in (iv), the supervisor employs an oracle that knows the agent’s true policy space (\(\omega =11.5\)). We can see that the performance of the identification procedure (iii) is comparable with that of the oracle (iv) and notably higher than the performance obtained when employing an incorrect policy space (ii). Hyperparameters and additional experiments are reported in Appendix C.3.
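The supervisor's choice in strategies (ii)–(iv) can be sketched as a search over putter lengths for the policy space the supervisor believes the agent has. In the sketch below, `toy_performance` is a synthetic stand-in for the performance of the optimal policy: only the peak locations (5 and 11.5) are taken from the text, whereas the quadratic shape, the grid, and the function names are assumptions introduced for illustration and do not reproduce the curves in Figure 5.

```python
import numpy as np

def toy_performance(omega, perceives_friction):
    """Synthetic performance curve (invented shape; peaks taken from the text)."""
    peak = 5.0 if perceives_friction else 11.5
    return -(omega - peak) ** 2

def select_configuration(performance, believed_policy_space, grid):
    """Pick the putter length maximizing performance for the believed policy space."""
    scores = [performance(w, believed_policy_space) for w in grid]
    return float(grid[int(np.argmax(scores))])

grid = np.linspace(1.0, 15.0, 29)  # candidate lengths in [1, 15], step 0.5

# Strategy (ii): the supervisor wrongly believes the agent perceives friction.
omega_wrong = select_configuration(toy_performance, True, grid)
# Strategies (iii)/(iv): the believed policy space matches the true one (A_2).
omega_right = select_configuration(toy_performance, False, grid)

# Performance actually experienced by A_2 under each choice.
loss_wrong = toy_performance(omega_wrong, False)
loss_right = toy_performance(omega_right, False)
```

Under this toy model, a configuration tuned for the wrong policy space is strictly worse for \({\mathscr {A}}_2\) than one tuned for the correct space, which mirrors the qualitative gap between (ii) and (iii)/(iv).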

Fig. 5

Minigolf: Performance of the optimal policy varying the putter length \(\omega\) for agents \({\mathscr {A}}_1\) and \({\mathscr {A}}_2\) (left) and performance of the optimal policy for agent \({\mathscr {A}}_2\) with four different strategies for selecting \(\omega\) (right). 100 runs, 95% c.i.

7 Conclusions

In this paper, we addressed the problem of identifying the policy space available to an agent in a learning process by simply observing its behavior when playing the optimal policy within such a space. We introduced two identification rules, both based on the GLR test, which can be applied to select the parameters controlled by the agent. Additionally, we have shown how to use the configurability property of the environment to improve the effectiveness of identification rules. The experimental evaluation highlights some essential points. First, the identification of the policy space brings advantages to the learning process in a Conf–MDP, helping the supervisor to choose the most suitable environment configuration. Second, we have shown that configuring the environment is beneficial for speeding up the identification process. Additionally, we have verified that policy space identification can improve imitation learning. Future research might investigate the usage of Bayesian statistical tests and the application of policy space identification to multi-agent RL (Busoniu et al., 2008). We believe that an agent in a multi-agent system might benefit from the knowledge of the policy space of its adversaries to understand what their action possibilities are and make decisions accordingly.