Appendix 1: Consistency proof
In this appendix we show that, assuming we use a consistent estimator of the behavior policy, the SEC estimator and RIS estimators are consistent estimators of \(\bar{\phi }\) and \(\bar{\chi }\) respectively.
Assumption 3
(\(\text{Consistent estimation of }\hat{\pi }\))
$$\begin{aligned} \hat{\pi } :=\underset{\pi \in \varPi }{\text {argmax}} \sum _{j=1}^k \log \pi (A_j | S_j) \xrightarrow {a.s.} {\pi _b}\end{aligned}$$
where \(\xrightarrow {a.s.}\) denotes almost sure convergence.
Proposition 1
Under Assumption 3, the SEC estimator is a consistent estimator of \(\bar{\phi }\):
$$\begin{aligned} {\text {SEC}}(D) \xrightarrow {a.s.} \bar{\phi }. \end{aligned}$$
Proof
We have assumed that as the amount of data increases, the behavior policy estimated by SEC will almost surely converge to the true behavior policy:
$$\begin{aligned} \hat{\pi } \xrightarrow {a.s.} {\pi _b}. \end{aligned}$$
Almost sure convergence to the true behavior policy means that SEC almost surely converges to the Monte Carlo estimate. Consider the difference, \({\text {SEC}}({{D}}) - {\text {MC}}({{D}})\). Since \(\hat{\pi } \xrightarrow {a.s.} {\pi _b}\), we have that:
$$\begin{aligned} {\text {SEC}}({{D}}) - {\text {MC}}({{D}}) \xrightarrow {a.s.} 0. \end{aligned}$$
Thus, with probability 1, SEC and Monte Carlo converge to the same value. Since the Monte Carlo estimator is a consistent estimator of \(\bar{\phi }\), with probability 1 \({\text {MC}}({{D}})\) converges to \(\bar{\phi }\). Thus \({\text {SEC}}({{D}}) \xrightarrow {a.s.} \bar{\phi }\).
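The following is a minimal numerical sketch of this convergence (it is not part of the formal argument). The two-state, two-action problem, the particular \(\phi\) values, and the use of empirical conditional action frequencies as \(\hat{\pi }\) are illustrative assumptions; the point is only that the gap between SEC and MC shrinks as the data size k grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy problem: 2 states, 2 actions, deterministic phi(s, a).
n_states, n_actions = 2, 2
pi = np.array([[0.7, 0.3],        # pi(a|s): the policy that generates the data
               [0.4, 0.6]])
phi = np.array([[1.0, 0.0],
                [0.5, 2.0]])
d_states = np.array([0.5, 0.5])   # state distribution (not estimated by SEC)

def mc_and_sec(k):
    S = rng.choice(n_states, size=k, p=d_states)
    A = np.array([rng.choice(n_actions, p=pi[s]) for s in S])
    mc = phi[S, A].mean()
    # Maximum-likelihood behavior policy estimate: empirical conditional frequencies.
    counts = np.zeros((n_states, n_actions))
    np.add.at(counts, (S, A), 1.0)
    pi_hat = counts / counts.sum(axis=1, keepdims=True)  # assumes every state is sampled
    sec = np.mean(pi[S, A] / pi_hat[S, A] * phi[S, A])
    return mc, sec

for k in [100, 1_000, 10_000, 100_000]:
    mc, sec = mc_and_sec(k)
    print(f"k={k:>6}  MC={mc:.4f}  SEC={sec:.4f}  |SEC-MC|={abs(sec - mc):.5f}")
```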
Similarly, for RIS(n):
Proposition 3
Under Assumption 3, \(\forall n\), \({\text {RIS}}(n)\) is a consistent estimator of \(\bar{\chi }\): \({\text {RIS}}(n)(\pi , D) \xrightarrow {a.s.} \bar{\chi }\).
Proof
The proof is identical to that for Proposition 1 with RIS(n) taking the place of SEC, \(\bar{\chi }\) taking the place of \(\bar{\phi }\), and the off-policy ordinary importance sampling estimator taking the place of the Monte Carlo estimator.
Appendix 2: Consistent behavior policy estimation
The previous section proves the SEC and RIS estimators are consistent as long as they use consistent estimators of the true behavior policy. In this section we give more precise assumptions under which we can prove consistent behavior policy estimation.
The main intuition for the proofs is that SEC and RIS estimators are performing policy search on an estimate of the log-likelihood, \(\widehat{{\mathcal {L}}}(\pi | {{D}})\), as a surrogate objective for the true log-likelihood, \({\mathcal {L}}(\pi )\). Since \({\pi _b}\) generated our data, \({\pi _b}\) is the optimal solution to this policy search (provided \({\pi _b}\in \varPi\)). As long as, for all \(\pi\), \(\widehat{{\mathcal {L}}}(\pi | {{D}})\) is a consistent estimator of \({\mathcal {L}}(\pi )\), selecting \(\hat{\pi } = \displaystyle \text {argmax}_{\pi \in \varPi } \widehat{{\mathcal {L}}}(\pi | {{D}})\) will converge probabilistically to \({\pi _b}\). If the set of policies we search over, \(\varPi\), is countable then this argument is almost enough to show a consistent behavior policy estimator. The difficulty (as we explain below) arises when \(\varPi\) is not countable.
Our proof takes inspiration from Thomas and Brunskill (2016b), who show that their policy search algorithm converges to the optimal policy by maximizing a surrogate estimate of policy value. They show that performing policy search on a policy value estimate, \(\hat{v}(\pi )\), will almost surely return the policy that maximizes \(v(\pi )\) if \(\hat{v}(\pi )\) is a consistent estimator of \(v(\pi )\). The proof is almost identical; the notable difference is substituting the log-likelihood, \({\mathcal {L}}(\pi )\), and a consistent estimator of the log-likelihood, \(\widehat{{\mathcal {L}}}(\pi | {{D}})\), in place of \(v(\pi )\) and \(\hat{v}(\pi )\).
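As a minimal illustration of this surrogate-objective argument (not part of the proof), the following sketch performs the argmax over a hypothetical finite, hence countable, candidate class; the specific policies and probabilities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative single-state example with a finite (hence countable) policy class Pi.
pi_b = np.array([0.5, 0.3, 0.2])        # true behavior policy over 3 actions
Pi = [np.array([0.5, 0.3, 0.2]),        # pi_b itself is in the class
      np.array([0.4, 0.4, 0.2]),
      np.array([1/3, 1/3, 1/3]),
      np.array([0.6, 0.2, 0.2])]

def log_likelihood_hat(p, actions):
    """Surrogate objective: empirical mean log-probability of the observed actions."""
    return np.mean(np.log(p[actions]))

for k in [10, 100, 1_000, 10_000]:
    actions = rng.choice(3, size=k, p=pi_b)
    best = int(np.argmax([log_likelihood_hat(p, actions) for p in Pi]))
    print(f"k={k:>6}  argmax index={best}  (index 0 is pi_b)")
```

For small k the surrogate may select a different candidate; as k grows, the argmax settles on \({\pi _b}\) with probability approaching one.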
Appendix 2.1: Definitions and assumptions
Let \(\mathcal {H}_n\) be the set of all possible state-action trajectory segments with n states and \(n-1\) actions:
$$\begin{aligned} \mathcal {H}_n = \mathcal {S}^n \times \mathcal {A}^{n-1}. \end{aligned}$$
We will denote elements of \(\mathcal {H}_n\) as \(h_n\) and random variables that take values from \(\mathcal {H}_n\) as \(H_n\). Let \({d_{{\pi _b}, \mathcal {H}_n}}: \mathcal {H}_n \rightarrow [0,1]\) be the distribution over elements of \(\mathcal {H}_n\) induced by running \({\pi _b}\). Previously, we defined the behavior policy, \({\pi _b}\), to be a function mapping state-action pairs to probabilities. We re-define \({\pi _b}: \mathcal {H}_n \times \mathcal {A} \rightarrow [0,1]\), i.e., a policy that conditions the distribution over actions on the preceding length-n trajectory segment. These definitions are equivalent provided that, for any \(h_{n,i} = (s_i, a_i, \dots , s_{i+n-1})\) and \(h_{n,j} = (s_j, a_j, \dots , s_{j+n-1})\), if \(s_{i+n-1} = s_{j+n-1}\) then \(\forall a\), \({\pi _b}(a | h_{n,i}) = {\pi _b}(a | h_{n,j})\).
Let \((\Omega , \mathcal {F}, \mu )\) be a probability space and \(D_m: \Omega \rightarrow \mathcal {D}\) be a random variable. \(D_m(\omega )\) is a sample of m trajectories with \(\omega \in \Omega\). Recall that \({d_{{\pi _b}, \mathcal {H}_n}}\) is the distribution of length-n trajectory segments under \({\pi _b}\). Define the expected log-likelihood:
$$\begin{aligned} {\mathcal {L}}(\pi ) = \mathbf {E} \biggl [\log \pi (A | H_n) \biggm | H_n \sim d_{{\pi _b}, \mathcal {H}_n}, A \sim {\pi _b}\biggr ] \end{aligned}$$
and its sample estimate from samples in \(D_m(\omega )\):
$$\begin{aligned} \widehat{{\mathcal {L}}}(\pi | D_m(\omega )) = \frac{1}{m {l}} \sum _{j=1}^m \sum _{t=0}^{{l}-1} \log \pi (A_t^j | H_{t-n, t}^j). \end{aligned}$$
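For concreteness, the following sketch computes \(\widehat{{\mathcal {L}}}(\pi | D_m(\omega ))\) exactly as in the display above. The trajectory encoding and the signature of the policy callable are assumptions made for illustration only.

```python
import math

def log_likelihood_hat(pi, trajectories, n):
    """Empirical log-likelihood of a candidate policy pi over m trajectories.

    trajectories: list of m trajectories, each encoded as
                  [s_0, a_0, s_1, a_1, ..., s_{l-1}, a_{l-1}]  (an assumed encoding).
    pi(a, history): assumed callable returning pi(a | H_{t-n,t}), where history is the
                    segment (s_{t-n}, a_{t-n}, ..., s_t), truncated at the trajectory start.
    """
    m = len(trajectories)
    l = len(trajectories[0]) // 2            # number of time steps per trajectory
    total = 0.0
    for traj in trajectories:
        for t in range(l):
            a_t = traj[2 * t + 1]
            start = max(0, 2 * (t - n))       # keep at most the preceding n state-action pairs
            history = traj[start:2 * t + 1]   # (s_{t-n}, a_{t-n}, ..., s_t)
            total += math.log(pi(a_t, history))
    return total / (m * l)
```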
Note that:
$$\begin{aligned} {\pi _b}= \mathop {\text {argmax}}\limits _{\pi \in \varPi } {\mathcal {L}}(\pi ) \end{aligned}$$
and
$$\begin{aligned} {\pi _{D}}^{(n)} = \mathop {\text {argmax}}\limits _{\pi \in \varPi } \widehat{{\mathcal {L}}}(\pi | D_m(\omega )). \end{aligned}$$
Define the KL-divergence (\({D_\mathtt {KL}}\)) between \({\pi _b}\) and \({\pi _{D}}\) after segment \(h_n\) as:
$$\begin{aligned} \delta _\mathtt {KL}(h_n) = {D_\mathtt {KL}}({\pi _b}(\cdot | h_n), {\pi _{D}}(\cdot | h_n)). \end{aligned}$$
Assuming that, for all \(h_n\) and a, the variance of \(\log \pi (a | h_n)\) is bounded, \(\widehat{{\mathcal {L}}}(\pi | D_m(\omega ))\) is a consistent estimator of \({\mathcal {L}}(\pi )\). We make this assumption explicit:
Assumption 8
(Consistent estimation of the log-likelihood). For all \(\pi \in \varPi\), \(\widehat{{\mathcal {L}}}(\pi | D_m(\omega )) \xrightarrow {a.s.} {\mathcal {L}}(\pi )\).
This assumption will hold when the support of \({\pi _b}\) is a subset of the support of \(\pi\) for all \(\pi \in \varPi\), i.e., no \(\pi \in \varPi\) places zero probability measure on an action that \({\pi _b}\) might take. We can ensure this assumption is satisfied by only considering \(\pi \in \varPi\) that place non-zero probability on any action that \({\pi _b}\) has taken.
We make an additional assumption about the piecewise Lipschitz continuity of the log-likelihood, \({\mathcal {L}}\), and the estimate of the log-likelihood, \(\widehat{{\mathcal {L}}}\). First we present two necessary definitions as given by Thomas and Brunskill (2016b):
Definition 3
(Piecewise Lipschitz continuity). We say that a function \(f: M \rightarrow \mathbb {R}\) on a metric space (M, d) is piecewise Lipschitz continuous with Lipschitz constant K with respect to a countable partition, \(\{M_1,M_2,\dots \}\), if f is Lipschitz continuous with Lipschitz constant K on each metric space in \(\{(M_i, d_i)\}_{i=1}^\infty\).
Definition 4
(\(\delta\)-covering). If (M, d) is a metric space, a set \(X \subset M\) is a \(\delta\)-covering of (M, d) if and only if \(\max _{y \in M} \min _{x \in X} d(x,y) \le \delta\).
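For example, taking \(M = [0, 1]\) with \(d(x, y) = |x - y|\), the finite grid \(X = \{0, \delta , 2\delta , \dots \} \cup \{1\}\) is a \(\delta\)-covering of ([0, 1], d): every point of [0, 1] lies within \(\delta\) of some grid point.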
Assumption 9
(Piecewise Lipschitz objectives). Our policy class, \(\varPi\), is equipped with a metric, \(d_\varPi\), such that for all \(D_m(\omega )\) there exist countable partitions of \(\varPi\), \(\varPi ^{\mathcal {L}}:=\{\varPi ^{\mathcal {L}}_1, \varPi ^{\mathcal {L}}_2, ...\}\) and \(\varPi ^{\widehat{{\mathcal {L}}}} :=\{\varPi ^{\widehat{{\mathcal {L}}}}_1, \varPi ^{\widehat{{\mathcal {L}}}}_2, ...\}\), where \({\mathcal {L}}\) and \(\widehat{{\mathcal {L}}}(\cdot | D_m(\omega ))\) are piecewise Lipschitz continuous with respect to \(\varPi ^{\mathcal {L}}\) and \(\varPi ^{\widehat{{\mathcal {L}}}}\) with Lipschitz constants K and \(\widehat{K}\), respectively. Furthermore, for all \(i \in \mathbb {N}_{>0}\) and all \(\delta > 0\) there exist countable \(\delta\)-covers of \(\varPi ^{\mathcal {L}}_i\) and \(\varPi ^{\widehat{{\mathcal {L}}}}_i\).
As pointed out by Thomas and Brunskill, this assumption holds for the most commonly considered policy classes but is also general enough to hold for other settings (see Thomas and Brunskill 2016b for further discussion of Assumption 9 and the related definitions).
Appendix 2.2: Consistent behavior policy estimation proof
We now show that SEC and RIS estimators use consistent behavior policy estimation by showing that the expected KL-divergence between the true behavior policy and the estimated behavior policy almost surely goes to zero.
Lemma 1
If Assumptions 8 and 9 hold then \(\mathbf {E}[\delta _\mathtt {KL}(H_n) | H_n \sim {d_{{\pi _b}, \mathcal {H}_n}}] \xrightarrow {a.s.} 0\).
Proof
Define \(\varDelta (\pi , \omega ) = |\widehat{{\mathcal {L}}}(\pi | D_m(\omega )) - {\mathcal {L}}(\pi )|\). From Assumption 8 and one definition of almost sure convergence, for all \(\pi \in \varPi\) and for all \(\epsilon > 0\):
$$\begin{aligned} \Pr \left( \liminf _{m\rightarrow \infty } \{ \omega \in \Omega : \varDelta (\pi , \omega ) < \epsilon \}\right) = 1. \end{aligned}$$
(21)
Thomas and Brunskill point out that because \(\varPi\) may not be countable, (21) may not hold at the same time for all \(\pi \in \varPi\). More precisely, it does not immediately follow that for all \(\epsilon >0\):
$$\begin{aligned} \Pr \left( \liminf _{m\rightarrow \infty } \{ \omega \in \Omega : \forall \pi \in \varPi , \varDelta (\pi , \omega ) < \epsilon \}\right) = 1. \end{aligned}$$
(22)
Let \(C(\delta )\) denote the union of all of the policies in the \(\delta\)-covers of the countable partitions of \(\varPi\) assumed to exist by Assumption 9. Since the partitions are countable and the \(\delta\)-covers for each region are assumed to be countable, we have that \(C(\delta )\) is countable for all \(\delta\). Thus, for all \(\pi \in C(\delta )\), (21) holds simultaneously. More precisely, for all \(\delta > 0\) and for all \(\epsilon > 0\):
$$\begin{aligned} \Pr \left( \liminf _{m\rightarrow \infty } \{ \omega \in \Omega : \forall \pi \in C(\delta ), \varDelta (\pi , \omega ) < \epsilon \}\right) = 1. \end{aligned}$$
(23)
Consider a \(\pi \not \in C(\delta )\) and let \(\varPi _i^{\mathcal {L}}\) be a partition element containing \(\pi\). By the definition of a \(\delta\)-cover and Assumption 9, there exists \(\pi ^\prime\) in the \(\delta\)-cover of \(\varPi _i^{\mathcal {L}}\) (and hence in \(C(\delta )\)) with \(d_\varPi (\pi , \pi ^\prime ) \le \delta\). Since Assumption 9 requires \({\mathcal {L}}\) to be Lipschitz continuous on \(\varPi ^{\mathcal {L}}_i\), we have that \(|{\mathcal {L}}(\pi ) - {\mathcal {L}}(\pi ^\prime )| \le K\delta\). Similarly \(|\widehat{{\mathcal {L}}}(\pi | D_m(\omega )) - \widehat{{\mathcal {L}}}(\pi ^\prime | D_m(\omega ))| \le \widehat{K}\delta\). So, \(|\widehat{{\mathcal {L}}}(\pi | D_m(\omega )) - {\mathcal {L}}(\pi )| \le |\widehat{{\mathcal {L}}}(\pi | D_m(\omega )) - {\mathcal {L}}(\pi ^\prime )| + K\delta \le |\widehat{{\mathcal {L}}}(\pi ^\prime | D_m(\omega )) - {\mathcal {L}}(\pi ^\prime )| + (\widehat{K} + K)\delta\). Then it follows that for all \(\delta > 0\):
$$\begin{aligned} \left( \forall \pi \in C(\delta ), \varDelta (\pi , \omega ) < \epsilon \right) \rightarrow \left( \forall \pi \in \varPi , \varDelta (\pi , \omega ) < \epsilon + (K + \widehat{K})\delta \right) . \end{aligned}$$
Substituting this into (23) we have that for all \(\delta > 0\) and for all \(\epsilon > 0\):
$$\begin{aligned} \Pr \left( \liminf _{m\rightarrow \infty } \{\omega \in \Omega : \forall \pi \in \varPi , \varDelta (\pi , \omega ) < \epsilon + (K + \widehat{K}) \delta \} \right) = 1. \end{aligned}$$
The next part of the proof massages (23) into a statement of the same form as (22). Choose \(\delta :=\epsilon / (K + \widehat{K})\), so that \(\epsilon + (K + \widehat{K})\delta = 2\epsilon\), and define \(\epsilon ^\prime :=2 \epsilon\). Then for all \(\epsilon ^\prime > 0\):
$$\begin{aligned} \Pr \left( \liminf _{m\rightarrow \infty } \{\omega \in \Omega : \forall \pi \in \varPi , \varDelta (\pi , \omega ) < \epsilon ^\prime \} \right) = 1. \end{aligned}$$
(24)
For any \(\omega\) in the event in (24), we have \(\forall \pi \in \varPi , \varDelta (\pi , \omega ) < \epsilon ^\prime\), and in particular:
$$\begin{aligned}&\varDelta ({\pi _b}, \omega ) < \epsilon ^\prime \end{aligned}$$
(25)
$$\begin{aligned}&\varDelta ({\pi _{D}}, \omega ) < \epsilon ^\prime \end{aligned}$$
(26)
and then applying the definition of \(\varDelta\):
$$\begin{aligned} {\mathcal {L}}({\pi _{D}}) {\mathop {\le }\limits ^{(a)}}&{\mathcal {L}}({\pi _b}) \end{aligned}$$
(27)
$$\begin{aligned} {\mathop {<}\limits ^{(b)}}&\widehat{{\mathcal {L}}}({\pi _b}| D_m(\omega )) + \epsilon ^\prime \end{aligned}$$
(28)
$$\begin{aligned} {\mathop {\le }\limits ^{(c)}}&\widehat{{\mathcal {L}}}({\pi _{D}}|D_m(\omega )) + \epsilon ^\prime \end{aligned}$$
(29)
$$\begin{aligned} {\mathop {\le }\limits ^{(d)}}&{\mathcal {L}}({\pi _{D}}) + 2 \epsilon ^\prime \end{aligned}$$
(30)
where (a) comes from the fact that \({\pi _b}\) maximizes \({\mathcal {L}}\), (b) comes from (25), (c) comes from the fact that \({\pi _{D}}\) maximizes \(\widehat{{\mathcal {L}}}(\cdot | D_m(\omega ))\), and (d) comes from (26). Considering (27) and (30), it follows that \(| {\mathcal {L}}({\pi _{D}}) - {\mathcal {L}}({\pi _b})| < 2\epsilon ^\prime\). Thus, (24) implies that:
$$\begin{aligned} \forall \epsilon ^\prime > 0, \Pr \left( \liminf _{m\rightarrow \infty } \{\omega \in \Omega : | {\mathcal {L}}({\pi _{D}}) - {\mathcal {L}}({\pi _b})| < 2\epsilon ^\prime \} \right) = 1. \end{aligned}$$
Using \(\epsilon '' :=2\epsilon ^\prime\) we obtain:
$$\begin{aligned} \forall \epsilon '' > 0, \Pr \left( \liminf _{m\rightarrow \infty } \{\omega \in \Omega : | {\mathcal {L}}({\pi _{D}}) - {\mathcal {L}}({\pi _b})| < \epsilon '' \} \right) = 1 \end{aligned}$$
From the definition of the KL-divergence,
$$\begin{aligned} {\mathcal {L}}({\pi _b}) - {\mathcal {L}}({\pi _{D}}) = \mathbf {E}[\delta _\mathtt {KL}(H_n) | H_n \sim {d_{{\pi _b}, \mathcal {H}_n}}] \end{aligned}$$
and we obtain that:
$$\begin{aligned} \forall \epsilon > 0, \Pr \left( \liminf _{m \rightarrow \infty } \{ \omega \in \Omega : | - \mathbf {E}[\delta _\mathtt {KL}(H_n) | H_n \sim {d_{{\pi _b}, \mathcal {H}_n}}] | < \epsilon \} \right) = 1 \end{aligned}$$
And finally, since the KL-Divergence is non-negative:
$$\begin{aligned} \forall \epsilon > 0, \Pr \left( \liminf _{m\rightarrow \infty } \{\omega \in \Omega : \mathbf {E}[\delta _\mathtt {KL}(H_n) | H_n \sim {d_{{\pi _b}, \mathcal {H}_n}}] < \epsilon \} \right) = 1, \end{aligned}$$
which, by the definition of almost sure convergence, means that
$$\begin{aligned} \mathbf {E}[\delta _\mathtt {KL}(H_n) | H_n \sim {d_{{\pi _b}, \mathcal {H}_n}}] \xrightarrow {a.s.} 0. \end{aligned}$$
Appendix 3: Asymptotic variance of RIS and SEC
In this section we prove that the SEC estimator and, \(\forall n\), \({\text {RIS}}(n)\) have asymptotic variance at most that of the Monte Carlo estimator. These results are corollaries of Theorem 1 of Henmi et al. (2007), which holds for general Monte Carlo integration. Consider estimating \(v = \mathbf {E}[f(X) | X \sim p]\) for probability mass function p and real-valued function f with domain \(\mathcal {X}\). Note that while we define distributions as probability mass functions, this result can be applied to continuous-valued state and action spaces by replacing probability mass functions with density functions. Given a parameterized and twice-differentiable probability mass function \(q(\cdot | \tilde{{\varvec{\theta }}})\), the Monte Carlo estimator of v is \(\tilde{v} :=\frac{1}{m} \sum _{i=1}^m \frac{p(X_i)}{q(X_i | \tilde{{\varvec{\theta }}})}f(X_i)\). Similarly, define \(\hat{v} :=\frac{1}{m} \sum _{i=1}^m \frac{p(X_i)}{q(X_i | \hat{{\varvec{\theta }}})}f(X_i)\) where \(\hat{{\varvec{\theta }}}\) is the maximum likelihood estimate of \(\tilde{{\varvec{\theta }}}\) given samples from \(q(\cdot | \tilde{{\varvec{\theta }}})\). The following theorem relates the asymptotic variance of \(\hat{v}\) to that of \(\tilde{v}\).
Theorem 1
$$\begin{aligned} {{\text {Var}}}_\mathtt {A}({\hat{v}}) \le {{\text {Var}}}_\mathtt {A}(\tilde{v}) \end{aligned}$$
where \({{\text {Var}}}_\mathtt {A}\) denotes the asymptotic variance.
Proof
See Theorem 1 of Henmi et al. (2007).
Theorem 1 shows that an importance sampling estimate using the maximum likelihood estimate of the sampling distribution parameters has asymptotic variance no greater than that of an estimate using the true parameters, \(\tilde{{\varvec{\theta }}}\). To specialize this theorem to our setting, we show that the maximum likelihood behavior policy parameters are also the maximum likelihood parameters for the state-action distribution (for SEC) and the trajectory distribution (for RIS methods). We first need to specify the parameterized class of the sampling distribution. For SEC, the sampling distribution is \(\Pr (S=s, A=a; {\varvec{\theta }}) = d_\pi (s) {\pi _{\varvec{\theta }}}(a|s)\). Note that the state distribution \(d_\pi\) is not parameterized by \({\varvec{\theta }}\); only the policy, \({\pi _{\varvec{\theta }}}\), is. This parameterization means that changing \({\varvec{\theta }}\) leaves the distribution of states unchanged and is justified because we are only concerned with weighting already sampled data and not with collecting additional data. For RIS(n), the sampling distribution is \(\Pr (H=h; {\varvec{\theta }}) = p(h) w_{\pi _{\varvec{\theta }}}(h)\) where \(p(h) :=d_0(s_0) \prod _{t=1}^{{l}-1} P(s_t | s_{t-1}, a_{t-1})\) and \(w_{\pi _{\varvec{\theta }}}(h) = \prod _{t=0}^{{l}-1} {\pi _{\varvec{\theta }}}(a_t | s_{t-n}, a_{t-n},\dots ,s_t)\).
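The following is a small numerical illustration of Theorem 1 in the simplest possible instance (it is not a proof, and the specific distributions are assumptions made for illustration): X takes values in \(\{0, 1\}\), the sampling distribution is Bernoulli with true parameter \(\tilde{\theta } = 0.5\), and the importance weights use either \(\tilde{\theta }\) or its maximum likelihood estimate \(\hat{\theta }\). In this saturated two-outcome model, the estimate with \(\hat{\theta }\) is exact whenever both outcomes appear, so its variance is essentially zero, which trivially satisfies the inequality of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative instance: target p, sampling distribution q(.|theta) = Bernoulli(theta).
theta_true = 0.5
p = np.array([0.2, 0.8])          # target probabilities of x = 0, 1
f = np.array([1.0, 3.0])          # f(0), f(1); true value v = p @ f
m, n_reps = 50, 20_000

v_tilde, v_hat = [], []
for _ in range(n_reps):
    x = rng.binomial(1, theta_true, size=m)
    theta_mle = x.mean()
    if theta_mle in (0.0, 1.0):    # skip the (astronomically unlikely) degenerate draw
        continue
    q_true = np.where(x == 1, theta_true, 1.0 - theta_true)
    q_mle = np.where(x == 1, theta_mle, 1.0 - theta_mle)
    v_tilde.append(np.mean(p[x] / q_true * f[x]))   # weights from the true parameter
    v_hat.append(np.mean(p[x] / q_mle * f[x]))      # weights from the MLE parameter

print("true value v        :", p @ f)
print("variance, true theta:", np.var(v_tilde))
print("variance, MLE theta :", np.var(v_hat))
```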
We next present two lemmas that show that maximum likelihood estimation of the behavior policy is equivalent to maximum likelihood estimation of the specified sampling distributions. For SEC, we give the following lemma:
Lemma 2
$$\begin{aligned}&\mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^k \log {\pi _{\varvec{\theta }}}(A_i | S_i) = \mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^k \log \Pr (S_i, A_i ; {\varvec{\theta }}) \end{aligned}$$
Proof
$$\begin{aligned} \mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^k \log {\pi _{\varvec{\theta }}}(A_i | S_i)&= \mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^k \log {\pi _{\varvec{\theta }}}(A_i | S_i) + \underbrace{\log d_\pi (S_i)}_{\text {const. w.r.t. } {\varvec{\theta }}} \\&= \mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^k \log \Pr (S_i, A_i; {\varvec{\theta }}) \end{aligned}$$
And for all RIS(n):
Lemma 3
$$\begin{aligned} \mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^m \sum _{t=0}^{{l}- 1} \log {\pi _{\varvec{\theta }}}(a_t^i | s_{t-n}^i, a_{t-n}^i,\dots ,s_t^i)&=\mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^m \log \Pr (h_i ; {\varvec{\theta }}) \end{aligned}$$
Proof
$$\begin{aligned} \mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^m \sum _{t=0}^{{l}-1} \log {\pi _{\varvec{\theta }}}(a_t^i | s_{t-n}^i, a_{t-n}^i,\dots ,s_t^i)&= \mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^m \sum _{t=0}^{{l}-1} \log {\pi _{\varvec{\theta }}}(a_t^i | s_{t-n}^i, a_{t-n}^i,\dots ,s_t^i) \\&\quad +\, \underbrace{\log d_0(s_0^i) + \sum _{t=1}^{{l}-1} \log P(s_t^i | s_{t-1}^i, a_{t-1}^i)}_{\text {const. w.r.t. } {\varvec{\theta }}} \\&= \mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^m \log w_{\pi _{\varvec{\theta }}}(h_i) + \log p(h_i) \\&= \mathop {\text {argmax}}\limits _{\varvec{\theta }}\sum _{i=1}^m \log \Pr (h_i; {\varvec{\theta }}) \end{aligned}$$
Combining each of these lemmas in turn with Theorem 1 allows us to prove Corollaries 1 and 2 respectively.
Corollary 1
Let \({{\text {Var}}}_\mathtt {A}({\text {EST}})\) denote the asymptotic variance of estimator \({\text {EST}}\). Under Assumptions 4 and 5,
$$\begin{aligned} {{\text {Var}}}_\mathtt {A}({\text {SEC}}) \le {{\text {Var}}}_\mathtt {A}({\text {MC}}). \end{aligned}$$
Proof
Define \(\mathcal {X} :=\mathcal {S} \times \mathcal {A}\), \(f(x) :=\phi (s,a)\), \(p(x) :=\Pr (s,a | \pi )\) and \(q(s,a | {\varvec{\theta }}) :=\Pr (s,a | {\pi _{\varvec{\theta }}})\). Lemma 2 implies that:
$$\begin{aligned} \displaystyle \hat{{\varvec{\theta }}} = \mathop {\text {argmax}}\limits _{{\varvec{\theta }}\in \varPi _{\varvec{\theta }}} \sum _{i=1}^k \log {\pi _{\varvec{\theta }}}(A_i | S_i) \end{aligned}$$
is the maximum likelihood estimate of \(\tilde{{\varvec{\theta }}}\) (where \(\pi _{\tilde{{\varvec{\theta }}}} = \pi\) and \(\Pr (s,a | \tilde{{\varvec{\theta }}})\) is the probability of (s, a) under \(\pi\)) and then Corollary 1 follows directly from Theorem 1.
Corollary 2
Under Assumptions 4 and 5, \(\forall n\),
$$\begin{aligned} {{\text {Var}}}_\mathtt {A}({{\text {RIS}}(n)(\pi , {D})}) \le {{\text {Var}}}_\mathtt {A}({{\text {OIS}}(\pi , {D}, {\pi _b})}) \end{aligned}$$
where \({{\text {Var}}}_\mathtt {A}\) denotes the asymptotic variance.
Proof
Define \(\mathcal {X} :=\mathcal {H}\), \(f(h) :=\chi (h)\), \(p(h) :=\Pr (h | {\pi _e})\) and \(q(h | {\varvec{\theta }}) :=\Pr (h | {\pi _{\varvec{\theta }}})\). Lemma 3 implies that:
$$\begin{aligned} \displaystyle \hat{{\varvec{\theta }}} = \mathop {\text {argmax}}\limits _{{\varvec{\theta }}\in \varPi _{\varvec{\theta }}} \sum _{i=1}^m \sum _{t=0}^{{l}-1} \log {\pi _{\varvec{\theta }}}(a_t^i | s_{t-n}^i, a_{t-n}^i,\dots ,s_t^i) \end{aligned}$$
is the maximum likelihood estimate of \(\tilde{{\varvec{\theta }}}\) (where \(\pi _{\tilde{{\varvec{\theta }}}} = {\pi _b}\) and \(\Pr (h|\tilde{{\varvec{\theta }}})\) is the probability of h under \({\pi _b}\)) and then Corollary 2 follows directly from Theorem 1.
Note that for RIS(n) with \(n > 0\), the condition that \(\pi _{\tilde{{\varvec{\theta }}}} \in \varPi ^n\) can hold even if the distribution of \(A_t \sim \pi _{\tilde{{\varvec{\theta }}}}\) (i.e., \(A_t \sim {\pi _b}\)) is only conditioned on \(s_t\). This condition holds when \(\exists {\pi _{\varvec{\theta }}}\in \varPi ^n\) such that \(\forall s_{t-n}, a_{t-n}, \dots , s_t\) and \(\forall a_t\):
$$\begin{aligned} \pi _{\tilde{{\varvec{\theta }}}}(a_t | s_t) = {\pi _{\varvec{\theta }}}(a_t | s_{t-n}, a_{t-n},\dots , s_t), \end{aligned}$$
i.e., the action probabilities only vary with respect to the immediately preceding state.
Appendix 4: SEC variance proof
In this appendix we prove Proposition 2 from Sect. 3.2:
Proposition 4
Let \({{\text {Var}}_{}\left( {\text {EST}}\right) }\) denote the variance of estimator \({\text {EST}}\). Under Assumptions 6 and 7, for the Monte Carlo estimator, \({\text {MC}}\), and the SEC estimator, \({\text {SEC}}\):
$$\begin{aligned} {{\text {Var}}_{}\left( {\text {SEC}}(B)\right) } \le {{\text {Var}}_{}\left( {\text {MC}}(B)\right) } \end{aligned}$$
Recall that B is a set of state-action pairs collected by running the current policy \(\pi\). Let X be the random variable representing the states observed in B and let U be the random variable representing the actions observed in B. We will sometimes write \(\{X, U\}\) in place of B to make the composition of B explicit. Let \({{\text {Var}}_{X}\left( {\text {EST}}(\{X,U\})\right) }\) denote the variance of estimator \({\text {EST}}\) with respect to the state set X. Let \({{\text {Var}}_{U}\left( {\text {EST}}(\{X,U\}) | X = \mathcal {X}\right) }\) denote the variance of estimator \({\text {EST}}\) with respect to the action set U given \(X = \mathcal {X}\).
Under Assumptions 6 and 7, we make two claims about the SEC estimator.
Claim 1
\({{\text {Var}}_{U}\left( {\text {SEC}}(\{X, U\}) | X = \mathcal {X}\right) } = 0.\)
Proof
We can write either SEC or MC as:
$$\begin{aligned} {\text {EST}}(\{X, U\}) = \sum _{s \in \mathcal {S}} d_{\mathcal {B}}(s) \sum _{a \in \mathcal {A}} \pi _{\mathcal {B}}(a|s) w(s,a) \phi (s, a) \end{aligned}$$
(31)
where \(w(s,a) = \frac{\pi (a|s)}{\pi _{\mathcal {B}}(a|s)}\) for \({\text {SEC}}\) and \(w(s,a) = 1\) for \({\text {MC}}\). In Claim 1, the sampled states are fixed and variance only arises from \(\pi _{\mathcal {B}}\) and w(s, a), which vary for different realizations of U. When we choose \(w(s,a) = \frac{\pi (a|s)}{\pi _{\mathcal {B}}(a|s)}\) (as SEC does), the \(\pi _{\mathcal {B}}(a|s)\) factors cancel in Eq. 31. Since \(\pi _{\mathcal {B}}\) is the only part of SEC that depends on the random variable U, this choice of w(s, a) eliminates variance due to action selection in the estimator. This proves Claim 1.
Claim 2
\({\mathbf {E}_{U}\biggl [{\text {SEC}}(\{X, U\}) \biggm | X\biggr ]} = {\mathbf {E}_{U}\biggl [{\text {MC}}(\{X,U\})\biggm | X\biggr ]}.\)
Proof
Claim 2 follows from the same logic as Claim 1. For \({\text {SEC}}\), the cancellation of the \(\pi _{\mathcal {B}}(a|s)\) factors converts the inner summation over actions into an exact expectation under \(\pi\). For \({\text {MC}}\), the inner summation over actions is a sample average whose expectation over U is that same expectation under \(\pi\). Thus the expectation of both estimators conditioned on X is:
$${{\mathbf {E}}_{U}\biggl [{\text {EST}}(\{X, U\}) \biggm | X\biggr ]} = \sum _{s \in {\mathcal {S}}} d_{\mathcal {B}}(s) \sum _{a \in {\mathcal {A}}} \pi (a|s) \phi (s, a).$$
(32)
This proves Claim 2.
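The following single-state sketch illustrates Claims 1 and 2 numerically (it is not part of the proof). The three-action policy and the deterministic \(\phi\) values are illustrative assumptions, and the sketch implicitly assumes every action appears in each resampled batch, which holds with overwhelming probability at the batch size used.

```python
import numpy as np

rng = np.random.default_rng(3)

# Single-state example, so only action-selection randomness (the U term) remains.
pi = np.array([0.6, 0.3, 0.1])     # current policy pi(a), which also generates the batch
phi = np.array([1.0, 4.0, 10.0])   # deterministic phi(s, a) for the single state
k, n_reps = 200, 5_000

sec_vals, mc_vals = [], []
for _ in range(n_reps):
    A = rng.choice(3, size=k, p=pi)
    pi_B = np.bincount(A, minlength=3) / k         # empirical behavior policy
    mc_vals.append(phi[A].mean())
    # SEC weight w(a) = pi(a) / pi_B(a); the pi_B factors cancel as in Eq. (31).
    sec_vals.append(np.mean(pi[A] / pi_B[A] * phi[A]))

print("E_pi[phi]           :", pi @ phi)
print("Var over U: SEC, MC :", np.var(sec_vals), np.var(mc_vals))
print("E over U:   SEC, MC :", np.mean(sec_vals), np.mean(mc_vals))
```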
We can now prove Proposition 2.
Proposition 5
Let \({{\text {Var}}_{}\left( {\text {EST}}\right) }\) denote the variance of estimator \({\text {EST}}\). Under Assumptions 6 and 7, for the Monte Carlo estimator, \({\text {MC}}\), and the SEC estimator, \({\text {SEC}}\):
$$\begin{aligned} {{\text {Var}}_{}\left( {\text {SEC}}(B)\right) } \le {{\text {Var}}_{}\left( {\text {MC}}(B)\right) } \end{aligned}$$
Proof
Using the law of total variance, the variance of the general estimator given by (31) can be decomposed as:
$$\begin{aligned} {{\text {Var}}_{X, U}\left( {\text {EST}}\right) }&= \underbrace{{\mathbf {E} \biggl [ {{\text {Var}}_{U}\left( {\text {EST}}(\{X,U\})\right) } \biggm | X \sim \pi \biggr ] }}_{\Sigma _U} + \underbrace{{{\text {Var}}_{X}\left( {\mathbf {E} \biggl [ {\text {EST}}(\{X,U\}) \biggm | U \sim \pi \biggr ] }\right) }}_{\Sigma _X} \end{aligned}$$
The first term, \(\Sigma _U\), is the variance due to stochasticity in the action selection. From Claim 1, we know that for \({\text {SEC}}\) this term is zero while in general it is not zero for \({\text {MC}}\). The second term, \(\Sigma _X\), is the variance due to only visiting a finite number of states before computing the estimate. Claim 2 shows that this term is equal for both \({\text {SEC}}\) and \({\text {MC}}\). Thus the variance of \({\text {SEC}}\) is at most that of \({\text {MC}}\).
Appendix 5: Connection to the REG estimator
In this section we show that SEC and RIS can be viewed as approximations of the REG estimator studied by Li et al. (2015). This connection is notable because Li et al. showed that REG has asymptotically minimax-optimal MSE; however, in MDPs, REG requires knowledge of the environment's state transition probabilities and initial state distribution (Li et al. 2015), while SEC and RIS do not.
Li et al. (2015) introduce the regression estimator (REG) for policy evaluation in multi-armed bandit problems. We present it here as a general estimator for any function f. REG uses the available data to estimate the mean value of f for each action, \(f_D(a)\), and then computes the estimate:
$$\begin{aligned} {\text {REG}}(\pi , D) :=\sum _{a \in \mathcal {A}} \pi (a) f_D(a). \end{aligned}$$
In multi-armed bandit problems (MDPs with a single state and a horizon of length one), REG is identical to SEC and RIS(0), with f being the function \(\phi\) or \(\chi\), respectively.
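A minimal numerical sketch of this equivalence (the bandit instance, policies, and outcome model are illustrative assumptions): the plug-in form \(\sum _a \pi (a) f_D(a)\) coincides, up to floating-point rounding, with importance sampling that uses the empirical behavior policy \({\pi _{D}}(a) = c(a)/k\).

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative bandit instance: 3 actions, behavior policy pi_b, evaluation policy pi,
# and noisy outcomes f standing in for phi or chi.
pi_b = np.array([0.5, 0.3, 0.2])
pi = np.array([0.2, 0.3, 0.5])
k = 1_000
A = rng.choice(3, size=k, p=pi_b)
f = rng.normal(loc=np.array([1.0, 2.0, 3.0])[A], scale=0.5)

# REG: plug-in estimate sum_a pi(a) f_D(a), with f_D(a) the sample mean for action a.
# (Assumes every action is observed, which is essentially certain at this sample size.)
f_D = np.array([f[A == a].mean() for a in range(3)])
reg = pi @ f_D

# RIS(0) / SEC in this single-state, horizon-one setting: importance sampling with the
# empirical behavior policy pi_D(a) = c(a) / k.
pi_D = np.bincount(A, minlength=3) / k
ris0 = np.mean(pi[A] / pi_D[A] * f)

print("REG        :", reg)
print("RIS(0)/SEC :", ris0)   # identical up to floating-point rounding
```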
To apply REG to state-action expectations, one first estimates the mean \(\phi\) value over (s, a) pairs as \(\phi _D\) and then computes the estimate:
$$\begin{aligned} {\text {REG}}(\pi , {D}) = \sum _{s \in \mathcal {S}} \sum _{a \in \mathcal {A}} d_\pi (s) \pi (a|s) \phi _D(s, a) \end{aligned}$$
This estimate requires knowledge of \(d_\pi\) and is thus inapplicable to general RL tasks. To apply REG to trajectory expectations, one first estimates the mean \(\chi\) value for each observed trajectory as \(\chi _D(H)\) and then computes the estimate:
$$\begin{aligned} {\text {REG}}(\pi , {D}) = \sum _{H \in D} \Pr (H | \pi ) \chi _D(H) \end{aligned}$$
This estimate requires knowledge of \(d_0\) and P and is thus also inapplicable to general RL tasks.
We now elucidate a relationship between \({\text {RIS}}({l}-1)\) and REG even though they are different estimators. Let c(h) denote the number of times that trajectory h appears in D. We can rewrite REG as an importance sampling method:
$$\begin{aligned} {\text {REG}}(\pi , D)&= \sum _{h \in \mathcal {H}} \Pr (h | \pi ) \chi _D(h) \end{aligned}$$
(33)
$$\begin{aligned}&= \frac{1}{m} \sum _{h \in \mathcal {H}} c(h) \frac{\Pr (h | \pi )}{c(h) / m} \chi _D(h) \end{aligned}$$
(34)
$$\begin{aligned}&= \frac{1}{m} \sum _{i=1}^m\frac{\Pr (h_i | \pi )}{c(h_i) / m} \chi (h_i) \end{aligned}$$
(35)
The denominator in (35) can be re-written as a telescoping product to obtain an estimator that is similar to \({\text {RIS}}({l}-1)\):
$$\begin{aligned} {\text {REG}}(\pi , {D})&= \frac{1}{m} \sum _{i=1}^m\frac{\Pr (h_i | \pi )}{c(h_i) / m} \chi (h_i) \\&= \frac{1}{m} \sum _{i=1}^m\frac{\Pr (h_i | \pi )}{\frac{c(s_0)}{m}\frac{c(s_0, a_0)}{c(s_0)}\cdots \frac{c(h_i)}{c(h_i / a_{{l}-1})}}\chi (h_i) \\&= \frac{1}{m} \sum _{i=1}^m\frac{d_0(s_0) \pi (a_0 | s_0) P(s_1|s_0, a_0) \cdots }{\hat{d}(s_0){\pi _{D}}(a_0|s_0)\hat{P}(s_1 | s_0, a_0) \cdots }\\&\quad \frac{\cdots P(s_{{l}-1} | s_{{l}-2}, a_{{l}-2})\pi (a_{{l}-1} | s_{{l}-1})}{ \cdots \hat{P}(s_{{l}-1} | h_{0:{l}-1}) {\pi _{D}}(a_{{l}-1} | h_{0:{l}-1}) } \chi (h_i). \end{aligned}$$
This expression differs from \({\text {RIS}}({l}-1)\) in two ways:
1. The numerator includes the initial state distribution and transition probabilities of the environment.
2. The denominator includes count-based estimates of the initial state distribution and transition probabilities of the environment, where the transition probabilities are conditioned on all past states and actions.
If we assume that the empirical estimates of the environment probabilities in the denominator are equal to the true environment probabilities then these factors cancel and we obtain the \({\text {RIS}}({l}-1)\) estimate. This assumption will almost always be false except in deterministic environments. However, showing that \({\text {RIS}}({l}-1)\) is approximating REG suggests that \({\text {RIS}}({l}-1)\) may have similar theoretical properties to those derived for REG by Li et al. (2015). Our SinglePath experiment (see Fig. 10 in Sect. 6) supports this conjecture: \({\text {RIS}}({l}-1)\) has high bias in the low to medium sample-size range but asymptotically lower MSE compared to other methods. REG has even higher bias in the low to medium sample-size range but asymptotically lower MSE compared to \({\text {RIS}}({l}-1)\). RIS with smaller n appears to have lower initial bias but larger MSE as the sample size grows. The asymptotic benefit of RIS for all n is also corroborated by Corollary 2 in “Appendix 3”, though Corollary 2 does not tell us anything about how different RIS methods compare. The asymptotic benefit of REG compared to RIS methods can be understood as REG correcting for sampling error in both the action selection and the state transitions. Similar conclusions can be drawn for a comparison between SEC and REG.
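The importance-sampling rewriting in (33)-(35) can also be checked numerically. The sketch below is an illustrative sanity check only: it assumes a small, explicitly enumerated set of trajectories, a deterministic \(\chi\) per trajectory, and that unobserved trajectories contribute zero to the plug-in sum.

```python
import numpy as np

rng = np.random.default_rng(5)

# Four distinct trajectories with behavior probabilities q = Pr(h | pi_b),
# evaluation probabilities pr_pi = Pr(h | pi), and deterministic chi(h).
q = np.array([0.4, 0.3, 0.2, 0.1])
pr_pi = np.array([0.1, 0.2, 0.3, 0.4])
chi = np.array([1.0, 2.0, 3.0, 4.0])
m = 500
H = rng.choice(4, size=m, p=q)        # sampled trajectory indices
c = np.bincount(H, minlength=4)       # trajectory counts c(h)

observed = c > 0
reg_33 = np.sum(pr_pi[observed] * chi[observed])       # plug-in form, Eq. (33)
reg_35 = np.mean(pr_pi[H] / (c[H] / m) * chi[H])       # importance-sampling form, Eq. (35)
print(reg_33, reg_35)                  # agree up to floating-point rounding
```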