I2RL: online inverse reinforcement learning under occlusion


Inverse reinforcement learning (IRL) is the problem of learning the preferences of an agent from observing its behavior on a task. It inverts RL, which focuses on learning an agent's behavior on a task based on the reward signals received. IRL is witnessing sustained attention due to promising applications in robotics, computer games, and finance, as well as in other sectors. Methods for IRL have, for the most part, focused on batch settings where the observed agent's behavioral data has already been collected. However, the related problem of online IRL—where observations are incrementally accrued, yet the real-time demands of the application often prohibit a full rerun of an IRL method—has received significantly less attention. We introduce the first formal framework for online IRL, called incremental IRL (I2RL), which can serve as a common ground for online IRL methods. We demonstrate the usefulness of this framework by casting existing online IRL techniques into it. Importantly, we present a new method that advances maximum entropy IRL with hidden variables to the online setting. Our analysis shows that the new method has monotonically improving performance with more demonstration data as well as probabilistically bounded error, both under full and partial observability. Simulated and physical robot experiments in a multi-robot patrolling application situated in varied-sized worlds, which involves learning under high levels of occlusion, show a significantly improved performance of I2RL as compared to both batch IRL and an online imitation learning method.




Notes

  1. Repeated trajectories in a demonstration can usually be excluded for many methods without affecting the learning.

  2. This assumption holds when each session starts from the same state and the trajectories are produced by the expert's fixed policy. Under occlusion, even though inferring the hidden portion Z of a trajectory \(X \in {\mathscr {X}}_i\) is influenced by the visible portion Y, this does not make the trajectories dependent on each other.

  3. As more trajectory data is provided to GAIL, the accuracy of the expert's estimated occupancy measure for the occluded state-action pairs improves, which helps GAIL achieve its objective of minimizing the regularized cost.


References

  1. Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Twenty-first international conference on machine learning (ICML), pp. 1–8.

  2. Aghasadeghi, N., & Bretl, T. (2011). Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals. In 2011 IEEE/RSJ international conference on intelligent robots and systems, pp. 1561–1566.

  3. Amin, K., Jiang, N., & Singh, S. (2017). Repeated inverse reinforcement learning. In Advances in neural information processing systems, pp. 1815–1824.

  4. Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 469–483.

  5. Arora, S., & Doshi, P. (2018). A survey of inverse reinforcement learning: Challenges, methods and progress. CoRR arXiv:1806.06877.

  6. Arora, S., Doshi, P., & Banerjee, B. (2019). Online inverse reinforcement learning under occlusion. In Proceedings of the 18th international conference on autonomous agents and multiagent systems, AAMAS '19, pp. 1170–1178.

  7. Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2000). Gambling in a rigged casino: The adversarial multi-armed bandit problem. Electronic Colloquium on Computational Complexity (ECCC), 7(68).

  8. Babes-Vroman, M., Marivate, V., Subramanian, K., & Littman, M. (2011). Apprenticeship learning about multiple intentions. In 28th international conference on machine learning (ICML), pp. 897–904.

  9. Bogert, K., & Doshi, P. (2014). Multi-robot inverse reinforcement learning under occlusion with interactions. In Proceedings of the 2014 international conference on autonomous agents and multi-agent systems, AAMAS '14, pp. 173–180.

  10. Bogert, K., & Doshi, P. (2015). Toward estimating others' transition models under occlusion for multi-robot IRL. In 24th international joint conference on artificial intelligence (IJCAI), pp. 1867–1873.

  11. Bogert, K., & Doshi, P. (2017). Scaling expectation-maximization for inverse reinforcement learning to multiple robots under occlusion. In Proceedings of the 16th conference on autonomous agents and multiagent systems, AAMAS '17, pp. 522–529.

  12. Bogert, K., Lin, J. F. S., Doshi, P., & Kulic, D. (2016). Expectation-maximization for inverse reinforcement learning with hidden data. In 2016 international conference on autonomous agents and multiagent systems, pp. 1034–1042.

  13. Boularias, A., Kober, J., & Peters, J. (2011). Relative entropy inverse reinforcement learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics (AISTATS), pp. 182–189.

  14. Boularias, A., Krömer, O., & Peters, J. (2012). Structured apprenticeship learning. In European conference on machine learning and knowledge discovery in databases, Part II, pp. 227–242.

  15. Choi, J., & Kim, K. E. (2011). Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research, 12, 691–730.

  16. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39, 1–38.

  17. Dudík, M., Phillips, S. J., & Schapire, R. E. (2004). Performance guarantees for regularized maximum entropy density estimation. In J. Shawe-Taylor & Y. Singer (Eds.), Learning theory (pp. 472–486). Springer.

  18. Gerkey, B., Vaughan, R. T., & Howard, A. (2003). The Player/Stage project: Tools for multi-robot and distributed sensor systems. In Proceedings of the 11th international conference on advanced robotics, vol. 1.

  19. Herman, M., Fischer, V., Gindele, T., & Burgard, W. (2015). Inverse reinforcement learning of behavioral models for online-adapting navigation strategies. In 2015 IEEE international conference on robotics and automation (ICRA), pp. 3215–3222.

  20. Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. Advances in Neural Information Processing Systems (NIPS), 29, 4565–4573.

  21. Jin, Z. J., Qian, H., Chen, S. Y., & Zhu, M. L. (2010). Convergence analysis of an incremental approach to online inverse reinforcement learning. Journal of Zhejiang University-Science C, 12(1), 17–24.

  22. Kamalaruban, P., Devidze, R., Cevher, V., & Singla, A. (2019). Interactive teaching algorithms for inverse reinforcement learning. arXiv preprint arXiv:1905.11867.

  23. Kitani, K. M., Ziebart, B. D., Bagnell, J. A., & Hebert, M. (2012). Activity forecasting. In 12th European conference on computer vision, pp. 201–214.

  24. Kivinen, J., & Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1), 1–63.

  25. Levine, S., Popović, Z., & Koltun, V. (2010). Feature construction for inverse reinforcement learning. In Proceedings of the 23rd international conference on neural information processing systems (NIPS), pp. 1342–1350.

  26. Ng, A., & Russell, S. (2000). Algorithms for inverse reinforcement learning. In Seventeenth international conference on machine learning, pp. 663–670.

  27. Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., & Peters, J. (2018). An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(2), 1–179.

  28. Ramachandran, D., & Amir, E. (2007). Bayesian inverse reinforcement learning. In 20th international joint conference on artificial intelligence (IJCAI), pp. 2586–2591.

  29. Ratliff, N., Bagnell, J., & Zinkevich, M. (2007). (Online) subgradient methods for structured prediction. Journal of Machine Learning Research - Proceedings Track, 2, 380–387.

  30. Ratliff, N. D., Bagnell, J. A., & Zinkevich, M. A. (2006). Maximum margin planning. In 23rd international conference on machine learning, pp. 729–736.

  31. Rhinehart, N., & Kitani, K. M. (2017). First-person activity forecasting with online inverse reinforcement learning. In International conference on computer vision (ICCV).

  32. Russell, S. (1998). Learning agents for uncertain environments (extended abstract). In Eleventh annual conference on computational learning theory, pp. 101–103.

  33. Steinhardt, J., & Liang, P. (2014). Adaptivity and optimism: An improved exponentiated gradient algorithm. In 31st international conference on machine learning, pp. 1593–1601.

  34. Trivedi, M., & Doshi, P. (2018). Inverse learning of robot behavior for collaborative planning. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 1–9.

  35. Wang, S., Rosenfeld, R., Zhao, Y., & Schuurmans, D. (2002). The latent maximum entropy principle. In IEEE international symposium on information theory, pp. 131–131.

  36. Wang, S., Schuurmans, D., & Zhao, Y. (2012). The latent maximum entropy principle. ACM Transactions on Knowledge Discovery from Data, 6(8).

  37. Wulfmeier, M., & Posner, I. (2015). Maximum entropy deep inverse reinforcement learning. arXiv preprint.

  38. Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In 23rd national conference on artificial intelligence, pp. 1433–1438.

  39. Ziebart, B. D., Ratliff, N., Gallagher, G., Mertz, C., Peterson, K., Bagnell, J. A., Hebert, M., Dey, A. K., & Srinivasa, S. (2009). Planning-based prediction for pedestrians. In Proceedings of the 2009 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 3931–3936.



We thank Kenneth Bogert for insightful discussions, and the anonymous reviewers for helpful comments and suggestions. This work was supported in part by a research contract with the Toyota Research Institute of North America (TRI-NA), and by National Science Foundation grants IIS-1830421 (to PD) and IIS-1526813 (to BB).

Author information



Corresponding author

Correspondence to Prashant Doshi.




\({\mathbb {X}}\) is the space of all possible trajectories. The expected value of any feature \(\phi _k \in [0,1], k\in \{1,2 \ldots K\}\), over a trajectory X is given by the function \(f_k:{\mathbb {X}} \rightarrow {\mathbb {R}}\) defined as \(f_k(X) = \sum _{\langle s,a \rangle _t \in X} \gamma ^t \phi _k(\langle s, a \rangle _t)\). Although a trajectory in a non-terminating MDP can be infinitely long, we first derive the range of \(f_k\) for bounded-length trajectories and then extend it by taking the limit. Let \(T_{max}\) be the maximum length of any trajectory, \(0 \le |X| \le T_{max}\).


$$\begin{aligned}&\sum ^{T_{max}-1}_{t=0} \gamma ^t \cdot 0 \le \sum _{\langle s,a \rangle _t \in X} \gamma ^t \phi _k(\langle s, a \rangle _t) \le \sum ^{T_{max}-1}_{t=0} \gamma ^t \\&0 \le f_k(X) \le (1-\gamma ^{T_{max}}) / (1-\gamma ) \\ \end{aligned}$$

Applying the limit \(T_{max} \rightarrow \infty\) gives us

$$\begin{aligned} 0 \le f_k(X) \le \frac{1}{1-\gamma } \end{aligned}$$
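This bound can be sanity-checked numerically. A minimal sketch (the feature values and trajectory length are arbitrary illustrations, not taken from the experiments):

```python
import random

def discounted_feature_sum(features, gamma):
    """f_k(X) = sum_t gamma^t * phi_k(<s,a>_t) for a single trajectory."""
    return sum((gamma ** t) * phi for t, phi in enumerate(features))

gamma = 0.9
random.seed(0)
# phi_k in [0, 1] at every time step; the trajectory length is arbitrary
features = [random.random() for _ in range(500)]
f_k = discounted_feature_sum(features, gamma)

# 0 <= f_k(X) <= 1 / (1 - gamma) for any trajectory
assert 0.0 <= f_k <= 1.0 / (1.0 - gamma)
```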

Extending the definition to all K features, we introduce the function \(f:{\mathbb {X}} \rightarrow {\mathbb {R}}^K\) as \(f(X) = \sum _{\langle s,a \rangle _t \in X} \gamma ^t \phi (\langle s, a \rangle _t)\).

Note that the learned feature expectations can be expressed in terms of \(f_k\) as

$$\begin{aligned}E_{{\mathbb {X}}}[\phi _k]\triangleq \sum \nolimits _{X \in {\mathbb {X}}} Pr(X)~f_k(X) ,\ k=1\ldots K\end{aligned}$$

The sessions of latent and full-observation MaxEnt IRL update the estimated feature expectations of the expert as follows.

$$\begin{aligned} {\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k}&\triangleq \frac{1}{|{\mathscr {Y}}{}_{1:i}|} \sum \limits _{Y \in {\mathscr {Y}}{}_{1:i}} \sum \limits _{Z \in {\mathbb {Z}}} Pr(Z|Y;{\varvec{\theta }})\nonumber \\&\sum _{\langle s,a \rangle _t \in Y \cup Z} \gamma ^t \phi _k(\langle s, a \rangle _t)\nonumber \\&= \frac{|{\mathscr {Y}}{}_{1:i-1}|}{|{\mathscr {Y}}{}_{1:i-1}|+|{\mathscr {Y}}{}_i|} ~{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1},k} + \frac{|{\mathscr {Y}}{}_i|}{|{\mathscr {Y}}{}_{1:i-1}|+|{\mathscr {Y}}{}_i|}~{\hat{\phi }}^{Z|Y,i}_{{\varvec{\theta }}^{i},k} \end{aligned}$$
$$\begin{aligned} {\hat{\phi }}^{1:i}_{k}&\triangleq \frac{1}{|{\mathscr {X}}{}_{1:i}|} \sum \limits _{X \in {\mathscr {X}}{}_{1:i}} \sum _{\langle s,a \rangle _t \in X} \gamma ^t \phi _k(\langle s, a \rangle _t)\nonumber \\&=\frac{|{\mathscr {X}}{}_{1:i-1}|}{|{\mathscr {X}}{}_{1:i-1}|+|{\mathscr {X}}{}_i|}~{\hat{\phi }}^{1:i-1}_{k} +\frac{|{\mathscr {X}}{}_i|}{|{\mathscr {X}}{}_{1:i-1}|+|{\mathscr {X}}{}_i|}~{\hat{\phi }}^{i}_k \end{aligned}$$

From the definitions of feature expectations and Eq. 10, \(E_{{\mathbb {X}}}[\phi _k],\) \({\hat{\phi }}^{1:i}_{k},{\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k} \in \left[ 0,\frac{1}{1-\gamma }\right]\).
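The session updates above are weighted running means, so the incremental estimate coincides with a batch computation over all sessions. A minimal sketch (session sizes and values are made up for illustration):

```python
def merge_feature_expectations(phi_prev, n_prev, phi_new, n_new):
    """Weighted running mean across sessions:
    phi_{1:i} = (n_{1:i-1} * phi_{1:i-1} + n_i * phi_i) / (n_{1:i-1} + n_i)."""
    total = n_prev + n_new
    return [(n_prev * p + n_new * q) / total for p, q in zip(phi_prev, phi_new)]

# Per-session feature-expectation estimates (K = 2) and session sizes |Y_i|
sessions = [[0.2, 0.5], [0.4, 0.1], [0.9, 0.3]]
sizes = [10, 5, 20]

# Incremental merge, one session at a time
phi, n = sessions[0], sizes[0]
for phi_i, n_i in zip(sessions[1:], sizes[1:]):
    phi = merge_feature_expectations(phi, n, phi_i, n_i)
    n += n_i

# Equals the batch mean over all trajectories
batch = [sum(s[k] * m for s, m in zip(sessions, sizes)) / sum(sizes)
         for k in range(2)]
assert all(abs(a - b) < 1e-12 for a, b in zip(phi, batch))
```

This is why a session only needs the previous estimate and the previous count, not the earlier demonstrations themselves.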

We give below the proofs of the theorems and lemmas stated in the main exposition of the article.

Proof of Theorem 1

We use the notation:

$$\begin{aligned}E_{{\mathbb {X}}}[\phi _k]\triangleq \sum _{X \in {\mathbb {X}}}Pr(X)~f_k(X),\ k=1\ldots K\end{aligned}$$

By allowing a relaxation of the constraints in the maximum entropy estimation problem, Dudík et al. [17] derived sample complexity bounds for the problem.

$$\begin{aligned} \begin{array}{l} \max \limits _{\varDelta } \left( -\sum \nolimits _{X \in {\mathbb {X}}} Pr(X)~ log~Pr(X) \right) \\ \\ {\text{ subject to}}~~~ \sum \nolimits _{X \in {\mathbb {X}}} Pr(X) = 1 \\ \left| E_{{\mathbb {X}}}[\phi _k] - {\hat{\phi }}^{1:i}_{k} \right| \le \beta ^{full}_k ~~~~~~ \forall k\in \{1 \ldots K \} \end{array} \end{aligned}$$

Here \(\beta ^{full} \in {\mathbb {R}}^K\) is a vector of upper bounds on the differences between \(E_{{\mathbb {X}}}[\phi _k]\) and \({\hat{\phi }}^{1:i}_{k}\).

Following the proofs by Dudík et al. [17], the relaxed-constraint maximum entropy IRL problem is equivalent to \(\min _{{\varvec{\theta }}} \big (-\sum _{X\in {\mathscr {X}}{}_{1:i}} {\tilde{Pr}}(X) \log Pr(X|{\varvec{\theta }})+ \sum _k \beta ^{full}_k |\theta _k| \big ) = \min _{{\varvec{\theta }}} \big (- LL ({\varvec{\theta }}|{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) + \sum _k \beta ^{full}_k |\theta _k|\big ) = \min _{{\varvec{\theta }}} NLL_{\beta ^{full}} ({\varvec{\theta }}|{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1})\), say.

The proof here is partially inspired by Corollary 1 in [17]. Let \(\beta ^{full}_k =\beta ^{full}_c = \varepsilon /(1-\gamma )\) for all \(k \in \{1 \ldots K\}\), where \(\beta ^{full}_c\) is constant because \(\varepsilon\) is a fixed input. For the normalized exponentiated gradient descent used here to compute the maximum, \(\sum \nolimits _{k=1}^{K} |\theta _k|=1\). Then, \(NLL_{\beta ^{full}} ({\varvec{\theta }}|{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) = - LL ({\varvec{\theta }}|{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) + \beta ^{full}_c \sum \nolimits _{k=1}^{K} |\theta _k| = - LL ({\varvec{\theta }}|{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) + \beta ^{full}_c\). Let \({\varvec{\theta }}^i\) minimize \(NLL_{\beta ^{full}} ({\varvec{\theta }}|{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1})\); since the penalty term is constant, \({\varvec{\theta }}^i\) is also the solution maximizing \(LL ({\varvec{\theta }}|{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1})\).
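The normalization \(\sum _k |\theta _k| = 1\) is preserved by the normalized exponentiated-gradient step of Kivinen and Warmuth [24]. A minimal sketch, assuming non-negative weights and a fixed placeholder gradient (a real implementation would use the difference between model and empirical feature expectations):

```python
import math

def eg_update(theta, grad, eta=0.1):
    """Normalized exponentiated gradient step: weights remain on the
    probability simplex, so sum_k |theta_k| = 1 after every update
    (non-negative theta_k assumed)."""
    unnorm = [t * math.exp(-eta * g) for t, g in zip(theta, grad)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

theta = [0.25, 0.25, 0.25, 0.25]
for _ in range(100):
    # placeholder gradient; components with negative gradient gain weight
    grad = [0.3, -0.1, 0.2, -0.4]
    theta = eg_update(theta, grad)

assert abs(sum(theta) - 1.0) < 1e-9 and all(t >= 0 for t in theta)
```

The multiplicative form keeps every weight positive, and the renormalization enforces the simplex constraint exactly at each step.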

Since \(E_{{\mathbb {X}}}[\phi _k]\in \left[ 0,\frac{1}{1-\gamma }\right]\), we get \((1-\gamma ) E_{{\mathbb {X}}}[\phi _k] \in \left[ 0,1\right]\). We multiply the relaxed constraint by \((1-\gamma )\) and define the negation of the constraint as the following event: \(A: \left| (1-\gamma ) E_{{\mathbb {X}}}[\phi _k] - (1-\gamma ) {\hat{\phi }}^{1:i}_{k} \right| > (1-\gamma )\beta ^{full}_c = \varepsilon\) for some \(k\in \{1 \ldots K \}\). A can be decomposed into the following feature-specific events

$$\begin{aligned}A_k: (1-\gamma )\left| E_{{\mathbb {X}}}[\phi _k] - {\hat{\phi }}^{1:i}_{k} \right| > \varepsilon ,\end{aligned}$$

where \(k \in \{1,2 \ldots K\}\). We further split each absolute-value constraint into two signed events:

$$\begin{aligned}&(A_k)_1: (1-\gamma ) E_{{\mathbb {X}}}[\phi _k] - (1-\gamma ) {\hat{\phi }}^{1:i}_{k}> \varepsilon \\&(A_k)_2: -(1-\gamma ) E_{{\mathbb {X}}}[\phi _k] + (1-\gamma ) {\hat{\phi }}^{1:i}_{k} > \varepsilon \end{aligned}$$

Then event A is the same as the logical disjunction \((A_1)_1 \vee (A_1)_2 \vee (A_2)_1 \vee \ldots\).

Applying Hoeffding's inequality, the probability of each signed event is bounded as \(P((A_k)_1) \le \exp (-2 \varepsilon ^2 |{\mathscr {X}}{}_{1:i}|) = \frac{\delta }{2K}\) (say), and likewise \(P((A_k)_2) \le \frac{\delta }{2K}\).

Applying these bounds to the events for each of the K features, we get 2K events, each with the same upper bound \(\frac{\delta }{2K}\) on its probability. We use Fréchet's inequality to derive an upper bound for the disjunction:

$$\begin{aligned} P(A) = P((A_1)_1 \vee (A_1)_2 \vee (A_2)_1 \ldots ) \le \min (1,P((A_1)_1)+P((A_1)_2)+ P((A_2)_1)\ldots ) \end{aligned}$$

As each of the probabilities on the RHS is bounded from above by \(\frac{\delta }{2K}\), their sum is bounded as:

$$\begin{aligned} P(A) \le \min \left(1,\sum _{1}^{2K} \frac{\delta }{2K} \right) =\min (1,\delta ) \end{aligned}$$
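Read as a sample-complexity statement, the bound says that guaranteeing confidence \(\delta\) at accuracy \(\varepsilon\) requires \(|{\mathscr {X}}{}_{1:i}| \ge \ln (2K/\delta )/(2\varepsilon ^2)\) trajectories. A minimal numeric sketch (the chosen values of K, \(\varepsilon\), \(\delta\) are illustrative only):

```python
import math

def confidence_delta(epsilon, n_trajectories, K):
    """delta = 2K * exp(-2 * eps^2 * N), capped at 1 (Hoeffding + union bound)."""
    return min(1.0, 2 * K * math.exp(-2 * epsilon ** 2 * n_trajectories))

def trajectories_needed(epsilon, delta, K):
    """Invert the bound: N >= ln(2K / delta) / (2 * eps^2)."""
    return math.ceil(math.log(2 * K / delta) / (2 * epsilon ** 2))

K, eps, delta = 8, 0.1, 0.05
n = trajectories_needed(eps, delta, K)

assert confidence_delta(eps, n, K) <= delta                 # enough data meets delta
assert confidence_delta(eps, 10, K) >= confidence_delta(eps, 1000, K)  # more data helps
```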

Reverting to the negation of A, the probability that \(\left| (1-\gamma )E_{{\mathbb {X}}}[\phi _k] - (1-\gamma ) {\hat{\phi }}^{1:i}_{k} \right| \le \varepsilon\) for all \(k\in \{1 \ldots K \}\) is at least \(1-\min (1,{\delta } )=\max (0,1-\delta )\).

To keep the reward value bounded, IRL assumes \(\vert \vert {\varvec{\theta }}^* \vert \vert _1 \le 1\) for all \({\varvec{\theta }}^*\). Using this assumption and Theorem 1 in [17], we get the error bound:

For every \({\varvec{\theta }}^* \in [0,1]^K\), \(NLL_{\beta ^{full}} ({\varvec{\theta }}^i|{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) - NLL_{\beta ^{full}} ({\varvec{\theta }}^*|{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) \le 2 \sum \nolimits _{k=1}^{K} \beta ^{full}_c = 2 K \beta ^{full}_c = \frac{2 K \varepsilon }{1-\gamma }\) with probability at least \(\max (0,1-\delta )\), where \(\delta = 2K \exp \big (-2\varepsilon ^2|{\mathscr {X}}{}_{1:i}| \big )\).

We rewrite the bound in terms of the positive log-likelihood of the expert's policy, using the relation \(NLL_{\beta ^{full}} ({\varvec{\theta }}^*|{\mathscr {X}}{}_{1:i}) = - LL ({\varvec{\theta }}^*|{\mathscr {X}}{}_{1:i}) + \sum \nolimits _{k=1}^{K} \beta ^{full}_k |\theta _k|\) and setting \({\varvec{\theta }}^*= {\varvec{\theta }}_E\).

Then, given \({\mathscr {X}}{}_{1:i}\) as input, with probability at least \(\max (0,1-\delta )\),

$$\begin{aligned}&NLL_{\beta ^{full}} ({\varvec{\theta }}^i|{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) - NLL_{\beta ^{full}} ({\varvec{\theta }}_E|{\mathscr {X}}{}_{1:i}) \\&\quad = LL({\varvec{\theta }}_E|{\mathscr {X}}{}_{1:i}) -LL({\varvec{\theta }}^i |{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) \le \frac{2 K \varepsilon }{(1-\gamma )}. \end{aligned}$$


Proof of Lemma 1

The log-likelihood of the demonstrated behavior can be split as

$$\begin{aligned}&LL({\varvec{\theta }}^i|{\mathscr {Y}}{}_{1:i}) = LL({\varvec{\theta }}^i |{\mathscr {Y}}{}_i,|{\mathscr {Y}}{}_{i-1}|,{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1}},{\varvec{\theta }}^{i-1})\\&\quad = \sum \limits _{Y \in {\mathscr {Y}}{}_{1:i}} {\tilde{Pr}}(Y) ~log~{Pr(Y;{\varvec{\theta }})} \\&\quad = \sum \limits _{Y \in {\mathscr {Y}}{}_{1:i}} {\tilde{Pr}}(Y)~\sum \limits _{Z \in {\mathbb {Z}}} Pr(Z|Y;{\varvec{\theta }}^i)~log~Pr(Y,Z;{\varvec{\theta }})\\&\qquad + \left(- \sum \limits _{Y \in {\mathscr {Y}}{}_{1:i}} {\tilde{Pr}}(Y)~\sum \limits _{Z \in {\mathbb {Z}}} Pr(Z|Y;{\varvec{\theta }}^i)~log~Pr(Z|Y;{\varvec{\theta }})\right)\\&\quad =Q({\mathscr {Y}}{}_i,|{\mathscr {Y}}{}_{i-1}|,{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1}},{\varvec{\theta }},{\varvec{\theta }}^{i-1}) + C({\mathscr {Y}}{}_{1:i},{\varvec{\theta }},{\varvec{\theta }}^i) \end{aligned}$$

Here \({\tilde{Pr}}\) is the empirical distribution of trajectories in the observed training data (\(\sum \limits _{X \in {\mathscr {X}}{}} {\tilde{Pr}}(X) [\cdot ]\) and \(\frac{1}{|{\mathscr {X}}{}|} \sum \limits _{X \in {\mathscr {X}}{}} [\cdot ]\) can be used interchangeably). The EM method maximizes the log-likelihood by maximizing only the Q value over \({\varvec{\theta }}\), and \({\varvec{\theta }}={\varvec{\theta }}^i\) maximizes \(Q({\mathscr {Y}}{}_i,|{\mathscr {Y}}{}_{i-1}|,{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1}},{\varvec{\theta }},{\varvec{\theta }}^{i-1})\) [36]. After all the EM iterations for the current session i, the final Q value is \(Q({\mathscr {Y}}{}_i,|{\mathscr {Y}}{}_{i-1}|,{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1}},{\varvec{\theta }}^i,{\varvec{\theta }}^i)\). Therefore, the difference in the likelihoods achieved by the weights learned in consecutive sessions can be expressed as a difference in Q values. Note that LME IRL learns the reward weights by inferring the maximum entropy distribution \(Pr(X;{\varvec{\theta }}) = \frac{\exp ( \sum _k \theta _k f_k(X))}{ \varOmega ^{X}_{{\varvec{\theta }}}}\), where \(\varOmega ^{X}_{{\varvec{\theta }}} = \sum \nolimits _{X \in {\mathbb {X}}} \exp (\sum _k \theta _k f_k(X))\) and \(X=(Y,Z)\).
Expanding the Q value: \(Q({\mathscr {Y}}{}_i,|{\mathscr {Y}}{}_{i-1}|,{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1}},{\varvec{\theta }}^i,{\varvec{\theta }}^i) = \sum \limits _{Y \in {\mathscr {Y}}{}_{1:i}} {\tilde{Pr}}(Y)~\sum \limits _{Z \in {\mathbb {Z}}} Pr(Z|Y;{\varvec{\theta }}^i) \log \left( \frac{\exp ( \sum _k \theta _k^i f_k((Y,Z)))}{ \varOmega ^{(Y,Z)}_{{\varvec{\theta }}^i}} \right) = \sum _k \theta ^i_k \sum \limits _{Y \in {\mathscr {Y}}{}_{1:i}} {\tilde{Pr}}(Y)~\sum \limits _{Z \in {\mathbb {Z}}} Pr(Z|Y;{\varvec{\theta }}^i)\, f_k((Y,Z)) - \log \varOmega ^{(Y,Z)}_{{\varvec{\theta }}^i} = \sum _k \theta ^i_k \, {\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k} - \log \varOmega ^{(Y,Z)}_{{\varvec{\theta }}^i}\).

Therefore the improvement in log likelihood over session i is

$$\begin{aligned}&LL({\varvec{\theta }}^i|{\mathscr {Y}}{}_i,|{\mathscr {Y}}{}_{i-1}|,{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1}},{\varvec{\theta }}^{i-1})-LL({\varvec{\theta }}^{i-1}|{\mathscr {Y}}{}_{i-1},|{\mathscr {Y}}{}_{i-2}|,{\hat{\phi }}^{Z|Y,1:i-2}_{{\varvec{\theta }}^{i-2}},{\varvec{\theta }}^{i-2})\\&\quad =Q({\mathscr {Y}}{}_i,|{\mathscr {Y}}{}_{i-1}|,{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1}},{\varvec{\theta }}^i,{\varvec{\theta }}^i) - Q({\mathscr {Y}}{}_{i-1},|{\mathscr {Y}}{}_{i-2}|,{\hat{\phi }}^{Z|Y,1:i-2}_{{\varvec{\theta }}^{i-2}},{\varvec{\theta }}^{i-1},{\varvec{\theta }}^{i-1}) \\&\quad = \sum _k \theta ^i_k {\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k} - \log \varOmega ^{(Y,Z)}_{{\varvec{\theta }}^i} - \sum _k \theta ^{i-1}_k {\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1},k} + \log \varOmega ^{(Y,Z)}_{{\varvec{\theta }}^{i-1}} \\&\quad = \log \frac{\varOmega ^{(Y,Z)}_{{\varvec{\theta }}^{i-1}}}{\varOmega ^{(Y,Z)}_{{\varvec{\theta }}^i}} + \sum _k \left( \theta ^i_k \frac{|{\mathscr {Y}}{}_{1:i-1}|}{|{\mathscr {Y}}{}_i|+|{\mathscr {Y}}{}_{1:i-1}|} - \theta ^{i-1}_k \right) {\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1},k} \\&\qquad + \sum _k \left( \theta ^i_k \frac{|{\mathscr {Y}}{}_i|}{|{\mathscr {Y}}{}_i|+|{\mathscr {Y}}{}_{1:i-1}|} {\hat{\phi }}^{Z|Y,i}_{{\varvec{\theta }}^{i},k} \right) \\&\qquad {\text {(substituting }}{\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k}{\text { using Eq. }}8{\text { and simplifying)}} \end{aligned}$$

The final expression is minimized only at \({\varvec{\theta }}^i= {\varvec{\theta }}^{i-1}\) when \(|{\mathscr {Y}}{}_{1:i-1}|\gg |{\mathscr {Y}}{}_i|\), i.e., when a significant amount of training data has been accumulated, and the expression is concave in the parameter \({\varvec{\theta }}^i\). Therefore, \(LL({\varvec{\theta }}^i|{\mathscr {Y}}{}_i,|{\mathscr {Y}}{}_{i-1}|,{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1}},{\varvec{\theta }}^{i-1})-LL({\varvec{\theta }}^{i-1}|{\mathscr {Y}}{}_{i-1},|{\mathscr {Y}}{}_{i-2}|,{\hat{\phi }}^{Z|Y,1:i-2}_{{\varvec{\theta }}^{i-2}},{\varvec{\theta }}^{i-2})\ge 0\) for consecutive sessions thereafter. Hence, LME I2RL converges over the sequence of sessions, yielding a feasible log-linear solution to latent MaxEnt and corresponding weights solving the IRL problem. \(\square\)

Proof of Lemma 2

We define the event \(A_k: (1-\gamma )\big | E_{{\mathbb {X}}}[\phi _k] - {\hat{\phi }}^{1:i}_{k} \big | > \varepsilon\), \(k \in \{1,2 \ldots K\}\).

Applying Hoeffding's inequality for \(A_k\), we get \(P(A_k) \le 2\exp (-2 \varepsilon ^2 |{\mathscr {X}}{}_{1:i}|) \le \frac{\delta }{K}\) for any \(k \in \{1,2 \ldots K\}\), and for the same \(\varepsilon ,\delta\) as in Theorem 1. Similarly, for partial observation, given \(\varepsilon _{s}\) as the bound on the error of the sampling-based approximation of \({\hat{\phi }}^{1:i}_l\) by \({\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^{i},l}\) using \(n_{s}\) samples, let us define the event

$$\begin{aligned} B_l: (1-\gamma )\left| {\hat{\phi }}^{1:i}_l - {\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^{i},l} \right| > \varepsilon _{s},l \in \{1,2 \ldots K\}. \end{aligned}$$

Following the same procedure as for \(P(A_k)\), applying the Hoeffding bound gives us \(P(B_l) < \frac{\delta _{s}}{K}\), where \(\delta _{s}=2K \exp (-2 \varepsilon _{s}^2 n_{s})\).

Applying Fréchet's inequality over both sets of events A and B gives us:

$$\begin{aligned} P\left( (\cup _k A_k) \vee (\cup _l B_l)\right) < \min \left(1,\sum _{k=1}^{K} \frac{\delta }{K}+\sum _{l=1}^{K} \frac{\delta _{s}}{K}\right) = \min (1,\delta +\delta _{s}) . \end{aligned}$$

That is, \(P \left( \exists k,l \text { s.t. } A_k\vee B_l\right) <\min (1,\delta +\delta _{s})\). Taking the complement, \(P\left( \forall k,l, {\overline{A}}_k\wedge {\overline{B}}_l\right) \ge \max (0,1-\delta -\delta _{s})\). But \(\forall k,l, {\overline{A}}_k\wedge {\overline{B}}_l\) implies that for all \(k\):

$$\begin{aligned} (1-\gamma )(\left| E_{{\mathbb {X}}}[\phi _k] - {\hat{\phi }}^{1:i}_{k} \right| +\left| {\hat{\phi }}^{1:i}_{k} - {\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k} \right| ) \le \varepsilon + \varepsilon _{s} \end{aligned}$$

Writing \(\varepsilon +\varepsilon _{s}=2\varepsilon _{l}\) and \(\delta +\delta _{s} = \delta _{l}\), we get

$$\begin{aligned}&P\big (\forall k,(1-\gamma ) \left(\left| E_{{\mathbb {X}}}[\phi _k] - {\hat{\phi }}^{1:i}_{k} \right| +\left| {\hat{\phi }}^{1:i}_{k} - {\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k} \right| \right) \\&\quad \le 2\varepsilon _{l} \big ) \ge \max (0,1-\delta _{l}). \end{aligned}$$

Using the triangle inequality \(\left| E_{{\mathbb {X}}}[\phi _k] - {\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k} \right| \le \big | E_{{\mathbb {X}}}[\phi _k] - {\hat{\phi }}^{1:i}_{k} \big | + \big | {\hat{\phi }}^{1:i}_{k} - {\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k}\big |\), we get:

$$\begin{aligned} P \bigg (\forall k, (1-\gamma ) \left(\left| E_{{\mathbb {X}}}[\phi _k] - {\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k} \right| \right) \le 2\varepsilon _{l} \bigg ) \ge \max (0,1-\delta _{l}). \end{aligned}$$


Proof of Theorem 2

The latent maximum entropy IRL problem is equivalent to \(\max _{{\varvec{\theta }}} \sum \nolimits _{Y \in {\mathscr {Y}}_{1:i} }{\tilde{Pr}}(Y) \log Pr(Y|{\varvec{\theta }})\) (Sect. 3.3, [12]), i.e., \(\max _{{\varvec{\theta }}} LL({\varvec{\theta }}|{\mathscr {Y}}{}_i,|{\mathscr {Y}}{}_{i-1}|,{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1}},{\varvec{\theta }}^{i-1})\).

The relaxed-constraint latent maximum entropy IRL problem is:

$$\begin{aligned} \begin{array}{l} \max \limits _{\varDelta } \left( -\sum \nolimits _{X \in {\mathbb {X}}} Pr(X)~ log~Pr(X) \right) \\ \\ {\text{ subject to}} ~~~ \sum \nolimits _{X \in {\mathbb {X}}} Pr(X) = 1 \\ \left| E_{{\mathbb {X}}}[\phi _k] - {\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k} \right| \le \beta _k ~~~~~~ \forall k\in \{1 \ldots K \} \end{array} \end{aligned}$$

Here \(\beta \in {\mathbb {R}}^K\) is an estimated vector of upper bounds on the differences between \(E_{{\mathbb {X}}}[\phi _k]\) and \({\hat{\phi }}^{Z|Y,1:i}_{{\varvec{\theta }}^i,k}\).

The form of the relaxed latent maximum entropy problem, and the likelihood for that problem, are no different from those of the relaxed maximum entropy problem. Starting from the result of Lemma 2, assuming \(\beta _k =\beta _c = 2\varepsilon _{l}/(1-\gamma )\) for all \(k \in \{1 \ldots K\}\), and using steps similar to the proof of Theorem 1, we get \(LL({\varvec{\theta }}_E|{\mathscr {Y}}{}_{1:i}) - LL({\varvec{\theta }}^i |{\mathscr {Y}}{}_i,|{\mathscr {Y}}{}_{i-1}|,{\hat{\phi }}^{Z|Y,1:i-1}_{{\varvec{\theta }}^{i-1}},{\varvec{\theta }}^{i-1}) \le \frac{4 K \varepsilon _{l}}{1-\gamma }\) with probability at least \(\max (0,1-\delta _{l})\). \(\square\)

Proof of Theorem 3

The log-loss after the ith session is \(LL({\varvec{\theta }}_E|{\mathscr {X}}{}_{1:i}) - LL({\varvec{\theta }}^i |{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1})\). Let the regret after \(T\) sessions be \(\frac{1}{T} \sum _{i=1}^{T} LL({\varvec{\theta }}_E|{\mathscr {X}}{}_{1:i}) - \frac{1}{T} \sum _{i=1}^{T} LL({\varvec{\theta }}^i |{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1})\).

According to Theorem 1, \(LL({\varvec{\theta }}_E|{\mathscr {X}}{}_{1:i}) - LL({\varvec{\theta }}^i |{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) \le \beta \varepsilon\) with probability at least \(\max (0,1-\delta )\), where \(\beta =\frac{2 K }{1-\gamma }\). As \(\varepsilon\) is user-specified, let \(\varepsilon = \frac{c}{i}\) for the ith session. Then, the inequality becomes \(LL({\varvec{\theta }}_E|{\mathscr {X}}{}_{1:i}) - LL({\varvec{\theta }}^i |{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) \le \beta \frac{c}{i}\). Summing this result over \(i\in \{1,2,\ldots ,T\}\) and dividing by T, we get

$$\begin{aligned} \frac{1}{T} \sum _{i=1}^{T} LL({\varvec{\theta }}_E|{\mathscr {X}}{}_{1:i}) - \frac{1}{T} \sum _{i=1}^{T} LL({\varvec{\theta }}^i |{\mathscr {X}}{}_i,|{\mathscr {X}}{}_{i-1}|,{\hat{\phi }}^{1:i-1},{\varvec{\theta }}^{i-1}) \le \beta \frac{1}{T} \sum _{i=1}^{T} \frac{c}{i} \end{aligned}$$

The RHS above is bounded using the harmonic-sum inequality \(\sum _{i=1}^{T} \frac{1}{i} \le 1+\ln T\): \(\beta \frac{1}{T} \sum _{i=1}^{T} \frac{c}{i} \le \beta c \frac{1+\ln T}{T}\). As \(T \rightarrow \infty\), \(\beta c \frac{1+\ln T}{T} \rightarrow 0\). Therefore, as sessions progress, the regret is guaranteed to vanish. \(\square\)
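The vanishing-regret argument can be checked numerically, using the harmonic-sum bound \(\sum _{i=1}^{T} 1/i \le 1+\ln T\). A minimal sketch (the values of K, \(\gamma\), and c are illustrative):

```python
import math

def regret_bound(T, beta, c):
    """beta * (c / T) * sum_{i=1}^{T} 1/i, which is <= beta*c*(1 + ln T)/T."""
    harmonic = sum(1.0 / i for i in range(1, T + 1))
    return beta * c * harmonic / T

beta, c = 2 * 5 / (1 - 0.9), 1.0   # e.g. K = 5, gamma = 0.9 (illustrative)
bounds = [regret_bound(T, beta, c) for T in (10, 100, 1000, 10000)]

# the bound is decreasing in T and dominated by beta*c*(1 + ln T)/T
assert all(a > b for a, b in zip(bounds, bounds[1:]))
assert all(regret_bound(T, beta, c) <= beta * c * (1 + math.log(T)) / T
           for T in (10, 100, 1000))
```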


About this article


Cite this article

Arora, S., Doshi, P. & Banerjee, B. I2RL: online inverse reinforcement learning under occlusion. Auton Agent Multi-Agent Syst 35, 4 (2021). https://doi.org/10.1007/s10458-020-09485-4



Keywords

  • Robot learning
  • Online learning
  • Robotics
  • Reinforcement learning
  • Inverse reinforcement learning