IOB: integrating optimization transfer and behavior transfer for multi-policy reuse

Autonomous Agents and Multi-Agent Systems

Abstract

Humans have the ability to reuse previously learned policies to solve new tasks quickly, and reinforcement learning (RL) agents can do the same by transferring knowledge from source policies to a related target task. Transfer RL methods can reshape the policy optimization objective (optimization transfer) or influence the behavior policy (behavior transfer) using source policies. However, selecting the appropriate source policy with limited samples to guide target policy learning has been a challenge. Previous methods introduce additional components, such as hierarchical policies or estimations of source policies’ value functions, which can lead to non-stationary policy optimization or heavy sampling costs, diminishing transfer effectiveness. To address this challenge, we propose a novel transfer RL method that selects the source policy without training extra components. Our method utilizes the Q function in the actor-critic framework to guide policy selection, choosing the source policy with the largest one-step improvement over the current target policy. We integrate optimization transfer and behavior transfer (IOB) by regularizing the learned policy to mimic the guidance policy and combining them as the behavior policy. This integration significantly enhances transfer effectiveness, surpasses state-of-the-art transfer RL baselines in benchmark tasks, and improves final performance and knowledge transferability in continual learning scenarios. Additionally, we show that our optimization transfer technique is guaranteed to improve target policy learning.
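
To make the mechanisms described above concrete, the sketch below illustrates how Q-guided source-policy selection, the KL regularization toward the guidance policy (optimization transfer), and a mixed behavior policy (behavior transfer) might be wired into a SAC-style actor-critic agent. This is a minimal sketch, not the authors' implementation: the helper names, the policy/critic interfaces (dist, rsample_with_log_prob), and the epsilon-mixing rule are illustrative assumptions.

# Illustrative sketch only; interfaces and mixing rule are assumed, not taken from the paper.
import random
import torch

def select_guidance_policy(critic, target_policy, source_policies, states):
    """Choose the policy with the largest estimated one-step improvement,
    i.e. the highest expected Q-value under the current critic, among the
    source policies and the current target policy."""
    candidates = [target_policy] + list(source_policies)
    scores = []
    for policy in candidates:
        actions = policy.dist(states).sample()            # a ~ pi(.|s)  (assumed interface)
        scores.append(critic(states, actions).mean())     # estimate of E_a[Q(s, a)]
    return candidates[int(torch.stack(scores).argmax())]

def actor_loss(critic, target_policy, guidance_policy, states, alpha=0.2, beta=1.0):
    """Optimization transfer: the usual entropy-regularized actor objective
    plus a KL term pulling the target policy toward the guidance policy."""
    actions, log_probs = target_policy.rsample_with_log_prob(states)  # assumed interface
    q_values = critic(states, actions)
    kl = torch.distributions.kl_divergence(
        target_policy.dist(states), guidance_policy.dist(states))
    return (alpha * log_probs - q_values + beta * kl).mean()

def behavior_action(target_policy, guidance_policy, state, epsilon=0.5):
    """Behavior transfer: collect data with a mixture of the guidance policy
    and the learned target policy (the mixing rule here is a placeholder)."""
    policy = guidance_policy if random.random() < epsilon else target_policy
    return policy.dist(state).sample()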


References

  1. Guberman, S. R., & Greenfield, P. M. (1991). Learning and transfer in everyday cognition. Cognitive Development, 6(3), 233–260.


  2. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., & Bolton, A. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359.


  3. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., & Georgiev, P. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354.


  4. Ceron, J.S.O., & Castro, P.S. (2021). Revisiting rainbow: Promoting more insightful and inclusive deep reinforcement learning research. In: International Conference on Machine Learning, pp. 1373–1383 . PMLR

  5. Fernández, F., & Veloso, M. (2006). Probabilistic policy reuse in a reinforcement learning agent. In: Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 720–727

  6. Barreto, A., Borsa, D., Quan, J., Schaul, T., Silver, D., Hessel, M., Mankowitz, D., Zidek, A., & Munos, R. (2018) Transfer in deep reinforcement learning using successor features and generalised policy improvement. In: International Conference on Machine Learning, pp. 501–510. PMLR

  7. Li, S., Wang, R., Tang, M., & Zhang, C. (2019). Hierarchical reinforcement learning with advantage-based auxiliary rewards. Advances in Neural Information Processing Systems 32

  8. Yang, T., Hao, J., Meng, Z., Zhang, Z., Hu, Y., Chen, Y., Fan, C., Wang, W., Liu, W., Wang, Z., & Peng, J. (2020). Efficient deep reinforcement learning via adaptive policy transfer. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 3094–3100

  9. Zhang, J., Li, S., & Zhang, C. (2022). CUP: Critic-guided policy reuse. In: Advances in Neural Information Processing Systems

  10. Li, S., Gu, F., Zhu, G., & Zhang, C. (2018). Context-aware policy reuse. arXiv preprint arXiv:1806.03793

  11. Teh, Y., Bapst, V., Czarnecki, W.M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., & Pascanu, R. (2017). Distral: Robust multitask reinforcement learning. Advances in Neural Information Processing Systems 30

  12. Barreto, A., Dabney, W., Munos, R., Hunt, J.J., Schaul, T., Hasselt, H.P., & Silver, D. (2017). Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems 30

  13. Cheng, C.-A., Kolobov, A., & Agarwal, A. (2020). Policy improvement via imitation of multiple oracles. Advances in Neural Information Processing Systems, 33, 5587–5598.


  14. Pateria, S., Subagdja, B., Tan, A.-H., & Quek, C. (2021). Hierarchical reinforcement learning: A comprehensive survey. ACM Computing Surveys (CSUR), 54(5), 1–35.


  15. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In: ICLR (Poster)

  16. Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596. PMLR

  17. Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. (2018). Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905

  18. Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., & Levine, S. (2020). Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In: Conference on Robot Learning, pp. 1094–1100. PMLR

  19. Zhu, Z., Lin, K., & Zhou, J. (2020). Transfer learning in deep reinforcement learning: A survey. arXiv preprint arXiv:2009.07888

  20. Parisotto, E., Ba, J.L., & Salakhutdinov, R. (2015). Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342

  21. Hou, Y., Ong, Y.-S., Feng, L., & Zurada, J. M. (2017). An evolutionary transfer reinforcement learning framework for multiagent systems. IEEE Transactions on Evolutionary Computation, 21(4), 601–615.


  22. Laroche, R., & Barlier, M. (2017). Transfer reinforcement learning with shared dynamics. In: Thirty-First AAAI Conference on Artificial Intelligence

  23. Lehnert, L., & Littman, M. L. (2020). Successor features combine elements of model-free and model-based reinforcement learning. Journal of Machine Learning Research, 21(196), 1–53.


  24. Barekatain, M., Yonetani, R., & Hamaya, M. (2020). Multipolar: Multi-source policy aggregation for transfer reinforcement learning between diverse environmental dynamics. In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 3108–3116

  25. Li, S., & Zhang, C. (2018) An optimal online method of selecting source policies for reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32

  26. Gimelfarb, M., Sanner, S., & Lee, C.-G. (2021). Contextual policy transfer in reinforcement learning domains via deep mixtures-of-experts. In: Uncertainty in Artificial Intelligence, pp. 1787–1797. PMLR

  27. Yang, X., Ji, Z., Wu, J., Lai, Y.-K., Wei, C., Liu, G., & Setchi, R. (2021). Hierarchical reinforcement learning with universal policies for multistep robotic manipulation. IEEE Transactions on Neural Networks and Learning Systems, 33(9), 4727–4741.


  28. Rusu, A.A., Rabinowitz, N.C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., & Hadsell, R. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671

  29. Berseth, G., Xie, C., Cernek, P., & Panne, M. (2018). Progressive reinforcement learning with distillation for multi-skilled motion control. arXiv preprint arXiv:1802.04765

  30. Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y.W., Pascanu, R., & Hadsell, R. (2018). Progress & compress: A scalable framework for continual learning. In: International Conference on Machine Learning, pp. 4528–4537. PMLR

  31. Mallya, A., & Lazebnik, S. (2018). Packnet: Adding multiple tasks to a single network by iterative pruning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7765–7773

  32. Duan, Y., Schulman, J., Chen, X., Bartlett, P.L., Sutskever, I., & Abbeel, P. (2016). RL\(^2\): Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779

  33. Finn, C., Abbeel, P., & Levine, S. (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, pp. 1126–1135. PMLR

  34. Rakelly, K., Zhou, A., Finn, C., Levine, S., & Quillen, D. (2019). Efficient off-policy meta-reinforcement learning via probabilistic context variables. In: International Conference on Machine Learning, pp. 5331–5340. PMLR

  35. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, pp. 1861–1870. PMLR

  36. Lan, Q., Pan, Y., Fyshe, A., & White, M. (2020). Maxmin Q-learning: Controlling the estimation bias of Q-learning. arXiv preprint arXiv:2002.06487

  37. Kuznetsov, A., Shvechikov, P., Grishin, A., & Vetrov, D. (2020). Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In: International Conference on Machine Learning, pp. 5556–5566. PMLR

  38. Zhang, S., & Sutton, R.S. (2017). A deeper look at experience replay. arXiv preprint arXiv:1712.01275

  39. Fedus, W., Ramachandran, P., Agarwal, R., Bengio, Y., Larochelle, H., Rowland, M., & Dabney, W. (2020). Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, pp. 3061–3071. PMLR

  40. Khetarpal, K., Riemer, M., Rish, I., & Precup, D. (2020). Towards continual reinforcement learning: A review and perspectives. arXiv preprint arXiv:2012.13490

  41. Wolczyk, M., Zając, M., Pascanu, R., Kucinski, L., & Milos, P. (2021). Continual World: A robotic benchmark for continual reinforcement learning. Advances in Neural Information Processing Systems, 34, 28496–28510.


  42. Yang, R., Xu, H., Wu, Y., & Wang, X. (2020). Multi-task reinforcement learning with soft modularization. Advances in Neural Information Processing Systems, 33, 4767–4777.


  43. Sodhani, S., Zhang, A., & Pineau, J. (2021). Multi-task reinforcement learning with context-based representations. In: International Conference on Machine Learning, pp. 9767–9779. PMLR

  44. Wan, M., Gangwani, T., & Peng, J. (2020) Mutual information based knowledge transfer under state-action dimension mismatch. arXiv preprint arXiv:2006.07041

  45. Zhang, Q., Xiao, T., Efros, A.A., Pinto, L., & Wang, X. (2020). Learning cross-domain correspondence for control with dynamics cycle-consistency. arXiv preprint arXiv:2012.09811

  46. Heng, Y., Yang, T., Zheng, Y., Hao, J., & Taylor, M.E. (2022). Cross-domain adaptive transfer reinforcement learning based on state-action correspondence. In: The 38th Conference on Uncertainty in Artificial Intelligence

  47. Pol, E., Worrall, D., Hoof, H., Oliehoek, F., & Welling, M. (2020). MDP homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems, 33, 4199–4210.


  48. Pol, E., Kipf, T., Oliehoek, F.A., & Welling, M. (2020). Plannable approximations to MDP homomorphisms: Equivariance under actions. arXiv preprint arXiv:2002.11963

  49. Fedotov, A. A., Harremoës, P., & Topsoe, F. (2003). Refinements of Pinsker's inequality. IEEE Transactions on Information Theory, 49(6), 1491–1498.


  50. Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning

Download references

Acknowledgements

This work is supported by the Key Program of the National Natural Science Foundation of China (Grant No. 51935005), the National Natural Science Foundation of China (Grant No. 62306088), the Basic Research Project (Grant No. JCKY20200603C010), the Natural Science Foundation of Heilongjiang Province of China (Grant No. LH2021F023), the Science and Technology Planning Project of Heilongjiang Province of China (Grant No. GA21C031), and the China Academy of Launch Vehicle Technology (CALT2022-18).

Author information

Contributions

SL and HL wrote the main manuscript text, while JZ performed the experiments. PL provided funding support and contributed to manuscript revisions. ZW and CZ revised the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Chongjie Zhang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A.1 Proof of Theorem 1

Proof

As \(|\widetilde{Q}_{\pi _{tar}}(s,a)-{Q}_{\pi _{tar}}(s,a)|\le \mu \) for all \(s \in \mathcal {S}\), \(a \in \mathcal {A}\), we have that for all \(s \in \mathcal {S}\), the difference between the true value function \(V_{\pi _{tar}}\) and the approximated value function \(\widetilde{V}_{\pi _{tar}}\) is bounded:

$$\begin{aligned}&V_{\pi _{tar}}(s)\\&=\mathbb {E}_{a \sim \pi _{tar}(\cdot |s)}\left[ Q_{\pi _{tar}}(s,a)-\alpha \log \pi _{tar}(a|s)\right] \\&\le \mathbb {E}_{a \sim \pi _{tar}(\cdot |s)}\left[ \widetilde{Q}_{\pi _{tar}}(s,a)-\alpha \log \pi _{tar}(a|s)+\mu \right] \\&= \widetilde{V}_{\pi _{tar}}(s)+ \mu . \end{aligned}$$

As \(\pi _{tar}(\cdot |s)\) is contained in \(\Pi ^s\), with \(\widetilde{\pi }_{g}\) defined in Eq. (9), it follows directly that for all \(s \in \mathcal {S}\),

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{a \sim \widetilde{\pi }_{g}(\cdot |s)}\left[ \widetilde{Q}_{\pi _{tar}}(s,a)-\alpha \log \widetilde{\pi }_g(a|s)\right] \ge \\&\mathbb {E}_{a \sim \pi _{tar}(\cdot |s)}\left[ \widetilde{Q}_{\pi _{tar}}(s,a)-\alpha \log \pi _{tar}(a|s)\right] =\widetilde{V}_{\pi _{tar}}(s). \end{aligned} \end{aligned}$$
(1)

Then for all \(s_i \in \mathcal {S}\),

$$\begin{aligned}&V_{\pi _{tar}}(s_i) \le \widetilde{V}_{\pi _{tar}}(s_i)+ \mu \\&\le \mathbb {E}_{a_i \sim \widetilde{\pi }_{g}(\cdot |s_i)}[\widetilde{Q}_{\pi _{tar}}(s_i,a_i)-\alpha \log \widetilde{\pi }_{g}(a_i|s_i)] + \mu \\&\le \mathbb {E}_{a_i \sim \widetilde{\pi }_{g}(\cdot |s_i)}[{Q}_{\pi _{tar}}(s_i,a_i)-\alpha \log \widetilde{\pi }_{g}(a_i|s_i)]+ 2\mu \\&=\mathbb {E}_{a_i \sim \widetilde{\pi }_{g}(\cdot |s_i)}[r(s_i,a_i)-\alpha \log \widetilde{\pi }_{g}(a_i|s_i) +\gamma V_{\pi _{tar}}(s_{i+1})] \\&\ \ \ + 2\mu \\&\vdots \\&\le \mathbb {E}_{\widetilde{\pi }_{g}}\left[ \sum _{\tau =0}^{\infty }\gamma ^{\tau }\left( r(s_{i+\tau }, a_{i+\tau })-\alpha \log \widetilde{\pi }_{g}(a_{i+\tau }|s_{i+\tau })\right) \right] \\&\ \ \ + 2 \sum _{\tau =0}^{\infty }\gamma ^{\tau }\mu \\&=V_{\widetilde{\pi }_{g}} (s_i)+\frac{2\mu }{1-\gamma }. \end{aligned}$$
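
For readability, the claim established by this argument (reconstructed here from the premise and the final inequality; the formal statement of Theorem 1 appears in the main text) is the following: if \(|\widetilde{Q}_{\pi _{tar}}(s,a)-Q_{\pi _{tar}}(s,a)|\le \mu \) for all \(s \in \mathcal {S}\), \(a \in \mathcal {A}\), then the guidance policy \(\widetilde{\pi }_{g}\) selected through the approximate critic satisfies

$$\begin{aligned} V_{\widetilde{\pi }_{g}}(s) \ge V_{\pi _{tar}}(s)-\frac{2\mu }{1-\gamma } \quad \text {for all } s \in \mathcal {S}, \end{aligned}$$

i.e., up to an error of \(2\mu /(1-\gamma )\) induced by critic approximation, the guidance policy is no worse than the current target policy.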

A.2 Proof of Theorem 2

Proof

According to Pinsker's inequality [49], \(D_{KL}(\pi _{tar}^{l+1}(\cdot |s)||\widetilde{\pi }_{g}^l(\cdot |s))\ge \frac{1}{2\ln 2}||\pi _{tar}^{l+1}(\cdot |s)-\widetilde{\pi }_{g}^l(\cdot |s)||_1^2\), where \(||\cdot ||_1\) denotes the \(L_1\) norm. Since \(D_{KL}(\pi _{tar}^{l+1}(\cdot |s)||\widetilde{\pi }_{g}^l(\cdot |s))\le \delta \) for all \(s \in \mathcal {S}\) by assumption, it follows that \(||\pi _{tar}^{l+1}(\cdot |s)-\widetilde{\pi }_{g}^l(\cdot |s)||_1 \le \sqrt{2\ln 2\,\delta }\) for all \(s \in \mathcal {S}\). According to the Performance Difference Lemma [50], we have that for all \(s \in \mathcal {S}\):

$$\begin{aligned}&V_{\widetilde{\pi }_{g}^l}(s)-V_{\pi _{tar}^{l+1}}(s) \\&= \frac{1}{1-\gamma }\mathbb {E}_{s' \sim \rho _{s}^{\widetilde{\pi }_{g}^l}} \big [\mathbb {E}_{a \sim \widetilde{\pi }_{g}^l(\cdot |s')}[Q_{\pi _{tar}^{l+1}}(s',a)-\alpha \log \widetilde{\pi }_{g}^l(a|s')]\\&\ \ \ -\mathbb {E}_{a \sim \pi _{tar}^{l+1}(\cdot |s')}[Q_{\pi _{tar}^{l+1}}(s',a) -\alpha \log \pi _{tar}^{l+1}(a|s')]\big ] \\&\le \frac{1}{1-\gamma }\max \limits _{s' \in \mathcal {S}}\big [\mathbb {E}_{a \sim \widetilde{\pi }_{g}^l(\cdot |s')}[Q_{\pi _{tar}^{l+1}}(s',a)] \\&\ \ \ -\mathbb {E}_{a \sim \pi _{tar}^{l+1}(\cdot |s')}[Q_{\pi _{tar}^{l+1}}(s',a)]\big ] \\&\ \ \ +\frac{\alpha }{1-\gamma }\max \limits _{s'' \in \mathcal {S}}\left| \mathcal {H} (\widetilde{\pi }_{g}^l(\cdot |s''))-\mathcal {H}(\pi _{tar}^{l+1}(\cdot |s''))\right| \\&= \frac{1}{1-\gamma }\max \limits _{s' \in \mathcal {S}}\int \left( \widetilde{\pi }_{g}^l(a|s') -\pi _{tar}^{l+1}(a|s')\right) Q_{\pi _{tar}^{l+1}}(s',a)\,da \\&\ \ \ +\frac{\alpha }{1-\gamma }\widetilde{\mathcal {H}}_{max} \\&\le \frac{1}{1-\gamma } \max \limits _{s' \in \mathcal {S}}\int \left| \widetilde{\pi }_{g}^l(a|s') -\pi _{tar}^{l+1}(a|s')\right| \cdot \left| Q_{\pi _{tar}^{l+1}}(s',a)\right| da \\&\ \ \ +\frac{\alpha }{1-\gamma }\widetilde{\mathcal {H}}_{max} \\&\le \frac{1}{1-\gamma } \max \limits _{s' \in \mathcal {S}}\int \left| \widetilde{\pi }_{g}^l(a|s') -\pi _{tar}^{l+1}(a|s')\right| \cdot \frac{\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1}}{1-\gamma }\,da \\&\ \ \ +\frac{\alpha }{1-\gamma }\widetilde{\mathcal {H}}_{max} \end{aligned}$$
$$\begin{aligned}&= \frac{\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1}}{(1-\gamma )^2}\max \limits _{s' \in \mathcal {S}}||\widetilde{\pi }_{g}^l(\cdot |s')-\pi _{tar}^{l+1}(\cdot |s')||_1 \nonumber \\&\ \ \ +\frac{\alpha }{1-\gamma }\widetilde{\mathcal {H}}_{max} \nonumber \\&\le \frac{\sqrt{2\ln 2\,\delta }\,(\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1}) +\alpha (1-\gamma )\widetilde{\mathcal {H}}_{max}}{(1-\gamma )^2}, \end{aligned}$$
(2)

where \(\rho _{s}^{\widetilde{\pi }_{g}^l}(s')=(1-\gamma )\sum _{t=0}^{\infty }\gamma ^tp(s_t=s'|s_0=s,\widetilde{\pi }_{g}^l)\) is the normalized discounted state occupancy distribution. Note that

$$\begin{aligned}&|Q_{\pi _{tar}^{l+1}}(s,a)|\nonumber \\&=\Big |\mathbb {E}_{\pi _{tar}^{l+1}}\Big [\sum _{i=0}^{\infty }\gamma ^i\big (r(s_{\tau +i},a_{\tau +i}) \nonumber \\&\ \ \ -\alpha \log \pi _{tar}^{l+1}(a_{\tau +i}|s_{\tau +i})\big )\,\Big |\,s_\tau =s,a_\tau =a\Big ]\Big | \nonumber \\&\le \mathbb {E}_{\pi _{tar}^{l+1}}\Big [\sum _{i=0}^{\infty }\gamma ^i\big (\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1}\big )\Big ] \end{aligned}$$
(3)
$$\begin{aligned}&=\frac{\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1}}{1-\gamma }. \end{aligned}$$
(4)

Combining (2) with Theorem 1, we finally obtain

$$\begin{aligned}&V_{\pi _{tar}^{l+1}}(s) \nonumber \\&\ge V_{\widetilde{\pi }_{g}^l}(s) - \frac{\sqrt{2\ln 2\delta }(\widetilde{R}_{max} +\alpha \mathcal {H}_{max}^{l+1})+\alpha (1-\gamma )\widetilde{\mathcal {H}}_{max}}{(1-\gamma )^2} \nonumber \\&\ge V_{\pi _{tar}^l}(s)-\frac{\sqrt{2\ln 2\delta }(\widetilde{R}_{max} +\alpha \mathcal {H}_{max}^{l+1})}{(1-\gamma )^2}- \frac{2\mu +\alpha \widetilde{\mathcal {H}}_{max}}{1-\gamma }. \end{aligned}$$
(5)
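
In words (reconstructed from the assumptions used above and the bound in Eq. (5); the formal statement of Theorem 2 appears in the main text): if each update keeps \(D_{KL}(\pi _{tar}^{l+1}(\cdot |s)||\widetilde{\pi }_{g}^l(\cdot |s))\le \delta \) at every state and the critic error is bounded by \(\mu \) as in Theorem 1, then one iteration of the optimization-transfer update can decrease the target policy's value by at most

$$\begin{aligned} \frac{\sqrt{2\ln 2\,\delta }\,(\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1})}{(1-\gamma )^2}+\frac{2\mu +\alpha \widetilde{\mathcal {H}}_{max}}{1-\gamma }, \end{aligned}$$

which quantifies the approximate policy improvement guarantee referred to in the abstract.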

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, S., Li, H., Zhang, J. et al. IOB: integrating optimization transfer and behavior transfer for multi-policy reuse. Auton Agent Multi-Agent Syst 38, 3 (2024). https://doi.org/10.1007/s10458-023-09630-9

  • DOI: https://doi.org/10.1007/s10458-023-09630-9
