IOB: integrating optimization transfer and behavior transfer for multi-policy reuse

Autonomous Agents and Multi-Agent Systems

Abstract

Humans have the ability to reuse previously learned policies to solve new tasks quickly, and reinforcement learning (RL) agents can do the same by transferring knowledge from source policies to a related target task. Transfer RL methods can reshape the policy optimization objective (optimization transfer) or influence the behavior policy (behavior transfer) using source policies. However, selecting the appropriate source policy with limited samples to guide target policy learning has been a challenge. Previous methods introduce additional components, such as hierarchical policies or estimations of source policies’ value functions, which can lead to non-stationary policy optimization or heavy sampling costs, diminishing transfer effectiveness. To address this challenge, we propose a novel transfer RL method that selects the source policy without training extra components. Our method utilizes the Q function in the actor-critic framework to guide policy selection, choosing the source policy with the largest one-step improvement over the current target policy. We integrate optimization transfer and behavior transfer (IOB) by regularizing the learned policy to mimic the guidance policy and combining them as the behavior policy. This integration significantly enhances transfer effectiveness, surpasses state-of-the-art transfer RL baselines in benchmark tasks, and improves final performance and knowledge transferability in continual learning scenarios. Additionally, we show that our optimization transfer technique is guaranteed to improve target policy learning.
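
To make the mechanisms described above concrete, the sketch below illustrates how Q-guided source-policy selection, the KL regularization toward the guidance policy (optimization transfer), and a mixed behavior policy (behavior transfer) might be wired into a SAC-style actor-critic agent. This is a minimal sketch, not the authors' implementation: the helper names, the policy/critic interfaces (dist, rsample_with_log_prob), and the epsilon-mixing rule are illustrative assumptions.

# Illustrative sketch only; interfaces and mixing rule are assumed, not taken from the paper.
import random
import torch

def select_guidance_policy(critic, target_policy, source_policies, states):
    """Choose the policy with the largest estimated one-step improvement,
    i.e. the highest expected Q-value under the current critic, among the
    source policies and the current target policy."""
    candidates = [target_policy] + list(source_policies)
    scores = []
    for policy in candidates:
        actions = policy.dist(states).sample()            # a ~ pi(.|s)  (assumed interface)
        scores.append(critic(states, actions).mean())     # estimate of E_a[Q(s, a)]
    return candidates[int(torch.stack(scores).argmax())]

def actor_loss(critic, target_policy, guidance_policy, states, alpha=0.2, beta=1.0):
    """Optimization transfer: the usual entropy-regularized actor objective
    plus a KL term pulling the target policy toward the guidance policy."""
    actions, log_probs = target_policy.rsample_with_log_prob(states)  # assumed interface
    q_values = critic(states, actions)
    kl = torch.distributions.kl_divergence(
        target_policy.dist(states), guidance_policy.dist(states))
    return (alpha * log_probs - q_values + beta * kl).mean()

def behavior_action(target_policy, guidance_policy, state, epsilon=0.5):
    """Behavior transfer: collect data with a mixture of the guidance policy
    and the learned target policy (the mixing rule here is a placeholder)."""
    policy = guidance_policy if random.random() < epsilon else target_policy
    return policy.dist(state).sample()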


References

  1. Guberman, S. R., & Greenfield, P. M. (1991). Learning and transfer in everyday cognition. Cognitive Development, 6(3), 233–260.


  2. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., & Bolton, A. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359.


  3. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., & Georgiev, P. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354.


  4. Ceron, J.S.O., & Castro, P.S. (2021). Revisiting rainbow: Promoting more insightful and inclusive deep reinforcement learning research. In: International Conference on Machine Learning, pp. 1373–1383 . PMLR

  5. Fernández, F., & Veloso, M. (2006). Probabilistic policy reuse in a reinforcement learning agent. In: Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 720–727

  6. Barreto, A., Borsa, D., Quan, J., Schaul, T., Silver, D., Hessel, M., Mankowitz, D., Zidek, A., & Munos, R. (2018) Transfer in deep reinforcement learning using successor features and generalised policy improvement. In: International Conference on Machine Learning, pp. 501–510. PMLR

  7. Li, S., Wang, R., Tang, M., & Zhang, C. (2019). Hierarchical reinforcement learning with advantage-based auxiliary rewards. Advances in Neural Information Processing Systems 32

  8. Yang, T., Hao, J., Meng, Z., Zhang, Z., Hu, Y., Chen, Y., Fan, C., Wang, W., Liu, W., Wang, Z., & Peng, J. (2020). Efficient deep reinforcement learning via adaptive policy transfer. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 3094–3100

  9. Zhang, J., Li, S., & Zhang, C. (2022). CUP: Critic-guided policy reuse. In: Advances in Neural Information Processing Systems

  10. Li, S., Gu, F., Zhu, G., & Zhang, C. (2018). Context-aware policy reuse. arXiv preprint arXiv:1806.03793

  11. Teh, Y., Bapst, V., Czarnecki, W.M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., & Pascanu, R. (2017). Distral: Robust multitask reinforcement learning. Advances in Neural Information Processing Systems 30

  12. Barreto, A., Dabney, W., Munos, R., Hunt, J.J., Schaul, T., Hasselt, H.P., & Silver, D. (2017). Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems 30

  13. Cheng, C.-A., Kolobov, A., & Agarwal, A. (2020). Policy improvement via imitation of multiple oracles. Advances in Neural Information Processing Systems, 33, 5587–5598.


  14. Pateria, S., Subagdja, B., Tan, A.-H., & Quek, C. (2021). Hierarchical reinforcement learning: A comprehensive survey. ACM Computing Surveys (CSUR), 54(5), 1–35.


  15. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In: ICLR (Poster)

  16. Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596. PMLR

  17. Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. (2018). Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905

  18. Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., & Levine, S. (2020). Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In: Conference on Robot Learning, pp. 1094–1100. PMLR

  19. Zhu, Z., Lin, K., & Zhou, J. (2020). Transfer learning in deep reinforcement learning: A survey. arXiv preprint arXiv:2009.07888

  20. Parisotto, E., Ba, J.L., & Salakhutdinov, R. (2015). Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342

  21. Hou, Y., Ong, Y.-S., Feng, L., & Zurada, J. M. (2017). An evolutionary transfer reinforcement learning framework for multiagent systems. IEEE Transactions on Evolutionary Computation, 21(4), 601–615.


  22. Laroche, R., & Barlier, M. (2017). Transfer reinforcement learning with shared dynamics. In: Thirty-First AAAI Conference on Artificial Intelligence

  23. Lehnert, L., & Littman, M. L. (2020). Successor features combine elements of model-free and model-based reinforcement learning. Journal of Machine Learning Research, 21(196), 1–53.


  24. Barekatain, M., Yonetani, R., & Hamaya, M. (2020). Multipolar: Multi-source policy aggregation for transfer reinforcement learning between diverse environmental dynamics. In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 3108–3116

  25. Li, S., & Zhang, C. (2018) An optimal online method of selecting source policies for reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32

  26. Gimelfarb, M., Sanner, S., & Lee, C.-G. (2021). Contextual policy transfer in reinforcement learning domains via deep mixtures-of-experts. In: Uncertainty in Artificial Intelligence, pp. 1787–1797. PMLR

  27. Yang, X., Ji, Z., Wu, J., Lai, Y.-K., Wei, C., Liu, G., & Setchi, R. (2021). Hierarchical reinforcement learning with universal policies for multistep robotic manipulation. IEEE Transactions on Neural Networks and Learning Systems, 33(9), 4727–4741.


  28. Rusu, A.A., Rabinowitz, N.C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., & Hadsell, R. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671

  29. Berseth, G., Xie, C., Cernek, P., & Panne, M. (2018). Progressive reinforcement learning with distillation for multi-skilled motion control. arXiv preprint arXiv:1802.04765

  30. Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y.W., Pascanu, R., & Hadsell, R. (2018). Progress & compress: A scalable framework for continual learning. In: International Conference on Machine Learning, pp. 4528–4537. PMLR

  31. Mallya, A., & Lazebnik, S. (2018). Packnet: Adding multiple tasks to a single network by iterative pruning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7765–7773

  32. Duan, Y., Schulman, J., Chen, X., Bartlett, P.L., Sutskever, I., & Abbeel, P. (2016). RL\(^2\): Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779

  33. Finn, C., Abbeel, P., & Levine, S. (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, pp. 1126–1135. PMLR

  34. Rakelly, K., Zhou, A., Finn, C., Levine, S., & Quillen, D. (2019). Efficient off-policy meta-reinforcement learning via probabilistic context variables. In: International Conference on Machine Learning, pp. 5331–5340. PMLR

  35. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, pp. 1861–1870. PMLR

  36. Lan, Q., Pan, Y., Fyshe, A., & White, M. (2020). Maxmin Q-learning: Controlling the estimation bias of Q-learning. arXiv preprint arXiv:2002.06487

  37. Kuznetsov, A., Shvechikov, P., Grishin, A., & Vetrov, D. (2020). Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In: International Conference on Machine Learning, pp. 5556–5566. PMLR

  38. Zhang, S., & Sutton, R.S. (2017). A deeper look at experience replay. arXiv preprint arXiv:1712.01275

  39. Fedus, W., Ramachandran, P., Agarwal, R., Bengio, Y., Larochelle, H., Rowland, M., & Dabney, W. (2020). Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, pp. 3061–3071. PMLR

  40. Khetarpal, K., Riemer, M., Rish, I., & Precup, D. (2020). Towards continual reinforcement learning: A review and perspectives. arXiv preprint arXiv:2012.13490

  41. Wolczyk, M., Zając, M., Pascanu, R., Kucinski, L., & Milos, P. (2021). Continual World: A robotic benchmark for continual reinforcement learning. Advances in Neural Information Processing Systems, 34, 28496–28510.


  42. Yang, R., Xu, H., Wu, Y., & Wang, X. (2020). Multi-task reinforcement learning with soft modularization. Advances in Neural Information Processing Systems, 33, 4767–4777.


  43. Sodhani, S., Zhang, A., & Pineau, J. (2021). Multi-task reinforcement learning with context-based representations. In: International Conference on Machine Learning, pp. 9767–9779. PMLR

  44. Wan, M., Gangwani, T., & Peng, J. (2020) Mutual information based knowledge transfer under state-action dimension mismatch. arXiv preprint arXiv:2006.07041

  45. Zhang, Q., Xiao, T., Efros, A.A., Pinto, L., & Wang, X. (2020). Learning cross-domain correspondence for control with dynamics cycle-consistency. arXiv preprint arXiv:2012.09811

  46. Heng, Y., Yang, T., Zheng, Y., Hao, J., & Taylor, M.E. (2022). Cross-domain adaptive transfer reinforcement learning based on state-action correspondence. In: The 38th Conference on Uncertainty in Artificial Intelligence

  47. Pol, E., Worrall, D., Hoof, H., Oliehoek, F., & Welling, M. (2020). MDP homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems, 33, 4199–4210.


  48. Pol, E., Kipf, T., Oliehoek, F.A., & Welling, M. (2020). Plannable approximations to MDP homomorphisms: Equivariance under actions. arXiv preprint arXiv:2002.11963

  49. Fedotov, A. A., Harremoës, P., & Topsoe, F. (2003). Refinements of Pinsker's inequality. IEEE Transactions on Information Theory, 49(6), 1491–1498.


  50. Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning

Download references

Acknowledgements

This work is supported by the Key Program of the National Natural Science Foundation of China (Grant No. 51935005), the National Natural Science Foundation of China (Grant No. 62306088), the Basic Research Project (Grant No. JCKY20200603C010), the Natural Science Foundation of Heilongjiang Province of China (Grant No. LH2021F023), the Science and Technology Planning Project of Heilongjiang Province of China (Grant No. GA21C031), and the China Academy of Launch Vehicle Technology (CALT2022-18).

Author information

Contributions

SL and HL wrote the main manuscript text, while JZ performed the experiments. PL provided funding support and contributed to manuscript revisions. ZW and CZ revised the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Chongjie Zhang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A.1 Proof of Theorem 1

Proof

As \(|\widetilde{Q}_{\pi _{tar}}(s,a)-{Q}_{\pi _{tar}}(s,a)|\le \mu \) for all \(s \in \mathcal {S}\), \(a \in \mathcal {A}\), we have that for all \(s \in \mathcal {S}\), the difference between the true value function \(V_{\pi _{tar}}\) and the approximated value function \(\widetilde{V}_{\pi _{tar}}\) is bounded:

$$\begin{aligned}&V_{\pi _{tar}}(s)\\&=\mathbb {E}_{a \sim \pi _{tar}(\cdot |s)}\left[ Q_{\pi _{tar}}(s,a)-\alpha \log \pi _{tar}(a|s)\right] \\&\le \mathbb {E}_{a \sim \pi _{tar}(\cdot |s)}\left[ \widetilde{Q}_{\pi _{tar}}(s,a)-\alpha \log \pi _{tar}(a|s)+\mu \right] \\&= \widetilde{V}_{\pi _{tar}}(s)+ \mu . \end{aligned}$$

As \(\pi _{tar}(\cdot |s)\) is contained in \(\Pi ^s\), with \(\widetilde{\pi }_{g}\) defined in Eq. (9), it follows directly that for all \(s \in \mathcal {S}\),

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{a \sim \widetilde{\pi }_{g}(\cdot |s)}\left[ \widetilde{Q}_{\pi _{tar}}(s,a)-\alpha \log \widetilde{\pi }_g(a|s)\right] \ge \\&\mathbb {E}_{a \sim \pi _{tar}(\cdot |s)}\left[ \widetilde{Q}_{\pi _{tar}}(s,a)-\alpha \log \pi _{tar}(a|s)\right] =\widetilde{V}_{\pi _{tar}}(s). \end{aligned} \end{aligned}$$
(1)

Then for all \(s_i \in \mathcal {S}\),

$$\begin{aligned}&V_{\pi _{tar}}(s_i) \le \widetilde{V}_{\pi _{tar}}(s_i)+ \mu \\&\le \mathbb {E}_{a_i \sim \widetilde{\pi }_{g}(\cdot |s_i)}[\widetilde{Q}_{\pi _{tar}}(s_i,a_i)-\alpha \log \widetilde{\pi }_{g}(a_i|s_i)] + \mu \\&\le \mathbb {E}_{a_i \sim \widetilde{\pi }_{g}(\cdot |s_i)}[{Q}_{\pi _{tar}}(s_i,a_i)-\alpha \log \widetilde{\pi }_{g}(a_i|s_i)]+ 2\mu \\&=\mathbb {E}_{a_i \sim \widetilde{\pi }_{g}(\cdot |s_i)}[r(s_i,a_i)-\alpha \log \widetilde{\pi }_{g}(a_i|s_i) +\gamma V_{\pi _{tar}}(s_{i+1})] \\&\ \ \ + 2\mu \\&\vdots \\&\le \mathbb {E}_{\widetilde{\pi }_{g}}\left[ \sum _{\tau =0}^{\infty }\gamma ^{\tau }\left( r(s_{i+\tau }, a_{i+\tau })-\alpha \log \widetilde{\pi }_{g}(a_{i+\tau }|s_{i+\tau })\right) \right] \\&\ \ \ + 2 \sum _{\tau =0}^{\infty }\gamma ^{\tau }\mu \\&=V_{\widetilde{\pi }_{g}} (s_i)+\frac{2\mu }{1-\gamma }. \end{aligned}$$
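
For readability, the claim established by this argument (reconstructed here from the premise and the final inequality; the formal statement of Theorem 1 appears in the main text) is the following: if \(|\widetilde{Q}_{\pi _{tar}}(s,a)-Q_{\pi _{tar}}(s,a)|\le \mu \) for all \(s \in \mathcal {S}\), \(a \in \mathcal {A}\), then the guidance policy \(\widetilde{\pi }_{g}\) selected through the approximate critic satisfies

$$\begin{aligned} V_{\widetilde{\pi }_{g}}(s) \ge V_{\pi _{tar}}(s)-\frac{2\mu }{1-\gamma } \quad \text {for all } s \in \mathcal {S}, \end{aligned}$$

i.e., up to an error of \(2\mu /(1-\gamma )\) induced by critic approximation, the guidance policy is no worse than the current target policy.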

A.2 Proof of Theorem 2

Proof

According to Pinsker's inequality [49], \(D_{KL}(\pi _{tar}^{l+1}(\cdot |s)||\widetilde{\pi }_{g}^l(\cdot |s))\ge \frac{1}{2\ln 2}||\pi _{tar}^{l+1}(\cdot |s)-\widetilde{\pi }_{g}^l(\cdot |s)||_1^2\), where \(||\cdot ||_1\) denotes the \(L_1\) norm. Since \(D_{KL}(\pi _{tar}^{l+1}(\cdot |s)||\widetilde{\pi }_{g}^l(\cdot |s))\le \delta \) for all \(s \in \mathcal {S}\) by assumption, it follows that \(||\pi _{tar}^{l+1}(\cdot |s)-\widetilde{\pi }_{g}^l(\cdot |s)||_1 \le \sqrt{2\ln 2\,\delta }\) for all \(s \in \mathcal {S}\). According to the Performance Difference Lemma [50], we have that for all \(s \in \mathcal {S}\):

$$\begin{aligned}&V_{\widetilde{\pi }_{g}^l}(s)-V_{\pi _{tar}^{l+1}}(s) \\&= \frac{1}{1-\gamma }\mathbb {E}_{s' \sim \rho _{s}^{\widetilde{\pi }_{g}^l}} \big [\mathbb {E}_{a \sim \widetilde{\pi }_{g}^l(\cdot |s')}[Q_{\pi _{tar}^{l+1}}(s',a)-\alpha \log \widetilde{\pi }_{g}^l(a|s')]\\&\ \ \ -\mathbb {E}_{a \sim \pi _{tar}^{l+1}(\cdot |s')}[Q_{\pi _{tar}^{l+1}}(s',a) -\alpha \log \pi _{tar}^{l+1}(a|s')]\big ] \\&\le \frac{1}{1-\gamma }\max \limits _{s' \in \mathcal {S}}\big [\mathbb {E}_{a \sim \widetilde{\pi }_{g}^l(\cdot |s')}[Q_{\pi _{tar}^{l+1}}(s',a)] \\&\ \ \ -\mathbb {E}_{a \sim \pi _{tar}^{l+1}(\cdot |s')}[Q_{\pi _{tar}^{l+1}}(s',a)]\big ] \\&\ \ \ +\frac{\alpha }{1-\gamma }\max \limits _{s'' \in \mathcal {S}}\left| \mathcal {H} (\widetilde{\pi }_{g}^l(\cdot |s''))-\mathcal {H}(\pi _{tar}^{l+1}(\cdot |s''))\right| \\&= \frac{1}{1-\gamma }\max \limits _{s' \in \mathcal {S}}\int \left( \widetilde{\pi }_{g}^l(a|s') -\pi _{tar}^{l+1}(a|s')\right) Q_{\pi _{tar}^{l+1}}(s',a)\,da \\&\ \ \ +\frac{\alpha }{1-\gamma }\widetilde{\mathcal {H}}_{max} \\&\le \frac{1}{1-\gamma } \max \limits _{s' \in \mathcal {S}}\int \left| \widetilde{\pi }_{g}^l(a|s') -\pi _{tar}^{l+1}(a|s')\right| \cdot \left| Q_{\pi _{tar}^{l+1}}(s',a)\right| da \\&\ \ \ +\frac{\alpha }{1-\gamma }\widetilde{\mathcal {H}}_{max} \\&\le \frac{1}{1-\gamma } \max \limits _{s' \in \mathcal {S}}\int \left| \widetilde{\pi }_{g}^l(a|s') -\pi _{tar}^{l+1}(a|s')\right| \cdot \frac{\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1}}{1-\gamma }\,da \\&\ \ \ +\frac{\alpha }{1-\gamma }\widetilde{\mathcal {H}}_{max} \end{aligned}$$
$$\begin{aligned}&= \frac{\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1}}{(1-\gamma )^2}\max \limits _{s' \in \mathcal {S}}||\widetilde{\pi }_{g}^l(\cdot |s')-\pi _{tar}^{l+1}(\cdot |s')||_1 \nonumber \\&\ \ \ +\frac{\alpha }{1-\gamma }\widetilde{\mathcal {H}}_{max} \nonumber \\&\le \frac{\sqrt{2\ln 2\,\delta }\,(\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1}) +\alpha (1-\gamma )\widetilde{\mathcal {H}}_{max}}{(1-\gamma )^2}, \end{aligned}$$
(2)

where \(\rho _{s}^{\widetilde{\pi }_{g}^l}(s')=(1-\gamma )\sum _{t=0}^{\infty }\gamma ^tp(s_t=s'|s_0=s,\widetilde{\pi }_{g}^l)\) is the normalized discounted state occupancy distribution. Note that

$$\begin{aligned}&|Q_{\pi _{tar}^{l+1}}(s,a)|\nonumber \\&=\Big |\mathbb {E}_{\pi _{tar}^{l+1}}\Big [\sum _{i=0}^{\infty }\gamma ^i\big (r(s_{\tau +i},a_{\tau +i}) \nonumber \\&\ \ \ -\alpha \log \pi _{tar}^{l+1}(a_{\tau +i}|s_{\tau +i})\big )\,\Big |\,s_\tau =s,a_\tau =a\Big ]\Big | \nonumber \\&\le \mathbb {E}_{\pi _{tar}^{l+1}}\Big [\sum _{i=0}^{\infty }\gamma ^i\big (\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1}\big )\Big ] \end{aligned}$$
(3)
$$\begin{aligned}&=\frac{\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1}}{1-\gamma }. \end{aligned}$$
(4)

Combining (2) with Theorem 1, we finally obtain

$$\begin{aligned}&V_{\pi _{tar}^{l+1}}(s) \nonumber \\&\ge V_{\widetilde{\pi }_{g}^l}(s) - \frac{\sqrt{2\ln 2\delta }(\widetilde{R}_{max} +\alpha \mathcal {H}_{max}^{l+1})+\alpha (1-\gamma )\widetilde{\mathcal {H}}_{max}}{(1-\gamma )^2} \nonumber \\&\ge V_{\pi _{tar}^l}(s)-\frac{\sqrt{2\ln 2\delta }(\widetilde{R}_{max} +\alpha \mathcal {H}_{max}^{l+1})}{(1-\gamma )^2}- \frac{2\mu +\alpha \widetilde{\mathcal {H}}_{max}}{1-\gamma }. \end{aligned}$$
(5)
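
In words (reconstructed from the assumptions used above and the bound in Eq. (5); the formal statement of Theorem 2 appears in the main text): if each update keeps \(D_{KL}(\pi _{tar}^{l+1}(\cdot |s)||\widetilde{\pi }_{g}^l(\cdot |s))\le \delta \) at every state and the critic error is bounded by \(\mu \) as in Theorem 1, then one iteration of the optimization-transfer update can decrease the target policy's value by at most

$$\begin{aligned} \frac{\sqrt{2\ln 2\,\delta }\,(\widetilde{R}_{max}+\alpha \mathcal {H}_{max}^{l+1})}{(1-\gamma )^2}+\frac{2\mu +\alpha \widetilde{\mathcal {H}}_{max}}{1-\gamma }, \end{aligned}$$

which quantifies the approximate policy improvement guarantee referred to in the abstract.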

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, S., Li, H., Zhang, J. et al. IOB: integrating optimization transfer and behavior transfer for multi-policy reuse. Auton Agent Multi-Agent Syst 38, 3 (2024). https://doi.org/10.1007/s10458-023-09630-9

  • DOI: https://doi.org/10.1007/s10458-023-09630-9
