Abstract
Humans have the ability to reuse previously learned policies to solve new tasks quickly, and reinforcement learning (RL) agents can do the same by transferring knowledge from source policies to a related target task. Transfer RL methods can reshape the policy optimization objective (optimization transfer) or influence the behavior policy (behavior transfer) using source policies. However, selecting the appropriate source policy to guide target policy learning with limited samples remains a challenge. Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions, which can lead to non-stationary policy optimization or heavy sampling costs, diminishing transfer effectiveness. To address this challenge, we propose a novel transfer RL method that selects the source policy without training extra components. Our method utilizes the Q function in the actor-critic framework to guide policy selection, choosing the source policy with the largest one-step improvement over the current target policy. We integrate optimization transfer and behavior transfer (IOB) by regularizing the learned policy to mimic the guidance policy and combining them as the behavior policy. This integration significantly enhances transfer effectiveness, surpasses state-of-the-art transfer RL baselines in benchmark tasks, and improves final performance and knowledge transferability in continual learning scenarios. Additionally, we show that our optimization transfer technique is guaranteed to improve target policy learning.
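To make the selection rule concrete, below is a minimal sketch of the Q-guided policy selection in the actor-critic framework. The names (`select_guidance`, `q_fn`, `candidate_policies`) and the Monte Carlo estimation of the expected Q value are our illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the Q-guided source policy selection described in the
# abstract. All names here are hypothetical, not the paper's released code.
import numpy as np

def select_guidance(state, q_fn, candidate_policies, n_samples=16):
    """Return the candidate with the largest estimated E_{a~pi(.|s)}[Q(s, a)].

    q_fn(state, action) -> float: the critic of the actor-critic learner.
    candidate_policies: the source policies plus the current target policy,
    so the chosen guidance policy is a one-step improvement over the target
    by construction.
    """
    def expected_q(policy):
        # Monte Carlo estimate of the expected Q value under this policy.
        return np.mean([q_fn(state, policy(state)) for _ in range(n_samples)])

    return max(candidate_policies, key=expected_q)
```

Under the IOB scheme, the selected guidance policy would then serve two roles: as a regularization target that the learned policy is trained to mimic (optimization transfer), and as a component of the behavior policy used to collect data (behavior transfer).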
References
Guberman, S. R., & Greenfield, P. M. (1991). Learning and transfer in everyday cognition. Cognitive Development, 6(3), 233–260.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., & Bolton, A. (2017). Mastering the game of go without human knowledge. Nature, 550(7676), 354–359.
Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., & Georgiev, P. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354.
Ceron, J. S. O., & Castro, P. S. (2021). Revisiting Rainbow: Promoting more insightful and inclusive deep reinforcement learning research. In: International Conference on Machine Learning, pp. 1373–1383. PMLR
Fernández, F., & Veloso, M. (2006). Probabilistic policy reuse in a reinforcement learning agent. In: Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 720–727
Barreto, A., Borsa, D., Quan, J., Schaul, T., Silver, D., Hessel, M., Mankowitz, D., Zidek, A., & Munos, R. (2018). Transfer in deep reinforcement learning using successor features and generalised policy improvement. In: International Conference on Machine Learning, pp. 501–510. PMLR
Li, S., Wang, R., Tang, M., & Zhang, C. (2019). Hierarchical reinforcement learning with advantage-based auxiliary rewards. Advances in Neural Information Processing Systems 32
Yang, T., Hao, J., Meng, Z., Zhang, Z., Hu, Y., Chen, Y., Fan, C., Wang, W., Liu, W., Wang, Z., & Peng, J. (2020). Efficient deep reinforcement learning via adaptive policy transfer. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 3094–3100
Zhang, J., Li, S., & Zhang, C. (2022). CUP: Critic-guided policy reuse. In: Advances in Neural Information Processing Systems
Li, S., Gu, F., Zhu, G., & Zhang, C. (2018). Context-aware policy reuse. arXiv preprint arXiv:1806.03793
Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., & Pascanu, R. (2017). Distral: Robust multitask reinforcement learning. Advances in Neural Information Processing Systems 30
Barreto, A., Dabney, W., Munos, R., Hunt, J.J., Schaul, T., Hasselt, H.P., & Silver, D. (2017). Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems 30
Cheng, C.-A., Kolobov, A., & Agarwal, A. (2020). Policy improvement via imitation of multiple oracles. Advances in Neural Information Processing Systems, 33, 5587–5598.
Pateria, S., Subagdja, B., Tan, A.-H., & Quek, C. (2021). Hierarchical reinforcement learning: A comprehensive survey. ACM Computing Surveys (CSUR), 54(5), 1–35.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In: ICLR (Poster)
Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596. PMLR
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. (2018). Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905
Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., & Levine, S. (2020). Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In: Conference on Robot Learning, pp. 1094–1100. PMLR
Zhu, Z., Lin, K., & Zhou, J. (2020). Transfer learning in deep reinforcement learning: A survey. arXiv preprint arXiv:2009.07888
Parisotto, E., Ba, J.L., & Salakhutdinov, R. (2015). Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342
Hou, Y., Ong, Y.-S., Feng, L., & Zurada, J. M. (2017). An evolutionary transfer reinforcement learning framework for multiagent systems. IEEE Transactions on Evolutionary Computation, 21(4), 601–615.
Laroche, R., & Barlier, M. (2017). Transfer reinforcement learning with shared dynamics. In: Thirty-First AAAI Conference on Artificial Intelligence
Lehnert, L., & Littman, M. L. (2020). Successor features combine elements of model-free and model-based reinforcement learning. Journal of Machine Learning Research, 21(196), 1–53.
Barekatain, M., Yonetani, R., & Hamaya, M. (2020). Multipolar: Multi-source policy aggregation for transfer reinforcement learning between diverse environmental dynamics. In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 3108–3116
Li, S., & Zhang, C. (2018). An optimal online method of selecting source policies for reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32
Gimelfarb, M., Sanner, S., & Lee, C.-G. (2021). Contextual policy transfer in reinforcement learning domains via deep mixtures-of-experts. In: Uncertainty in Artificial Intelligence, pp. 1787–1797. PMLR
Yang, X., Ji, Z., Wu, J., Lai, Y.-K., Wei, C., Liu, G., & Setchi, R. (2021). Hierarchical reinforcement learning with universal policies for multistep robotic manipulation. IEEE Transactions on Neural Networks and Learning Systems, 33(9), 4727–4741.
Rusu, A.A., Rabinowitz, N.C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., & Hadsell, R. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671
Berseth, G., Xie, C., Cernek, P., & Panne, M. (2018). Progressive reinforcement learning with distillation for multi-skilled motion control. arXiv preprint arXiv:1802.04765
Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y.W., Pascanu, R., & Hadsell, R. (2018). Progress & compress: A scalable framework for continual learning. In: International Conference on Machine Learning, pp. 4528–4537. PMLR
Mallya, A., & Lazebnik, S. (2018). Packnet: Adding multiple tasks to a single network by iterative pruning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7765–7773
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., & Abbeel, P. (2016). RL\(^2\): Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, pp. 1126–1135. PMLR
Rakelly, K., Zhou, A., Finn, C., Levine, S., & Quillen, D. (2019). Efficient off-policy meta-reinforcement learning via probabilistic context variables. In: International Conference on Machine Learning, pp. 5331–5340. PMLR
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, pp. 1861–1870. PMLR
Lan, Q., Pan, Y., Fyshe, A., & White, M. (2020). Maxmin Q-learning: Controlling the estimation bias of Q-learning. arXiv preprint arXiv:2002.06487
Kuznetsov, A., Shvechikov, P., Grishin, A., & Vetrov, D. (2020). Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In: International Conference on Machine Learning, pp. 5556–5566. PMLR
Zhang, S., & Sutton, R.S. (2017). A deeper look at experience replay. arXiv preprint arXiv:1712.01275
Fedus, W., Ramachandran, P., Agarwal, R., Bengio, Y., Larochelle, H., Rowland, M., & Dabney, W. (2020). Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, pp. 3061–3071. PMLR
Khetarpal, K., Riemer, M., Rish, I., & Precup, D. (2020). Towards continual reinforcement learning: A review and perspectives. arXiv preprint arXiv:2012.13490
Wołczyk, M., Zając, M., Pascanu, R., Kuciński, Ł., & Miłoś, P. (2021). Continual world: A robotic benchmark for continual reinforcement learning. Advances in Neural Information Processing Systems, 34, 28496–28510.
Yang, R., Xu, H., Wu, Y., & Wang, X. (2020). Multi-task reinforcement learning with soft modularization. Advances in Neural Information Processing Systems, 33, 4767–4777.
Sodhani, S., Zhang, A., & Pineau, J. (2021). Multi-task reinforcement learning with context-based representations. In: International Conference on Machine Learning, pp. 9767–9779. PMLR
Wan, M., Gangwani, T., & Peng, J. (2020). Mutual information based knowledge transfer under state-action dimension mismatch. arXiv preprint arXiv:2006.07041
Zhang, Q., Xiao, T., Efros, A.A., Pinto, L., & Wang, X. (2020). Learning cross-domain correspondence for control with dynamics cycle-consistency. arXiv preprint arXiv:2012.09811
You, H., Yang, T., Zheng, Y., Hao, J., & Taylor, M. E. (2022). Cross-domain adaptive transfer reinforcement learning based on state-action correspondence. In: The 38th Conference on Uncertainty in Artificial Intelligence
Pol, E., Worrall, D., Hoof, H., Oliehoek, F., & Welling, M. (2020). MDP homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems, 33, 4199–4210.
Pol, E., Kipf, T., Oliehoek, F.A., & Welling, M. (2020). Plannable approximations to MDP homomorphisms: Equivariance under actions. arXiv preprint arXiv:2002.11963
Fedotov, A. A., Harremoës, P., & Topsoe, F. (2003). Refinements of Pinsker's inequality. IEEE Transactions on Information Theory, 49(6), 1491–1498.
Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning
Acknowledgements
This work is supported by the Key Program of the National Natural Science Foundation of China (Grant No. 51935005), the National Natural Science Foundation of China (Grant No. 62306088), the Basic Research Project (Grant No. JCKY20200603C010), the Natural Science Foundation of Heilongjiang Province of China (Grant No. LH2021F023), the Science and Technology Planning Project of Heilongjiang Province of China (Grant No. GA21C031), and the China Academy of Launch Vehicle Technology (CALT2022-18).
Author information
Authors and Affiliations
Contributions
SL and HL wrote the main manuscript text, while JZ performed the experiments. PL provided funding support and contributed to manuscript revisions. ZW and CZ revised the manuscript. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
A.1 Proof of Theorem 1
Proof
As \(|\widetilde{Q}_{\pi_{tar}}(s,a)-Q_{\pi_{tar}}(s,a)|\le \mu\) for all \(s \in \mathcal{S}, a \in \mathcal{A}\), we have that for all \(s \in \mathcal{S}\), the difference between the true value function \(V_{\pi_{tar}}\) and the approximated value function \(\widetilde{V}_{\pi_{tar}}\) is bounded:
$$\big|\widetilde{V}_{\pi_{tar}}(s)-V_{\pi_{tar}}(s)\big|=\Big|\mathbb{E}_{a\sim\pi_{tar}(\cdot|s)}\big[\widetilde{Q}_{\pi_{tar}}(s,a)-Q_{\pi_{tar}}(s,a)\big]\Big|\le \mu.$$
As \(\pi_{tar}(\cdot|s)\) is contained in \(\Pi^s\), with \(\widetilde{\pi}_{g}\) defined in Eq. (9), it follows immediately that for all \(s \in \mathcal{S}\),
$$\mathbb{E}_{a\sim\widetilde{\pi}_{g}(\cdot|s)}\big[\widetilde{Q}_{\pi_{tar}}(s,a)\big]\ge \mathbb{E}_{a\sim\pi_{tar}(\cdot|s)}\big[\widetilde{Q}_{\pi_{tar}}(s,a)\big]=\widetilde{V}_{\pi_{tar}}(s).$$
Then for all \(s_i \in \mathcal{S}\),
$$\mathbb{E}_{a\sim\widetilde{\pi}_{g}(\cdot|s_i)}\big[Q_{\pi_{tar}}(s_i,a)\big]\ge \mathbb{E}_{a\sim\widetilde{\pi}_{g}(\cdot|s_i)}\big[\widetilde{Q}_{\pi_{tar}}(s_i,a)\big]-\mu\ge \widetilde{V}_{\pi_{tar}}(s_i)-\mu\ge V_{\pi_{tar}}(s_i)-2\mu,$$
that is, the guidance policy \(\widetilde{\pi}_{g}\) achieves a one-step improvement over \(\pi_{tar}\) up to an error of \(2\mu\). \(\square\)
A.2 Proof of Theorem 2
Proof
According to Pinsker's inequality [49], \(D_{KL}(\pi_{tar}^{l+1}(\cdot|s)\,||\,\widetilde{\pi}_{g}^l(\cdot|s))\ge \frac{1}{2\ln 2}||\pi_{tar}^{l+1}(\cdot|s)-\widetilde{\pi}_{g}^l(\cdot|s)||_1^2\), where \(||\cdot||_1\) is the L1 norm. So we have that for all \(s \in \mathcal{S}\), \(||\pi_{tar}^{l+1}(\cdot|s)-\widetilde{\pi}_{g}^l(\cdot|s)||_1 \le \sqrt{2\ln 2\,\delta}\). According to the Performance Difference Lemma [50], we have that for all \(s \in \mathcal{S}\):
$$V_{\widetilde{\pi}_{g}^l}(s)-V_{\pi_{tar}^{l+1}}(s)=\frac{1}{1-\gamma}\,\mathbb{E}_{s'\sim\rho_{s}^{\widetilde{\pi}_{g}^l}}\,\mathbb{E}_{a\sim\widetilde{\pi}_{g}^l(\cdot|s')}\big[A_{\pi_{tar}^{l+1}}(s',a)\big],$$
where \(\rho_{s}^{\widetilde{\pi}_{g}^l}(s')=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s_t=s'|s_0=s,\widetilde{\pi}_{g}^l)\) is the normalized discounted state occupancy distribution. Note that \(\mathbb{E}_{a\sim\pi_{tar}^{l+1}(\cdot|s')}\big[A_{\pi_{tar}^{l+1}}(s',a)\big]=0\), so for all \(s'\in\mathcal{S}\),
$$\mathbb{E}_{a\sim\widetilde{\pi}_{g}^l(\cdot|s')}\big[A_{\pi_{tar}^{l+1}}(s',a)\big]\le ||\widetilde{\pi}_{g}^l(\cdot|s')-\pi_{tar}^{l+1}(\cdot|s')||_1\,\max_{a}\big|A_{\pi_{tar}^{l+1}}(s',a)\big|\le \sqrt{2\ln 2\,\delta}\,\max_{s,a}\big|A_{\pi_{tar}^{l+1}}(s,a)\big|.$$
Eventually, we have for all \(s \in \mathcal{S}\):
$$V_{\pi_{tar}^{l+1}}(s)\ge V_{\widetilde{\pi}_{g}^l}(s)-\frac{\sqrt{2\ln 2\,\delta}}{1-\gamma}\,\max_{s',a}\big|A_{\pi_{tar}^{l+1}}(s',a)\big|. \qquad \square$$
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, S., Li, H., Zhang, J. et al. IOB: integrating optimization transfer and behavior transfer for multi-policy reuse. Auton Agent Multi-Agent Syst 38, 3 (2024). https://doi.org/10.1007/s10458-023-09630-9