Judgmentally adjusted Q-values based on Q-ensemble for offline reinforcement learning

Original Article
Neural Computing and Applications

Abstract

Recent advancements in offline reinforcement learning (offline RL) have leveraged the Q-ensemble approach to derive optimal policies from static, previously collected datasets. By increasing the batch size, a portion of the Q-ensemble instances that penalize out-of-distribution (OOD) data can be replaced, significantly reducing the Q-ensemble size while maintaining comparable performance and expediting training. To further enhance the Q-ensemble’s ability to penalize OOD data, we combine large-batch punishment with a binary classification network that differentiates in-distribution (ID) data from OOD data. For ID data, positive adjustments to the Q-values are made (reward-based adjustment), whereas negative adjustments (penalty-based adjustment) are applied to OOD data; the latter replaces part of the OOD punishment provided by large Q-ensembles, reducing their size without compromising performance. For different tasks on the D4RL benchmark datasets, we selectively apply one of these two adjustment methods. Experimental results demonstrate that reward-based adjustment improves algorithm performance, while penalty-based adjustment reduces the Q-ensemble size without compromising performance. Compared with LB-SAC, our approach reduces the average convergence time by 38% on the datasets using penalty-based adjustment, thanks to the introduction of a simple binary classification network and a reduced number of Q-networks.

Data availability

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Code availability

The code presented in this study is available on request from the corresponding author. The code is not publicly available due to privacy restrictions.

References

  1. Badia AP, Piot B, Kapturowski S, Sprechmann P, Vitvitskyi A, Guo ZD, Blundell C (2020) Agent57: Outperforming the Atari human benchmark. In: International Conference on Machine Learning, pp. 507–517. PMLR

  2. Berner C, Brockman G, Chan B, Cheung V, Dębiak P, Dennison C, Farhi D, Fischer Q, Hashme S, Hesse C et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680

  3. Baker B, Akkaya I, Zhokov P, Huizinga J, Tang J, Ecoffet A, Houghton B, Sampedro R, Clune J (2022) Video pretraining (vpt): learning to act by watching unlabeled online videos. Adv Neural Inf Process Syst 35:24639–24654


  4. Levine S, Kumar A, Tucker G, Fu J (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643

  5. Agarwal R, Schuurmans D, Norouzi M (2020) An optimistic perspective on offline reinforcement learning. In: International conference on machine learning, pp. 104–114. PMLR

  6. Fujimoto S, Meger D, Precup D (2019) Off-policy deep reinforcement learning without exploration. In: International conference on machine learning, pp. 2052–2062. PMLR

  7. Kumar A, Fu J, Soh M, Tucker G, Levine S (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. Adv Neural Inf Process Syst 32

  8. Kumar A, Zhou A, Tucker G, Levine S (2020) Conservative q-learning for offline reinforcement learning. Adv Neural Inf Process Syst 33:1179–1191


  9. Kostrikov I, Nair A, Levine S (2021) Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169

  10. An G, Moon S, Kim J-H, Song HO (2021) Uncertainty-based offline reinforcement learning with diversified q-ensemble. Adv Neural Inf Process Syst 34:7436–7447


  11. Nikulin A, Kurenkov V, Tarasov D, Akimov D, Kolesnikov S (2022) Q-ensemble for offline rl: Don’t scale the ensemble, scale the batch size. In: 3rd Offline RL Workshop: Offline RL as a "Launchpad"

  12. Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, pp. 1861–1870. PMLR

  13. Fu J, Kumar A, Nachum O, Tucker G, Levine S (2020) D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219

  14. Wu Y, Tucker G, Nachum O (2019) Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361

  15. Fujimoto S, Gu SS (2021) A minimalist approach to offline reinforcement learning. Adv Neural Inf Process Syst 34:20132–20145


  16. Nair A, Gupta A, Dalal M, Levine S (2020) Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359

  17. Ghasemipour K, Gu SS, Nachum O (2022) Why so pessimistic? estimating uncertainties for offline rl through ensembles, and why their independence matters. Adv Neural Inf Process Syst 35:18267–18281


  18. Rezaeifar S, Dadashi R, Vieillard N, Hussenot L, Bachem O, Pietquin O, Geist M (2022) Offline reinforcement learning as anti-exploration. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 8106–8114

  19. Chen X, Ghadirzadeh A, Yu T, Gao Y, Wang J, Li W, Liang B, Finn C, Zhang C (2022) Latent-variable advantage-weighted policy optimization for offline rl. arXiv preprint arXiv:2203.08949

  20. Zhou W, Bajracharya S, Held D (2021) Plas: Latent action space for offline reinforcement learning. In: Conference on Robot Learning, pp. 1719–1735. PMLR

  21. Akimov D, Kurenkov V, Nikulin A, Tarasov D, Kolesnikov S (2022) Let offline rl flow: Training conservative agents in the latent space of normalizing flows. arXiv preprint arXiv:2211.11096

  22. Sheikh H, Frisbee K, Phielipp M (2022) Dns: Determinantal point process based neural network sampler for ensemble reinforcement learning. In: International Conference on Machine Learning, pp. 19731–19746. PMLR

  23. Lee K, Laskin M, Srinivas A, Abbeel P (2021) Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In: International Conference on Machine Learning, pp. 6131–6141. PMLR

  24. Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Adv Neural Inf Process Syst 31

  25. Kurutach T, Clavera I, Duan Y, Tamar A, Abbeel P (2018) Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592

  26. Lai H, Shen J, Zhang W, Yu Y (2020) Bidirectional model-based policy optimization. In: International Conference on Machine Learning, pp. 5618–5627. PMLR

  27. Osband I, Blundell C, Pritzel A, Van Roy B (2016) Deep exploration via bootstrapped dqn. Adv Neural Inf Process Syst 29

  28. Clements WR, Van Delft B, Robaglia B-M, Slaoui RB, Toth S (2019) Estimating risk and uncertainty in deep reinforcement learning. arXiv preprint arXiv:1905.09638

  29. Yu T, Thomas G, Yu L, Ermon S, Zou JY, Levine S, Finn C, Ma T (2020) Mopo: Model-based offline policy optimization. Adv Neural Inf Process Syst 33:14129–14142


  30. Kidambi R, Rajeswaran A, Netrapalli P, Joachims T (2020) Morel: Model-based offline reinforcement learning. Adv Neural Inf Process Syst 33:21810–21823


  31. Hong J, Kumar A, Levine S (2022) Confidence-conditioned value functions for offline reinforcement learning. arXiv preprint arXiv:2212.04607

  32. Ghosh D, Ajay A, Agrawal P, Levine S (2022) Offline rl policies should be trained to be adaptive. In: International Conference on Machine Learning, pp. 7513–7530. PMLR

  33. Pinto L, Gupta A (2016) Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413. IEEE

  34. Levine S, Pastor P, Krizhevsky A, Ibarz J, Quillen D (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int J Robot Res 37(4–5):421–436


  35. Kretzschmar H, Spies M, Sprunk C, Burgard W (2016) Socially compliant mobile robot navigation via inverse reinforcement learning. Int J Robot Res 35(11):1289–1307


  36. Hodge VJ, Hawkins R, Alexander R (2021) Deep reinforcement learning for drone navigation using sensor data. Neural Comput Appl 33:2015–2033


  37. Nilsson J (1998) Real-time control systems with delays

  38. Ramstedt S, Pal C (2019) Real-time reinforcement learning. Adv Neural Inf Process Syst 32

  39. Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596. PMLR

  40. Royston J (1982) Expected normal order statistics (exact and approximate): Algorithm AS 177. Appl Stat 31(2):161–5


  41. Hoffer E, Hubara I, Soudry D (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Adv Neural Inf Process Syst 30

  42. You Y, Gitman I, Ginsburg B (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888

  43. Krizhevsky A (2014) One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997

  44. Lyu J, Ma X, Li X, Lu Z (2022) Mildly conservative q-learning for offline reinforcement learning. Adv Neural Inf Process Syst 35:1711–1724


  45. Reid M, Yamada Y, Gu SS (2022) Can wikipedia help offline reinforcement learning? arXiv preprint arXiv:2201.12122

  46. Seno T, Imai M (2022) d3rlpy: An offline deep reinforcement learning library. J Mach Learn Res 23(1):14205–14224


  47. Kumar A, Agarwal R, Geng X, Tucker G, Levine S (2022) Offline q-learning on diverse multi-task data both scales and generalizes. arXiv preprint arXiv:2211.15144

  48. Smith L, Kostrikov I, Levine S (2022) A walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning. arXiv preprint arXiv:2208.07860

  49. Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450

  50. Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  51. Yu T, Kumar A, Rafailov R, Rajeswaran A, Levine S, Finn C (2021) Combo: Conservative offline model-based policy optimization. Adv Neural Inf Process Syst 34:28954–28967


  52. Rigter M, Lacerda B, Hawes N (2022) Rambo-rl: Robust adversarial model-based offline reinforcement learning. Adv Neural Inf Process Syst 35:16082–16097



Funding

This study was funded by the National Key Research and Development Program of China (2021YFB2801900, 2021YFB2801901, 2021YFB2801902, 2021YFB2801904); the National Natural Science Foundation of China (No. 61974177, No. 61674119); the National Outstanding Youth Science Fund Project of National Natural Science Foundation of China (62022062); the Fundamental Research Funds for the Central Universities (QTZX23041).

Author information

Contributions

S.X. and W.L. helped in conceptualization, methodology, formal analysis; W.L. contributed to software; T.Z. and W.L. validated the study; W.L. and Y.H. (Yanan Han) curated the data; W.L. and S.X. helped in writing—original draft preparation; T.Z., Y.Z., and X.G. were involved in writing—review and editing; T.Z. visualized the study; S.X. and Y.H. (Yue Hao) helped in supervision and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Shuying Xiang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Limitations

See Tables 5, 6, 7, 8, 9, 10, 11, 12.


Overfitting. The authors of [46] pointed out that the best checkpoint encountered during training significantly outperforms the best performance typically reported in deep offline RL papers. This phenomenon may be attributed to a form of overfitting, as performance is often observed to degrade as the number of learning steps grows. JAQ-SAC experiences similar issues on some datasets; we specifically discuss the possible overfitting on the hopper-medium dataset in Appendix B, where introducing a learning-rate decay strategy partially addresses the issue. Understanding how to reduce overfitting and stabilize the learned optimal policy during training is an intriguing direction for offline RL. However, because test environments are dynamic, methodologies for studying overfitting properties in offline RL setups have yet to be developed.


The roughness of the classification performed by the binary classification network. While we emphasize that our binary classification network can acquire reasonably reliable prior knowledge, its classification may be rough in tasks that are highly sensitive to OOD data. For reward-based adjustment, this rough classification has a dual impact on algorithm performance. On the one hand, it may label as ID some data that is objectively OOD yet beneficial for finding better policies; in this case, the roughness can be interpreted as the generalization ability of the binary classification network. On the other hand, it may also misclassify as ID some OOD data that is not conducive to training, affecting the algorithm’s stability. The hyperparameter \(\rho\) helps, to some extent, to reduce this instability and achieve better performance.
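As an illustration, the sketch below shows one possible form of such an ID-versus-OOD classifier for (state, action) pairs. The actual architecture and training hyperparameters are those listed in Table 8; the layer sizes and the way negative (OOD) samples are drawn here are illustrative assumptions, not the exact configuration used in JAQ-SAC.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDClassifier(nn.Module):
    """Binary classifier scoring whether a (state, action) pair is in-distribution."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit > 0 means "in-distribution"
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def classifier_loss(clf, states, dataset_actions, ood_actions):
    """Dataset actions are labeled ID (1); actions drawn away from the dataset
    (e.g. sampled from the current policy or uniform noise) are labeled OOD (0)."""
    logits_id = clf(states, dataset_actions)
    logits_ood = clf(states, ood_actions)
    return (F.binary_cross_entropy_with_logits(logits_id, torch.ones_like(logits_id))
            + F.binary_cross_entropy_with_logits(logits_ood, torch.zeros_like(logits_ood)))
```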

For penalty-based adjustment, the ideal scenario would be to reduce the number of Q-ensemble members from N to 2, relying solely on the anti-exploration rewards provided by the penalty-based adjustment to penalize OOD data. However, the rough classification of the binary classification network does not allow us to uniformly reduce N to 2 on most datasets. Instead, we combine the basic uncertainty penalty with the prior-knowledge penalty from the binary classification network, so that OOD data is sufficiently penalized while the penalty strength applied to ID data is minimally affected. Although we cannot completely eliminate the Q-ensemble, we achieve similar performance with fewer ensemble members, indicating that our approach remains advantageous compared with the original algorithm.
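The sketch below illustrates, under our own assumptions, how such a combined penalty could enter the TD target: the minimum over a reduced Q-ensemble supplies the basic uncertainty penalty, and the classifier’s OOD probability, scaled by \(\rho\), supplies the prior-knowledge (anti-exploration) penalty. The exact weighting and sign conventions used in JAQ-SAC are those of Algorithm 1; this is only a minimal reading of the description above, with `clf` being the classifier from the previous sketch.

```python
import torch

@torch.no_grad()
def adjusted_td_target(q_ensemble, clf, reward, next_state, next_action,
                       next_log_prob, gamma=0.99, alpha=0.2, rho=1.0):
    # Basic uncertainty penalty: minimum over the (reduced) Q-ensemble.
    q_next = torch.stack([q(next_state, next_action) for q in q_ensemble])
    q_min = q_next.min(dim=0).values

    # Prior-knowledge penalty: the classifier's estimated probability that
    # (next_state, next_action) is out-of-distribution, scaled by rho.
    p_ood = 1.0 - torch.sigmoid(clf(next_state, next_action))
    anti_exploration = rho * p_ood

    # SAC-style entropy term plus the combined penalty.
    return reward + gamma * (q_min - alpha * next_log_prob - anti_exploration)
```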

Appendix B Tricks

During the experiments, we employed certain techniques for specific datasets to achieve stability or more competitive results. For instance, in the hopper-medium dataset, reducing the number of critics from 25 to 15 did not yield stable training results, even when maintaining the original layer normalization for the critics. To address this instability during the later stages of training, we introduced a learning rate decay mechanism. Table 9 provides the specific parameter settings for the learning rate decay. Through careful learning rate decay, we were able to ensure the algorithm’s stability in the later stages of training.
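As a concrete illustration, such a late-stage learning-rate decay can be implemented with a standard scheduler. The milestone and decay factor below are placeholders; the values actually used for hopper-medium are those listed in Table 9, and the critic here is a stand-in for the real network.

```python
import torch
import torch.nn as nn

# Placeholder critic; the real architecture follows the hyperparameters in Tables 7-8.
critic = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4)

# Keep the learning rate constant early on, then decay it in the later stage
# of training where instability was observed (milestone and gamma are placeholders).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[2_000_000], gamma=0.1)

# Inside the training loop, step the scheduler once per gradient update:
#   optimizer.step()
#   scheduler.step()
```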

Additionally, for the walker2d-expert dataset, we applied layer normalization to the critics (which was not used in LB-SAC for this task) and successfully reduced the number of critics from 30 to 10 through penalty-based adjustment. This outcome was achieved after several attempts, including (1) using the same number of critics (10) without layer normalization and (2) using more critics (15) without layer normalization.

Through hyperparameter tuning of \(\rho\), we obtained similar performance (around 107) in both cases: case 1 required larger \(\rho\) values for stability, and case 2 required more critics for stable performance. However, with 10 critics and layer normalization, we achieved higher performance (\(112.3\pm 0.3\)) while still penalizing OOD data sufficiently, because layer normalization allowed us to learn an appropriately conservative policy with fewer critics. Note that layer normalization was not applied to all datasets, as it can result in a more conservative policy, which is not desirable in every scenario.
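For reference, the sketch below shows a critic with layer normalization [49] inserted after each hidden layer, as applied here to walker2d-expert; the hidden width is illustrative rather than the exact value from the hyperparameter tables.

```python
import torch
import torch.nn as nn

class LayerNormCritic(nn.Module):
    """Q(s, a) critic with LayerNorm after each hidden layer (cf. [49])."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)
```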

Appendix C Detailed experiments setup

In our experiments, we used the hyperparameters listed in Table 5, with \(\rho\) being tuned within an appropriate range. Similar to [11, 47, 48], in some datasets, we added layer normalization [49] after each layer of the critic network to improve stability and convergence time. All experiments were conducted on RTX 3090 GPUs.

For Gym domain datasets, we used version v2. In the final experiments, we employed 4 different seeds and reported the final average normalized score over 10 evaluation episodes. When conducting hyperparameter tuning, we used fewer seeds for experimentation. Initially, we adopted a Normal Sample Scheme and reward-based adjustment to help the algorithm achieve higher performance. This approach was successful on the walker2d-medium and walker2d-medium-replay datasets. However, for other datasets, introducing reward-based adjustment resulted in unstable policy scores during training or did not yield significant performance improvements.
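For clarity, the evaluation protocol amounts to the sketch below: average the normalized score over 10 evaluation episodes, then average that quantity over the 4 training seeds. The `policy` callable is hypothetical; `get_normalized_score` is D4RL’s standard conversion from episode return to normalized score.

```python
import gym
import d4rl  # noqa: F401 -- registers the D4RL environments
import numpy as np

def evaluate(policy, env_name="walker2d-medium-v2", episodes=10):
    """Average normalized score over `episodes` evaluation episodes."""
    env = gym.make(env_name)
    scores = []
    for _ in range(episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ep_return += reward
        scores.append(env.get_normalized_score(ep_return) * 100.0)
    return float(np.mean(scores))

# The reported number is the mean of evaluate(...) over 4 training seeds.
```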

Subsequently, we switched to penalty-based adjustment, starting with a critic count of \(\min(2, N-5)\), where \(N\) is the number of critics used by LB-SAC. If detailed hyperparameter tuning could not find a suitable value of \(\rho\) within the current range, we increased the critic count and repeated the search until the best \(\rho\) was found. For tasks using reward-based adjustment, we searched for suitable values in the range \(\{0.25, 0.5, 0.75, 1.0\}\). For tasks using penalty-based adjustment, we ran LB-SAC with a single seed and recorded the mean squared error (MSE) metric during training. By comparing the MSE of JAQ-SAC with that of LB-SAC, we could dynamically adjust the value of \(\rho\). In general, a larger \(\rho\) resulted in a smaller converged MSE, indicating that the learned policy’s action values were closer to the action values in the dataset; this is because a larger \(\rho\) produces a greater anti-exploration reward for OOD data, stronger penalization, and a more conservative learned policy. The \(\rho\) hyperparameter was adjusted dynamically with a minimum step of 0.25 and a maximum value of 2.5, allowing each dataset to find an appropriate \(\rho\), as shown in Table 5.
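One plausible reading of this search procedure is sketched below. `train_jaq_sac` and `lb_sac_reference_mse` are hypothetical stand-ins for a full training run at a given \(\rho\) and for the recorded LB-SAC MSE, respectively; the stopping rule is an assumption rather than the exact criterion used in our experiments.

```python
def search_rho(train_jaq_sac, lb_sac_reference_mse,
               rho_min=0.25, rho_max=2.5, step=0.25, tol=0.05):
    """Increase rho in steps of 0.25 until the converged critic MSE of JAQ-SAC
    is comparable to the LB-SAC reference (a larger rho gives a stronger
    anti-exploration reward and hence a smaller converged MSE)."""
    rho = rho_min
    while rho <= rho_max:
        mse = train_jaq_sac(rho)  # converged MSE of one training run
        if mse <= lb_sac_reference_mse * (1.0 + tol):
            return rho
        rho += step
    return None  # no suitable rho in range: increase the critic count and retry
```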

Table 5 Environment-specific parameters for JAQ-SAC. The "method" parameter has two options, "p" (penalty) and "r" (reward), indicating whether the algorithm uses penalty-based or reward-based adjustment for that dataset
Table 6 Number of critics for each algorithm
Table 7 Hyperparameters shared by SAC-N, LB-SAC, EDAC, and JAQ-SAC
Table 8 Binary network architecture and training hyperparameters
Table 9 Training hyperparameters for SAC-N, EDAC, LB-SAC, and JAQ-SAC

Appendix D Additional results

Table 10 Evaluation of Q-ensemble methods in the high-dimensional state and action space (total of 119 dimensions) for the ant-medium task
Table 11 Normalized scores of our method compared with other offline reinforcement learning algorithms

We also conducted experiments in the Maze2D environment. Maze2D is a navigation task in which the offline RL algorithm is expected to stitch together suboptimal trajectories to find the shortest path to the goal; it includes umaze, medium, and large environments, as shown in Fig. 11. The three environments maze2d-umaze-v1, maze2d-medium-v1, and maze2d-large-v1 use a sparse reward, which has a value of 1.0 when the agent (light green ball) is within a 0.5-unit radius of the target (light red ball).
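Concretely, the sparse reward described above has the following form; this is an illustrative restatement of the D4RL maze2d reward, not code taken from the benchmark itself.

```python
import numpy as np

def maze2d_sparse_reward(agent_xy: np.ndarray, target_xy: np.ndarray) -> float:
    """1.0 when the agent is within a 0.5-unit radius of the target, else 0.0."""
    return 1.0 if np.linalg.norm(agent_xy - target_xy) <= 0.5 else 0.0
```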

Fig. 11

Illustrations of three different difficulty maze environments. "umaze" denotes a simple and unstructured maze environment, where the agent needs to learn navigation skills, avoid obstacles, and reach the goal. "medium" represents a moderately sized maze environment, potentially featuring more obstacles or larger spatial dimensions compared to "umaze." Navigating such environments may require the agent to employ more sophisticated strategies for effective task completion. "large" signifies a larger-scale maze environment, typically characterized by increased complexity with more obstacles, branching paths, or larger spatial extents. In such environments, the agent needs enhanced learning capabilities to successfully address navigation challenges

Table 12 Results of the Maze2D experiments

Appendix E Convergence time

See Table 13.

Table 13 Convergence time (in minutes) for D4RL datasets using penalty-based adjustment

Appendix F Hyperparameter sensitivity

See Table 14.

Table 14 Impact of the hyperparameter \(\rho\) on algorithm performance (normalized score) in the Walker2d dataset

Appendix G Pseudocode

Algorithm 1

Judgmentally adjusted Q-value estimation for SAC (JAQ-SAC)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, W., Xiang, S., Zhang, T. et al. Judgmentally adjusted Q-values based on Q-ensemble for offline reinforcement learning. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09839-z
