Judgmentally adjusted Q-values based on Q-ensemble for offline reinforcement learning

Original Article
Neural Computing and Applications

Abstract

Recent advancements in offline reinforcement learning (offline RL) have leveraged the Q-ensemble approach to derive optimal policies from static, previously collected datasets. By increasing the batch size, a portion of the Q-ensemble instances that penalize out-of-distribution (OOD) data can be replaced, significantly reducing the Q-ensemble size while maintaining comparable performance and expediting training. To further enhance the Q-ensemble’s ability to penalize OOD data, we combine large-batch punishment with a binary classification network that differentiates in-distribution (ID) data from OOD data. For ID data, positive adjustments to the Q-values are made (reward-based adjustment), whereas negative adjustments (penalty-based adjustment) are applied to OOD data; the latter replaces part of the OOD punishment provided by large Q-ensembles, reducing their size without compromising performance. For different tasks on the D4RL benchmark datasets, we selectively apply one of these two adjustment methods. Experimental results demonstrate that reward-based adjustment improves algorithm performance, while penalty-based adjustment reduces the Q-ensemble size without compromising performance. Compared with LB-SAC, our approach reduces the average convergence time by 38% on the datasets using penalty-based adjustment, thanks to the introduction of a simple binary classification network and a reduced number of Q-networks.

Data availability

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Code availability

The code presented in this study is available on request from the corresponding author. The code is not publicly available due to privacy restrictions.

References

  1. Badia AP, Piot B, Kapturowski S, Sprechmann P, Vitvitskyi A, Guo ZD, Blundell C (2020) Agent57: Outperforming the Atari human benchmark. In: International Conference on Machine Learning, pp. 507–517. PMLR

  2. Berner C, Brockman G, Chan B, Cheung V, Dębiak P, Dennison C, Farhi D, Fischer Q, Hashme S, Hesse C et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680

  3. Baker B, Akkaya I, Zhokov P, Huizinga J, Tang J, Ecoffet A, Houghton B, Sampedro R, Clune J (2022) Video pretraining (vpt): learning to act by watching unlabeled online videos. Adv Neural Inf Process Syst 35:24639–24654


  4. Levine S, Kumar A, Tucker G, Fu J (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643

  5. Agarwal R, Schuurmans D, Norouzi M (2020) An optimistic perspective on offline reinforcement learning. In: International conference on machine learning, pp. 104–114. PMLR

  6. Fujimoto S, Meger D, Precup D (2019) Off-policy deep reinforcement learning without exploration. In: International conference on machine learning, pp. 2052–2062. PMLR

  7. Kumar A, Fu J, Soh M, Tucker G, Levine S (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. Adv Neural Inf Process Syst 32

  8. Kumar A, Zhou A, Tucker G, Levine S (2020) Conservative q-learning for offline reinforcement learning. Adv Neural Inf Process Syst 33:1179–1191


  9. Kostrikov I, Nair A, Levine S (2021) Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169

  10. An G, Moon S, Kim J-H, Song HO (2021) Uncertainty-based offline reinforcement learning with diversified q-ensemble. Adv Neural Inf Process Syst 34:7436–7447


  11. Nikulin A, Kurenkov V, Tarasov D, Akimov D, Kolesnikov S (2022) Q-ensemble for offline rl: Don’t scale the ensemble, scale the batch size. In: 3rd Offline RL Workshop: Offline RL as a "Launchpad"

  12. Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, pp. 1861–1870. PMLR

  13. Fu J, Kumar A, Nachum O, Tucker G, Levine S (2020) D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219

  14. Wu Y, Tucker G, Nachum O (2019) Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361

  15. Fujimoto S, Gu SS (2021) A minimalist approach to offline reinforcement learning. Adv Neural Inf Process Syst 34:20132–20145


  16. Nair A, Gupta A, Dalal M, Levine S (2020) Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359

  17. Ghasemipour K, Gu SS, Nachum O (2022) Why so pessimistic? estimating uncertainties for offline rl through ensembles, and why their independence matters. Adv Neural Inf Process Syst 35:18267–18281


  18. Rezaeifar S, Dadashi R, Vieillard N, Hussenot L, Bachem O, Pietquin O, Geist M (2022) Offline reinforcement learning as anti-exploration. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 8106–8114

  19. Chen X, Ghadirzadeh A, Yu T, Gao Y, Wang J, Li W, Liang B, Finn C, Zhang C (2022) Latent-variable advantage-weighted policy optimization for offline rl. arXiv preprint arXiv:2203.08949

  20. Zhou W, Bajracharya S, Held D (2021) Plas: Latent action space for offline reinforcement learning. In: Conference on Robot Learning, pp. 1719–1735. PMLR

  21. Akimov D, Kurenkov V, Nikulin A, Tarasov D, Kolesnikov S (2022) Let offline rl flow: Training conservative agents in the latent space of normalizing flows. arXiv preprint arXiv:2211.11096

  22. Sheikh H, Frisbee K, Phielipp M (2022) Dns: Determinantal point process based neural network sampler for ensemble reinforcement learning. In: International Conference on Machine Learning, pp. 19731–19746. PMLR

  23. Lee K, Laskin M, Srinivas A, Abbeel P (2021) Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In: International Conference on Machine Learning, pp. 6131–6141. PMLR

  24. Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Adv Neural Inf Process Syst 31

  25. Kurutach T, Clavera I, Duan Y, Tamar A, Abbeel P (2018) Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592

  26. Lai H, Shen J, Zhang W, Yu Y (2020) Bidirectional model-based policy optimization. In: International Conference on Machine Learning, pp. 5618–5627. PMLR

  27. Osband I, Blundell C, Pritzel A, Van Roy B (2016) Deep exploration via bootstrapped dqn. Adv Neural Inf Process Syst 29

  28. Clements WR, Van Delft B, Robaglia B-M, Slaoui RB, Toth S (2019) Estimating risk and uncertainty in deep reinforcement learning. arXiv preprint arXiv:1905.09638

  29. Yu T, Thomas G, Yu L, Ermon S, Zou JY, Levine S, Finn C, Ma T (2020) Mopo: Model-based offline policy optimization. Adv Neural Inf Process Syst 33:14129–14142


  30. Kidambi R, Rajeswaran A, Netrapalli P, Joachims T (2020) Morel: Model-based offline reinforcement learning. Adv Neural Inf Process Syst 33:21810–21823


  31. Hong J, Kumar A, Levine S (2022) Confidence-conditioned value functions for offline reinforcement learning. arXiv preprint arXiv:2212.04607

  32. Ghosh D, Ajay A, Agrawal P, Levine S (2022) Offline rl policies should be trained to be adaptive. In: International Conference on Machine Learning, pp. 7513–7530. PMLR

  33. Pinto L, Gupta A (2016) Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413. IEEE

  34. Levine S, Pastor P, Krizhevsky A, Ibarz J, Quillen D (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int J Robot Res 37(4–5):421–436


  35. Kretzschmar H, Spies M, Sprunk C, Burgard W (2016) Socially compliant mobile robot navigation via inverse reinforcement learning. Int J Robot Res 35(11):1289–1307


  36. Hodge VJ, Hawkins R, Alexander R (2021) Deep reinforcement learning for drone navigation using sensor data. Neural Comput Appl 33:2015–2033


  37. Nilsson J (1998) Real-time control systems with delays

  38. Ramstedt S, Pal C (2019) Real-time reinforcement learning. Adv Neural Inf Process Syst 32

  39. Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596. PMLR

  40. Royston J (1982) Expected normal order statistics (exact and approximate): Algorithm AS 177. Appl Stat 31(2):161–5


  41. Hoffer E, Hubara I, Soudry D (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Adv Neural Inf Process Syst 30

  42. You Y, Gitman I, Ginsburg B (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888

  43. Krizhevsky A (2014) One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997

  44. Lyu J, Ma X, Li X, Lu Z (2022) Mildly conservative q-learning for offline reinforcement learning. Adv Neural Inf Process Syst 35:1711–1724


  45. Reid M, Yamada Y, Gu SS (2022) Can wikipedia help offline reinforcement learning? arXiv preprint arXiv:2201.12122

  46. Seno T, Imai M (2022) d3rlpy: An offline deep reinforcement learning library. J Mach Learn Res 23(1):14205–14224


  47. Kumar A, Agarwal R, Geng X, Tucker G, Levine S (2022) Offline q-learning on diverse multi-task data both scales and generalizes. arXiv preprint arXiv:2211.15144

  48. Smith L, Kostrikov I, Levine S (2022) A walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning. arXiv preprint arXiv:2208.07860

  49. Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450

  50. Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  51. Yu T, Kumar A, Rafailov R, Rajeswaran A, Levine S, Finn C (2021) Combo: Conservative offline model-based policy optimization. Adv Neural Inf Process Syst 34:28954–28967


  52. Rigter M, Lacerda B, Hawes N (2022) Rambo-rl: Robust adversarial model-based offline reinforcement learning. Adv Neural Inf Process Syst 35:16082–16097



Funding

This study was funded by the National Key Research and Development Program of China (2021YFB2801900, 2021YFB2801901, 2021YFB2801902, 2021YFB2801904); the National Natural Science Foundation of China (No. 61974177, No. 61674119); the National Outstanding Youth Science Fund Project of National Natural Science Foundation of China (62022062); the Fundamental Research Funds for the Central Universities (QTZX23041).

Author information

Contributions

S.X. and W.L. helped in conceptualization, methodology, formal analysis; W.L. contributed to software; T.Z. and W.L. validated the study; W.L. and Y.H. (Yanan Han) curated the data; W.L. and S.X. helped in writing—original draft preparation; T.Z., Y.Z., and X.G. were involved in writing—review and editing; T.Z. visualized the study; S.X. and Y.H. (Yue Hao) helped in supervision and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Shuying Xiang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Limitations

See Tables 5, 6, 7, 8, 9, 10, 11, 12.


Overfitting. The authors of [46] pointed out that the best checkpoint encountered during training significantly outperforms the best performance typically reported in deep offline RL papers. This phenomenon may be attributed to a form of overfitting, as performance is often observed to degrade as the number of learning steps grows. JAQ-SAC experiences similar issues on some datasets; we specifically discuss the possible overfitting on the hopper-medium dataset in Appendix B, where introducing a learning-rate decay strategy partially addresses the issue. Understanding how to reduce overfitting and stabilize the learned optimal policy during training is an intriguing direction for offline RL. However, because test environments are dynamic, methodologies for studying overfitting properties in offline RL setups have yet to be developed.


The roughness of the classification performed by the binary classification network. While we emphasize that our binary classification network can acquire reasonably reliable prior knowledge, its classification may be rough in tasks that are highly sensitive to OOD data. For reward-based adjustment, this rough classification has a dual impact on algorithm performance. On the one hand, it may label as ID some data that is objectively OOD yet beneficial for finding better policies; in this case, the roughness can be interpreted as the generalization ability of the binary classification network. On the other hand, it may also misclassify as ID some OOD data that is not conducive to training, affecting the algorithm’s stability. The hyperparameter \(\rho\) helps, to some extent, to reduce this instability and achieve better performance.
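As an illustration, the sketch below shows one possible form of such an ID-versus-OOD classifier for (state, action) pairs. The actual architecture and training hyperparameters are those listed in Table 8; the layer sizes and the way negative (OOD) samples are drawn here are illustrative assumptions, not the exact configuration used in JAQ-SAC.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDClassifier(nn.Module):
    """Binary classifier scoring whether a (state, action) pair is in-distribution."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit > 0 means "in-distribution"
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def classifier_loss(clf, states, dataset_actions, ood_actions):
    """Dataset actions are labeled ID (1); actions drawn away from the dataset
    (e.g. sampled from the current policy or uniform noise) are labeled OOD (0)."""
    logits_id = clf(states, dataset_actions)
    logits_ood = clf(states, ood_actions)
    return (F.binary_cross_entropy_with_logits(logits_id, torch.ones_like(logits_id))
            + F.binary_cross_entropy_with_logits(logits_ood, torch.zeros_like(logits_ood)))
```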

For penalty-based adjustment, the ideal scenario would be to reduce the number of Q-ensemble members from N to 2, relying solely on the anti-exploration rewards provided by the penalty-based adjustment to penalize OOD data. However, the rough classification of the binary classification network does not allow us to uniformly reduce N to 2 on most datasets. Instead, we combine the basic uncertainty penalty with the prior-knowledge penalty from the binary classification network, so that OOD data is sufficiently penalized while the penalty strength applied to ID data is minimally affected. Although we cannot completely eliminate the Q-ensemble, we achieve similar performance with fewer ensemble members, indicating that our approach remains advantageous compared with the original algorithm.
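The sketch below illustrates, under our own assumptions, how such a combined penalty could enter the TD target: the minimum over a reduced Q-ensemble supplies the basic uncertainty penalty, and the classifier’s OOD probability, scaled by \(\rho\), supplies the prior-knowledge (anti-exploration) penalty. The exact weighting and sign conventions used in JAQ-SAC are those of Algorithm 1; this is only a minimal reading of the description above, with `clf` being the classifier from the previous sketch.

```python
import torch

@torch.no_grad()
def adjusted_td_target(q_ensemble, clf, reward, next_state, next_action,
                       next_log_prob, gamma=0.99, alpha=0.2, rho=1.0):
    # Basic uncertainty penalty: minimum over the (reduced) Q-ensemble.
    q_next = torch.stack([q(next_state, next_action) for q in q_ensemble])
    q_min = q_next.min(dim=0).values

    # Prior-knowledge penalty: the classifier's estimated probability that
    # (next_state, next_action) is out-of-distribution, scaled by rho.
    p_ood = 1.0 - torch.sigmoid(clf(next_state, next_action))
    anti_exploration = rho * p_ood

    # SAC-style entropy term plus the combined penalty.
    return reward + gamma * (q_min - alpha * next_log_prob - anti_exploration)
```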

Appendix B Tricks

During the experiments, we employed certain techniques for specific datasets to achieve stability or more competitive results. For instance, in the hopper-medium dataset, reducing the number of critics from 25 to 15 did not yield stable training results, even when maintaining the original layer normalization for the critics. To address this instability during the later stages of training, we introduced a learning rate decay mechanism. Table 9 provides the specific parameter settings for the learning rate decay. Through careful learning rate decay, we were able to ensure the algorithm’s stability in the later stages of training.
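As a concrete illustration, such a late-stage learning-rate decay can be implemented with a standard scheduler. The milestone and decay factor below are placeholders; the values actually used for hopper-medium are those listed in Table 9, and the critic here is a stand-in for the real network.

```python
import torch
import torch.nn as nn

# Placeholder critic; the real architecture follows the hyperparameters in Tables 7-8.
critic = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4)

# Keep the learning rate constant early on, then decay it in the later stage
# of training where instability was observed (milestone and gamma are placeholders).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[2_000_000], gamma=0.1)

# Inside the training loop, step the scheduler once per gradient update:
#   optimizer.step()
#   scheduler.step()
```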

Additionally, for the walker2d-expert dataset, we applied layer normalization to the critics (which was not used in LB-SAC for this task) and successfully reduced the number of critics from 30 to 10 through penalty-based adjustment. This outcome was achieved after several attempts, including (1) using the same number of critics (10) without layer normalization and (2) using more critics (15) without layer normalization.

Through hyperparameter tuning of \(\rho\), we obtained similar performance (around 107) in both cases: case 1 required larger \(\rho\) values for stability, and case 2 required more critics for stable performance. However, with 10 critics and layer normalization, we achieved higher performance (\(112.3\pm 0.3\)) while still penalizing OOD data sufficiently, because layer normalization allowed us to learn an appropriately conservative policy with fewer critics. Note that layer normalization was not applied to all datasets, as it can result in a more conservative policy, which is not desirable in every scenario.
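For reference, the sketch below shows a critic with layer normalization [49] inserted after each hidden layer, as applied here to walker2d-expert; the hidden width is illustrative rather than the exact value from the hyperparameter tables.

```python
import torch
import torch.nn as nn

class LayerNormCritic(nn.Module):
    """Q(s, a) critic with LayerNorm after each hidden layer (cf. [49])."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)
```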

Appendix C Detailed experiments setup

In our experiments, we used the hyperparameters listed in Table 5, with \(\rho\) being tuned within an appropriate range. Similar to [11, 47, 48], in some datasets, we added layer normalization [49] after each layer of the critic network to improve stability and convergence time. All experiments were conducted on RTX 3090 GPUs.

For Gym domain datasets, we used version v2. In the final experiments, we employed 4 different seeds and reported the final average normalized score over 10 evaluation episodes. When conducting hyperparameter tuning, we used fewer seeds for experimentation. Initially, we adopted a Normal Sample Scheme and reward-based adjustment to help the algorithm achieve higher performance. This approach was successful on the walker2d-medium and walker2d-medium-replay datasets. However, for other datasets, introducing reward-based adjustment resulted in unstable policy scores during training or did not yield significant performance improvements.
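For clarity, the evaluation protocol amounts to the sketch below: average the normalized score over 10 evaluation episodes, then average that quantity over the 4 training seeds. The `policy` callable is hypothetical; `get_normalized_score` is D4RL’s standard conversion from episode return to normalized score.

```python
import gym
import d4rl  # noqa: F401 -- registers the D4RL environments
import numpy as np

def evaluate(policy, env_name="walker2d-medium-v2", episodes=10):
    """Average normalized score over `episodes` evaluation episodes."""
    env = gym.make(env_name)
    scores = []
    for _ in range(episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ep_return += reward
        scores.append(env.get_normalized_score(ep_return) * 100.0)
    return float(np.mean(scores))

# The reported number is the mean of evaluate(...) over 4 training seeds.
```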

Subsequently, we switched to penalty-based adjustment, starting with a critic count of \(\min(2, N-5)\), where \(N\) is the number of critics used by LB-SAC. If detailed hyperparameter tuning could not find a suitable value of \(\rho\) within the current range, we increased the critic count and repeated the search until the best \(\rho\) was found. For tasks using reward-based adjustment, we searched for suitable values in the range \(\{0.25, 0.5, 0.75, 1.0\}\). For tasks using penalty-based adjustment, we ran LB-SAC with a single seed and recorded the mean squared error (MSE) metric during training. By comparing the MSE of JAQ-SAC with that of LB-SAC, we could dynamically adjust the value of \(\rho\). In general, a larger \(\rho\) resulted in a smaller converged MSE, indicating that the learned policy’s action values were closer to the action values in the dataset; this is because a larger \(\rho\) produces a greater anti-exploration reward for OOD data, stronger penalization, and a more conservative learned policy. The \(\rho\) hyperparameter was adjusted dynamically with a minimum step of 0.25 and a maximum value of 2.5, allowing each dataset to find an appropriate \(\rho\), as shown in Table 5.
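One plausible reading of this search procedure is sketched below. `train_jaq_sac` and `lb_sac_reference_mse` are hypothetical stand-ins for a full training run at a given \(\rho\) and for the recorded LB-SAC MSE, respectively; the stopping rule is an assumption rather than the exact criterion used in our experiments.

```python
def search_rho(train_jaq_sac, lb_sac_reference_mse,
               rho_min=0.25, rho_max=2.5, step=0.25, tol=0.05):
    """Increase rho in steps of 0.25 until the converged critic MSE of JAQ-SAC
    is comparable to the LB-SAC reference (a larger rho gives a stronger
    anti-exploration reward and hence a smaller converged MSE)."""
    rho = rho_min
    while rho <= rho_max:
        mse = train_jaq_sac(rho)  # converged MSE of one training run
        if mse <= lb_sac_reference_mse * (1.0 + tol):
            return rho
        rho += step
    return None  # no suitable rho in range: increase the critic count and retry
```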

Table 5 Environment-specific parameters for JAQ-SAC. The "method" parameter has two options, "p" (penalty) and "r" (reward), indicating whether the algorithm uses penalty-based or reward-based adjustment for that dataset
Table 6 Number of critics for each algorithm
Table 7 Hyperparameters shared by SAC-N, LB-SAC, EDAC, and JAQ-SAC
Table 8 Binary network architecture and training hyperparameters
Table 9 Training hyperparameters for SAC-N, EDAC, LB-SAC, and JAQ-SAC

Appendix D Additional results

Table 10 Evaluation of Q-ensemble methods in the high-dimensional state and action space (total of 119 dimensions) for the ant-medium task
Table 11 Normalized scores of our method compared with other offline reinforcement learning algorithms

We also conducted experiments in the Maze2D environment. Maze2D is a navigation task in which the offline RL algorithm is expected to stitch together suboptimal trajectories to find the shortest path to the goal; it includes umaze, medium, and large environments, as shown in Fig. 11. The three environments maze2d-umaze-v1, maze2d-medium-v1, and maze2d-large-v1 use a sparse reward, which has a value of 1.0 when the agent (light green ball) is within a 0.5-unit radius of the target (light red ball).
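Concretely, the sparse reward described above has the following form; this is an illustrative restatement of the D4RL maze2d reward, not code taken from the benchmark itself.

```python
import numpy as np

def maze2d_sparse_reward(agent_xy: np.ndarray, target_xy: np.ndarray) -> float:
    """1.0 when the agent is within a 0.5-unit radius of the target, else 0.0."""
    return 1.0 if np.linalg.norm(agent_xy - target_xy) <= 0.5 else 0.0
```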

Fig. 11

Illustrations of three different difficulty maze environments. "umaze" denotes a simple and unstructured maze environment, where the agent needs to learn navigation skills, avoid obstacles, and reach the goal. "medium" represents a moderately sized maze environment, potentially featuring more obstacles or larger spatial dimensions compared to "umaze." Navigating such environments may require the agent to employ more sophisticated strategies for effective task completion. "large" signifies a larger-scale maze environment, typically characterized by increased complexity with more obstacles, branching paths, or larger spatial extents. In such environments, the agent needs enhanced learning capabilities to successfully address navigation challenges

Table 12 Results of the Maze2D experiments

Appendix E Convergence time

See Table 13.

Table 13 Convergence time (in minutes) for D4RL datasets using penalty-based adjustment

Appendix F Hyperparameter sensitivity

See Table 14.

Table 14 Impact of the hyperparameter \(\rho\) on algorithm performance (normalized score) in the Walker2d dataset

Appendix G Pseudocode

Algorithm 1

Judgmentally adjusted Q-value estimation for SAC (JAQ-SAC)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, W., Xiang, S., Zhang, T. et al. Judgmentally adjusted Q-values based on Q-ensemble for offline reinforcement learning. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09839-z
