A comparison of reinforcement learning frameworks for software testing tasks

Abstract

Software testing activities scrutinize the artifacts and behavior of a software product to find possible defects and to ensure that the product meets its expected requirements. Although various software testing approaches have proven very promising in revealing defects, some of them lack automation or are only partly automated, which increases testing time, the manpower needed, and overall software testing costs. Recently, Deep Reinforcement Learning (DRL) has been successfully employed in complex testing tasks, such as game testing, regression testing, and test case prioritization, to automate the process and provide continuous adaptation. Practitioners can employ DRL either by implementing a DRL algorithm from scratch or by using a DRL framework. DRL frameworks offer well-maintained implementations of state-of-the-art DRL algorithms, which facilitates and speeds up the development of DRL applications. Developers have widely used these frameworks to solve problems in various domains, including software testing. However, to the best of our knowledge, no study has empirically evaluated the effectiveness and performance of the algorithms implemented in DRL frameworks, and the literature lacks guidelines that would help practitioners choose one DRL framework over another. In this paper, we therefore empirically investigate the application of carefully selected DRL algorithms (chosen based on the characteristics of the algorithms and environments) to two important software testing tasks: test case prioritization in the context of Continuous Integration (CI) and game testing. For the game testing task, we conduct experiments on a simple game and use DRL algorithms to explore the game and detect bugs. Results show that some of the selected DRL frameworks, such as Tensorforce, outperform recent approaches from the literature. For test case prioritization, we run extensive experiments in a CI environment where DRL algorithms from different frameworks are used to rank the test cases, and we find cases where our DRL configurations outperform the baseline implementation. Our results show that the performance difference between implemented algorithms can be considerable, motivating further investigation. Moreover, we recommend that researchers looking to select a DRL framework first evaluate it empirically on some benchmark problems, to make sure that its DRL algorithms perform as intended.
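
As background for how a DRL framework is used in practice, the sketch below illustrates the usage pattern our comparison rests on: a practitioner picks a framework-provided algorithm and trains it on an environment, instead of implementing the algorithm from scratch. This is a minimal sketch, not our experimental code; it assumes Stable-Baselines3 1.x (one of the compared frameworks) with the classic Gym API and the CartPole benchmark cited in the references, and its hyperparameters are illustrative only.

```python
# Minimal sketch (not the paper's experimental code): train a framework-provided
# PPO agent on the CartPole benchmark, assuming Stable-Baselines3 1.x + classic Gym.
import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v0")             # benchmark environment cited in the references
model = PPO("MlpPolicy", env, verbose=0)  # PPO implementation supplied by the framework
model.learn(total_timesteps=10_000)       # train under a fixed interaction budget

# Roll out the learned policy for one episode
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _ = env.step(action)
    total_reward += reward
print(f"Episode reward: {total_reward}")
```

The CI prioritization task fits the same pattern once it is cast as a sequential decision problem in which an agent scores pending test cases and higher-scored cases run earlier. The environment below is a hypothetical sketch in the spirit of the baselines cited in the references (Spieker et al. 2017; Bagherzadeh et al. 2021); the feature layout and reward shaping are illustrative assumptions, not the implementation from our replication package.

```python
# Hypothetical sketch: test case prioritization framed as a Gym environment. The
# agent emits a priority score per test case; scoring historically failing tests
# high is rewarded. Features and reward shaping are illustrative assumptions.
import numpy as np
import gym
from gym import spaces

class PrioritizationEnv(gym.Env):
    """One pending test case is scored per step; higher scores run earlier in CI."""

    def __init__(self, features, failed):
        super().__init__()
        self.features = np.asarray(features, dtype=np.float32)  # e.g. duration, age, failure history
        self.failed = np.asarray(failed, dtype=bool)            # past verdicts, used only for reward
        self.observation_space = spaces.Box(
            -np.inf, np.inf, shape=(self.features.shape[1],), dtype=np.float32)
        self.action_space = spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)  # priority score
        self.i = 0

    def reset(self):
        self.i = 0
        return self.features[self.i]

    def step(self, action):
        score = float(action[0])
        # Reward high scores on tests that failed before, low scores otherwise
        reward = score if self.failed[self.i] else 1.0 - score
        self.i += 1
        done = self.i >= len(self.features)
        obs = self.features[min(self.i, len(self.features) - 1)]
        return obs, reward, done, {}
```

Any continuous-action agent from the compared frameworks (e.g., DDPG, TD3, or SAC, all cited in the references) can then be trained on such an environment exactly as in the first sketch.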

Data Availability

The source code of our implementation and the results of our experiments are publicly available in our replication package (Replication package 2022): https://github.com/npaulinastevia/DRL_se

Notes

  1. https://www.techrepublic.com/article/report-software-failure-caused-1-7-trillion-in-financial-losses-in-2017/

  2. https://github.com/google/gin-config

  3. https://www.open-mpi.org

  4. https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart

  5. https://coverage.readthedocs.io

  6. https://stable-baselines3.readthedocs.io/en/master/guide/migration.html

References

  • CartPole (2016). https://gym.openai.com/envs/CartPole-v0/

  • MsPacman (2018). https://gym.openai.com/envs/MsPacman-v0/

  • Replication package (2022). https://github.com/npaulinastevia/DRL_se

  • Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jozefowicz R, Jia Y, Kaiser L, Kudlur M, Levenberg J, Mané D, Schuster M, Monga R, Moore S, Murray D, Olah C, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: large-scale machine learning on heterogeneous systems. https://doi.org/10.5281/zenodo.4724125

  • Adamo D, Khan MK, Koppula S, Bryce R (2018) Reinforcement learning for Android GUI testing. In: Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation, pp 2–8

  • Alshahwan N, Gao X, Harman M, Jia Y, Mao K, Mols A, Tei T, Zorin I (2018) Deploying search based software engineering with Sapienz at Facebook. In: International Symposium on Search Based Software Engineering, Springer, pp 3–45

  • Arcuri A, Briand L (2014) A hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Softw Test Verification Reliab 24(3):219–250

  • Bagherzadeh M, Kahani N, Briand L (2021) Reinforcement learning for test case prioritization. IEEE Transactions on Software Engineering

  • Bahrpeyma F, Haghighi H, Zakerolhosseini A (2015) An adaptive rl based approach for dynamic resource provisioning in cloud virtualized data centers. Computing 97(12):1209–1234

  • Bergdahl J, Gordillo C, Tollmar K, Gisslén L (2020) Augmenting automated game testing with deep reinforcement learning. In: 2020 IEEE Conference on Games (CoG), IEEE, pp 600–603

  • Bertolino A, Guerriero A, Miranda B, Pietrantuono R, Russo S (2020) Learning-to-rank vs ranking-to-learn: strategies for regression testing in continuous integration. In: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp 1–12

  • Böttinger K, Godefroid P, Singh R (2018) Deep reinforcement fuzzing. In: 2018 IEEE Security and Privacy Workshops (SPW), IEEE, pp 116–122

  • Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016a) OpenAI Gym. arXiv:1606.01540

  • Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016b) OpenAI Gym. arXiv:1606.01540

  • Castro PS, Moitra S, Gelada C, Kumar S, Bellemare MG (2018) Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110

  • Chen J, Ma H, Zhang L (2020) Enhanced compiler bug isolation via memoized search. In: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pp 78–89

  • Dai H, Li Y, Wang C, Singh R, Huang PS, Kohli P (2019) Learning transferable graph exploration. Advances in Neural Information Processing Systems 32

  • Dhariwal P, Hesse C, Klimov O, Nichol A, Plappert M, Radford A, Schulman J, Sidor S, Wu Y, Zhokhov P (2017) Openai baselines. https://github.com/openai/baselines

  • Drozd W, Wagner MD (2018) FuzzerGym: a competitive framework for fuzzing and learning. arXiv preprint arXiv:1807.07490

  • Dulac-Arnold G, Mankowitz D, Hester T (2019) Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901

  • Fortunato M, Azar MG, Piot B, Menick J, Osband I, Graves A, Mnih V, Munos R, Hassabis D, Pietquin O, et al. (2017) Noisy networks for exploration. arXiv preprint arXiv:1706.10295

  • Fraser G, Arcuri A (2011) EvoSuite: automatic test suite generation for object-oriented software. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pp 416–419

  • Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, PMLR, pp 1587–1596

  • Games PA, Howell JF (1976) Pairwise multiple comparison procedures with unequal n’s and/or variances: a Monte Carlo study. J Educ Stat 1(2):113–125

  • Gu S, Lillicrap T, Sutskever I, Levine S (2016) Continuous deep Q-learning with model-based acceleration. In: International conference on machine learning, PMLR, pp 2829–2838

  • Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870

  • Hamlet R, Marciniak J (1994) Random testing. In: Encyclopedia of software engineering. Wiley, New York, pp 970–978

  • Harman M, Jia Y, Zhang Y (2015) Achievements, open problems and challenges for search based software testing. In: 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST), IEEE, pp 1–12

  • Hill A, Raffin A, Ernestus M, Gleave A, Kanervisto A, Traore R, Dhariwal P, Hesse C, Klimov O, Nichol A, Plappert M, Radford A, Schulman J, Sidor S, Wu Y (2018) Stable baselines. https://github.com/hill-a/stable-baselines

  • Hill A, Raffin A, Ernestus M, Gleave A, Kanervisto A, Traore R, Dhariwal P, Hesse C, Klimov O, Nichol A et al (2019) Stable baselines. https://github.com/hill-a/stable-baselines

  • Kim J, Kwon M, Yoo S (2018) Generating test input with deep reinforcement learning. In: 2018 IEEE/ACM 11th International Workshop on Search-Based Software Testing (SBST), IEEE, pp 51–58

  • Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

  • Knuth DE (1997) The art of computer programming, vol 3. Pearson Education

  • Koroglu Y, Sen A, Muslu O, Mete Y, Ulker C, Tanriverdi T, Donmez Y (2018) QBE: Q-learning-based exploration of Android applications. In: 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), IEEE, pp 105–115

  • Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971

  • Malialis K, Devlin S, Kudenko D (2015) Distributed reinforcement learning for adaptive and robust network intrusion response. Connect Sci 27(3):234–252

  • McGraw KO, Wong SP (1992) A common language effect size statistic. Psychol Bull 111(2):361

  • Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602

  • Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937

  • Moghadam MH, Saadatmand M, Borg M, Bohlin M, Lisper B (2021) An autonomous performance testing framework using self-adaptive fuzzy reinforcement learning. Softw Qual J 1–33

  • Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) Pytorch: An imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R (eds) Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

  • Plappert M (2016) keras-rl. https://github.com/keras-rl/keras-rl

  • Raffin A, Hill A, Gleave A, Kanervisto A, Ernestus M, Dormann N (2021) Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research 22(268):1–8. http://jmlr.org/papers/v22/20-1364.html

  • Reichstaller A, Knapp A (2018) Risk-based testing of self-adaptive systems using run-time predictions. In: 2018 IEEE 12th international conference on self-adaptive and self-organizing systems (SASO), IEEE, pp 80–89

  • Romdhana A, Merlo A, Ceccato M, Tonella P (2022) Deep reinforcement learning for black-box testing of Android apps. ACM Transactions on Software Engineering and Methodology

  • Santos RES, Magalhães CVC, Capretz LF, Correia-Neto JS, da Silva FQB, Saher A (2018) Computer games are serious business and so is their quality: Particularities of software testing in game development from the perspective of practitioners. In: Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Association for Computing Machinery, New York, NY, USA, ESEM ’18. https://doi.org/10.1145/3239235.3268923

  • Schaarschmidt M, Kuhnle A, Ellis B, Fricke K, Gessert F, Yoneki E (2018) Lift: Reinforcement learning in computer systems by learning from demonstrations. arXiv preprint arXiv:1808.07903

  • Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. In: International conference on machine learning, PMLR, pp 1889–1897

  • Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

  • Singh L, Sharma DK (2013) An architecture for extracting information from hidden web databases using intelligent agent technology through reinforcement learning. In: 2013 IEEE conference on Information & Communication Technologies, IEEE, pp 292–297

  • Soualhia M, Khomh F, Tahar S (2020) A dynamic and failure-aware task scheduling framework for hadoop. IEEE Trans Cloud Comput 8(2):553–569. https://doi.org/10.1109/TCC.2018.2805812

  • Spieker H, Gotlieb A, Marijan D, Mossige M (2017) Reinforcement learning for automatic test case prioritization and selection in continuous integration. In: Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp 12–22

  • Such FP, Madhavan V, Conti E, Lehman J, Stanley KO, Clune J (2017) Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567

  • Sutton RS, Barto AG et al (1998) Introduction to reinforcement learning, vol 135. MIT Press, Cambridge

  • Tufano R, Scalabrino S, Pascarella L, Aghajani E, Oliveto R, Bavota G (2022) Using reinforcement learning for load testing of video games. In: Proceedings of the 44th International Conference on Software Engineering, pp 2303–2314

  • Vuong TAT, Takada S (2018) A reinforcement learning based approach to automated testing of Android applications. In: Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation, pp 31–37

  • Wang Z, Schaul T, Hessel M, Hasselt H, Lanctot M, Freitas N (2016) Dueling network architectures for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1995–2003

  • Welch BL (1947) The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34(1–2):28–35

  • Yang T, Meng Z, Hao J, Zhang C, Zheng Y, Zheng Z (2018) Towards efficient detection and optimal response against sophisticated opponents. arXiv preprint arXiv:1809.04240

  • Yang T, Hao J, Meng Z, Zheng Y, Zhang C, Zheng Z (2019) Bayes-tomop: A fast detection and best response algorithm towards sophisticated opponents. In: AAMAS, pp 2282–2284

  • Zhang C, Zhang Y, Shi X, Almpanidis G, Fan G, Shen X (2019) On incremental learning for gradient boosting decision trees. Neural Process Lett 50(1):957–987

  • Zheng Y, Xie X, Su T, Ma L, Hao J, Meng Z, Liu Y, Shen R, Chen Y, Fan C (2019) Wuji: Automatic online combat game testing using evolutionary deep reinforcement learning. In: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, pp 772–784

  • Zhu H, Hall PA, May JH (1997) Software unit test coverage and adequacy. ACM Computing Surveys (CSUR) 29(4):366–427

Author information

Corresponding author

Correspondence to Paulina Stevia Nouwou Mindom.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Paolo Tonella.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nouwou Mindom, P.S., Nikanjam, A. & Khomh, F. A comparison of reinforcement learning frameworks for software testing tasks. Empir Software Eng 28, 111 (2023). https://doi.org/10.1007/s10664-023-10363-2
