Abstract
Software testing activities scrutinize the artifacts and behavior of a software product to find defects and to ensure that the product meets its requirements. Although various software testing approaches have proven effective at revealing defects, many of them are not automated or only partly automated, which increases testing time, the manpower needed, and overall software testing costs. Recently, Deep Reinforcement Learning (DRL) has been successfully employed in complex testing tasks such as game testing, regression testing, and test case prioritization to automate the process and provide continuous adaptation. Practitioners can employ DRL either by implementing a DRL algorithm from scratch or by using a DRL framework. DRL frameworks offer well-maintained implementations of state-of-the-art DRL algorithms that facilitate and speed up the development of DRL applications. Developers have widely used these frameworks to solve problems in various domains, including software testing. However, to the best of our knowledge, no study has empirically evaluated the effectiveness and performance of the algorithms implemented in DRL frameworks, and the literature lacks guidelines that would help practitioners choose one DRL framework over another. In this paper, we therefore empirically investigate the application of carefully selected DRL algorithms (chosen based on the characteristics of the algorithms and environments) to two important software testing tasks: test case prioritization in the context of Continuous Integration (CI) and game testing. For the game testing task, we conduct experiments on a simple game and use DRL algorithms to explore the game and detect bugs. Results show that some of the selected DRL frameworks, such as Tensorforce, outperform recent approaches in the literature.
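To make the game-testing setting concrete, the sketch below is a deliberately minimal, hypothetical stand-in (not the paper's actual environment or any DRL framework): an agent walks a small grid-world "game" whose map contains one injected bug state, and exploration continues until the bug is triggered. A trained DRL policy would replace the random action choice used here; the grid size, bug location, and `explore` helper are all illustrative assumptions.

```python
import random

# Hypothetical toy "game": a 5x5 grid with one injected bug cell.
# A game-testing agent succeeds when its exploration reaches the bug
# state; DRL approaches learn a policy that reaches such states faster
# than the purely random walk sketched here.
random.seed(1)

SIZE = 5
BUG = (4, 4)                      # assumed location of the injected bug
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def explore(max_steps=10_000):
    """Random-walk exploration; returns (bug_found, steps_taken)."""
    pos = (0, 0)
    for step in range(1, max_steps + 1):
        dx, dy = random.choice(MOVES)
        # Clamp the move to the grid boundaries.
        pos = (min(SIZE - 1, max(0, pos[0] + dx)),
               min(SIZE - 1, max(0, pos[1] + dy)))
        if pos == BUG:            # the injected bug triggers on contact
            return True, step
    return False, max_steps

found, steps = explore()
print(found, steps)
```

The number of steps the baseline needs to trigger the bug is the kind of metric on which DRL-based exploration is compared against random testing.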
To prioritize test cases, we run extensive experiments in a CI environment where DRL algorithms from different frameworks are used to rank the test cases. We find cases where our DRL configurations outperform the baseline implementation. Our results also show that the performance difference between implemented algorithms can be considerable, motivating further investigation. Moreover, we recommend that researchers empirically evaluate DRL frameworks on benchmark problems before selecting one, to make sure that the DRL algorithms perform as intended.
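The prioritization task can be sketched with a minimal, bandit-style tabular learner in place of the deep algorithms the paper evaluates; everything here (the six-test suite, the fail-prone set, the reward shape) is an illustrative assumption, not the paper's environment. The agent learns to schedule historically failing tests early, which mirrors the reward signal used in RL-based test case prioritization.

```python
import random

# Toy CI environment (hypothetical): two of six tests are fail-prone.
# Scheduling a fail-prone test earlier earns a larger reward, mimicking
# the early-fault-detection objective of test case prioritization.
random.seed(0)

N_TESTS = 6
FAIL_PRONE = {0, 3}               # indices of historically failing tests

def reward(position, test_id):
    # Larger reward the earlier a fail-prone test is scheduled.
    return float(N_TESTS - position) if test_id in FAIL_PRONE else 0.0

# Tabular value estimates over (position, test) pairs -- a deliberately
# simple stand-in for the DQN/PPO-style algorithms the frameworks ship.
Q = [[0.0] * N_TESTS for _ in range(N_TESTS)]
ALPHA, EPSILON = 0.1, 0.2

for _ in range(2000):
    remaining = list(range(N_TESTS))
    for position in range(N_TESTS):
        if random.random() < EPSILON:          # explore
            test_id = random.choice(remaining)
        else:                                   # exploit current estimates
            test_id = max(remaining, key=lambda t: Q[position][t])
        # Incremental average toward the observed reward.
        Q[position][test_id] += ALPHA * (reward(position, test_id)
                                         - Q[position][test_id])
        remaining.remove(test_id)

# Greedy rollout: the learned policy should rank fail-prone tests first.
remaining = list(range(N_TESTS))
ranking = []
for position in range(N_TESTS):
    test_id = max(remaining, key=lambda t: Q[position][t])
    ranking.append(test_id)
    remaining.remove(test_id)
print(ranking)
```

After training, the two fail-prone tests occupy the first two positions of the ranking, which is the behavior a CI prioritizer is rewarded for.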
Data Availability
The source code of our implementation and the results of our experiments are publicly available in a replication package (Replication package 2022): https://github.com/npaulinastevia/DRL_se
References
Cartpole (2016). https://gym.openai.com/envs/CartPole-v0/
Mspacman (2018). https://gym.openai.com/envs/MsPacman-v0/
Replication package (2022). https://github.com/npaulinastevia/DRL_se
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jozefowicz R, Jia Y, Kaiser L, Kudlur M, Levenberg J, Mané D, Schuster M, Monga R, Moore S, Murray D, Olah C, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. https://doi.org/10.5281/zenodo.4724125
Adamo D, Khan MK, Koppula S, Bryce R (2018) Reinforcement learning for android gui testing. In: Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation, pp 2–8
Alshahwan N, Gao X, Harman M, Jia Y, Mao K, Mols A, Tei T, Zorin I (2018) Deploying search based software engineering with sapienz at facebook. In: International Symposium on Search Based Software Engineering, Springer, pp 3–45
Arcuri A, Briand L (2014) A hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Softw Test Verification Reliab 24(3):219–250
Bagherzadeh M, Kahani N, Briand L (2021) Reinforcement learning for test case prioritization. IEEE Transactions on Software Engineering
Bahrpeyma F, Haghighi H, Zakerolhosseini A (2015) An adaptive rl based approach for dynamic resource provisioning in cloud virtualized data centers. Computing 97(12):1209–1234
Bergdahl J, Gordillo C, Tollmar K, Gisslén L (2020) Augmenting automated game testing with deep reinforcement learning. In: 2020 IEEE Conference on Games (CoG), IEEE, pp 600–603
Bertolino A, Guerriero A, Miranda B, Pietrantuono R, Russo S (2020) Learning-to-rank vs ranking-to-learn: strategies for regression testing in continuous integration. In: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp 1–12
Böttinger K, Godefroid P, Singh R (2018) Deep reinforcement fuzzing. In: 2018 IEEE Security and Privacy Workshops (SPW), IEEE, pp 116–122
Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) Openai gym. arXiv:1606.01540
Castro PS, Moitra S, Gelada C, Kumar S, Bellemare MG (2018) Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110
Chen J, Ma H, Zhang L (2020) Enhanced compiler bug isolation via memoized search. In: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pp 78–89
Dai H, Li Y, Wang C, Singh R, Huang PS, Kohli P (2019) Learning transferable graph exploration. Advances in Neural Information Processing Systems 32
Dhariwal P, Hesse C, Klimov O, Nichol A, Plappert M, Radford A, Schulman J, Sidor S, Wu Y, Zhokhov P (2017) Openai baselines. https://github.com/openai/baselines
Drozd W, Wagner MD (2018) Fuzzergym: A competitive framework for fuzzing and learning. arXiv preprint arXiv:1807.07490
Dulac-Arnold G, Mankowitz D, Hester T (2019) Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901
Fortunato M, Azar MG, Piot B, Menick J, Osband I, Graves A, Mnih V, Munos R, Hassabis D, Pietquin O, et al. (2017) Noisy networks for exploration. arXiv preprint arXiv:1706.10295
Fraser G, Arcuri A (2011) Evosuite: automatic test suite generation for object-oriented software. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pp 416–419
Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, PMLR, pp 1587–1596
Games PA, Howell JF (1976) Pairwise multiple comparison procedures with unequal n’s and/or variances: a monte carlo study. J Educ Stat 1(2):113–125
Gu S, Lillicrap T, Sutskever I, Levine S (2016) Continuous deep q-learning with model-based acceleration. In: International conference on machine learning, PMLR, pp 2829–2838
Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870
Hamlet R (1994) Random testing. In: Marciniak J (ed) Encyclopedia of software engineering. Wiley, New York, pp 970–978
Harman M, Jia Y, Zhang Y (2015) Achievements, open problems and challenges for search based software testing. In: 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST), IEEE, pp 1–12
Hill A, Raffin A, Ernestus M, Gleave A, Kanervisto A, Traore R, Dhariwal P, Hesse C, Klimov O, Nichol A, Plappert M, Radford A, Schulman J, Sidor S, Wu Y (2018) Stable baselines. https://github.com/hill-a/stable-baselines
Kim J, Kwon M, Yoo S (2018) Generating test input with deep reinforcement learning. In: 2018 IEEE/ACM 11th International Workshop on Search-Based Software Testing (SBST), IEEE, pp 51–58
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Knuth DE (1997) The art of computer programming, vol 3. Pearson Education
Koroglu Y, Sen A, Muslu O, Mete Y, Ulker C, Tanriverdi T, Donmez Y (2018) QBE: Q-learning-based exploration of android applications. In: 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), IEEE, pp 105–115
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
Malialis K, Devlin S, Kudenko D (2015) Distributed reinforcement learning for adaptive and robust network intrusion response. Connect Sci 27(3):234–252
McGraw KO, Wong SP (1992) A common language effect size statistic. Psychol Bull 111(2):361
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937
Moghadam MH, Saadatmand M, Borg M, Bohlin M, Lisper B (2021) An autonomous performance testing framework using self-adaptive fuzzy reinforcement learning. Softw Qual J 1–33
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) Pytorch: An imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R (eds) Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Plappert M (2016) keras-rl. https://github.com/keras-rl/keras-rl
Raffin A, Hill A, Gleave A, Kanervisto A, Ernestus M, Dormann N (2021) Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research 22(268):1–8. http://jmlr.org/papers/v22/20-1364.html
Reichstaller A, Knapp A (2018) Risk-based testing of self-adaptive systems using run-time predictions. In: 2018 IEEE 12th international conference on self-adaptive and self-organizing systems (SASO), IEEE, pp 80–89
Romdhana A, Merlo A, Ceccato M, Tonella P (2022) Deep reinforcement learning for black-box testing of android apps. ACM Transactions on Software Engineering and Methodology
Santos RES, Magalhães CVC, Capretz LF, Correia-Neto JS, da Silva FQB, Saher A (2018) Computer games are serious business and so is their quality: Particularities of software testing in game development from the perspective of practitioners. In: Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Association for Computing Machinery, New York, NY, USA, ESEM ’18. https://doi.org/10.1145/3239235.3268923
Schaarschmidt M, Kuhnle A, Ellis B, Fricke K, Gessert F, Yoneki E (2018) Lift: Reinforcement learning in computer systems by learning from demonstrations. arXiv preprint arXiv:1808.07903
Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. In: International conference on machine learning, PMLR, pp 1889–1897
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Singh L, Sharma DK (2013) An architecture for extracting information from hidden web databases using intelligent agent technology through reinforcement learning. In: 2013 IEEE conference on Information & Communication Technologies, IEEE, pp 292–297
Soualhia M, Khomh F, Tahar S (2020) A dynamic and failure-aware task scheduling framework for hadoop. IEEE Trans Cloud Comput 8(2):553–569. https://doi.org/10.1109/TCC.2018.2805812
Spieker H, Gotlieb A, Marijan D, Mossige M (2017) Reinforcement learning for automatic test case prioritization and selection in continuous integration. In: Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp 12–22
Such FP, Madhavan V, Conti E, Lehman J, Stanley KO, Clune J (2017) Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567
Sutton RS, Barto AG (1998) Introduction to reinforcement learning, vol 135. MIT Press, Cambridge
Tufano R, Scalabrino S, Pascarella L, Aghajani E, Oliveto R, Bavota G (2022) Using reinforcement learning for load testing of video games. In: Proceedings of the 44th International Conference on Software Engineering, pp 2303–2314
Vuong TAT, Takada S (2018) A reinforcement learning based approach to automated testing of android applications. In: Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation, pp 31–37
Wang Z, Schaul T, Hessel M, Hasselt H, Lanctot M, Freitas N (2016) Dueling network architectures for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1995–2003
Welch BL (1947) The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34(1–2):28–35
Yang T, Meng Z, Hao J, Zhang C, Zheng Y, Zheng Z (2018) Towards efficient detection and optimal response against sophisticated opponents. arXiv preprint arXiv:1809.04240
Yang T, Hao J, Meng Z, Zheng Y, Zhang C, Zheng Z (2019) Bayes-tomop: A fast detection and best response algorithm towards sophisticated opponents. In: AAMAS, pp 2282–2284
Zhang C, Zhang Y, Shi X, Almpanidis G, Fan G, Shen X (2019) On incremental learning for gradient boosting decision trees. Neural Process Lett 50(1):957–987
Zheng Y, Xie X, Su T, Ma L, Hao J, Meng Z, Liu Y, Shen R, Chen Y, Fan C (2019) Wuji: Automatic online combat game testing using evolutionary deep reinforcement learning. In: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, pp 772–784
Zhu H, Hall PA, May JH (1997) Software unit test coverage and adequacy. ACM Computing Surveys (CSUR) 29(4):366–427
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by: Paolo Tonella.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nouwou Mindom, P.S., Nikanjam, A. & Khomh, F. A comparison of reinforcement learning frameworks for software testing tasks. Empir Software Eng 28, 111 (2023). https://doi.org/10.1007/s10664-023-10363-2