
GBGallery: A benchmark and framework for game testing

Published in: Empirical Software Engineering

Abstract

Software bug databases and benchmarks are the wheels that advance automated software testing. In practice, real bugs occur sparsely relative to the amount of software code, so extracting and curating them is labor-intensive; yet such curation can be essential for fostering innovation in testing techniques. Over the past decade, several milestone bug databases have been constructed, pushing forward automated software testing research. To date, however, there is still no real-bug database and benchmark for game software, leaving current game testing research largely stagnant. The absence of such a database and framework greatly limits the development of automated game testing techniques. To bridge this gap, we first perform large-scale real-bug collection and manual analysis on 5 large commercial games, totaling more than 250,000 lines of code. Based on this, we propose GBGallery, a game bug database and extensible framework, to enable automated game testing research. In its initial version, GBGallery contains 76 real bugs from 5 games and incorporates 5 state-of-the-art testing techniques for comparative study, serving as a baseline for further research. With GBGallery, we perform large-scale empirical studies and find that automated game testing is still at an early stage, and that new testing techniques for game software should be extensively investigated. We make GBGallery publicly available, hoping to facilitate game testing research.
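The comparative study described above boils down to a common benchmark pattern: given a set of known real bugs and, for each testing technique, the subset of bugs it exposed, rank the techniques by bug-detection rate. The sketch below illustrates that pattern only; the function and variable names are hypothetical and do not reflect GBGallery's actual API.

```python
# Hypothetical sketch of a bug-benchmark comparison: rank testing
# techniques by the fraction of known bugs each one exposed.
# All names here are illustrative, not GBGallery's actual interface.

def detection_rate(known_bugs, exposed_bugs):
    """Fraction of the known bugs that a technique exposed."""
    known = set(known_bugs)
    if not known:
        return 0.0
    return len(known & set(exposed_bugs)) / len(known)

def rank_techniques(known_bugs, results):
    """results maps technique name -> iterable of exposed bug ids.

    Returns (name, rate) pairs sorted from best to worst detector.
    """
    rates = {name: detection_rate(known_bugs, bugs)
             for name, bugs in results.items()}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    known = ["B1", "B2", "B3", "B4"]          # curated real bugs
    results = {
        "random_testing": ["B1"],             # exposed 1 of 4
        "rl_based": ["B1", "B3", "B4"],       # exposed 3 of 4
    }
    for name, rate in rank_techniques(known, results):
        print(f"{name}: {rate:.2f}")
```

Keeping the bug ids as opaque strings keeps the harness agnostic to how bugs are triggered, which is what lets heterogeneous techniques be compared on equal footing.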


Notes

  1. The full versions of the games are not provided due to permission restrictions.


Acknowledgments

This work was supported in part by funding from the Canada First Research Excellence Fund as part of the University of Alberta’s Future Energy Systems research initiative, Canada CIFAR AI Chairs Program, Amii RAP program, the Natural Sciences and Engineering Research Council of Canada (NSERC No.RGPIN-2021-02549, No.RGPAS-2021-00034, No.DGECR-2021-00019), the Ministry of Education, Singapore under its Academic Research Fund Tier 1 (21-SIS-SMU-033), as well as JSPS KAKENHI Grant No.JP20H04168, No.JP21H04877, JST-Mirai Program Grant No.JPMJMI20B8, and JST SPRING Grant.

Author information

Corresponding authors

Correspondence to Xiaofei Xie or Yingfeng Chen.

Additional information

Communicated by: Shaowei Wang, Tse-Hsun (Peter) Chen, Sebastian Baltes, Ivano Malavolta, Christoph Treude and Alexander Serebrenik

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Collective Knowledge in Software Engineering


Cite this article

Li, Z., Wu, Y., Ma, L. et al. GBGallery: A benchmark and framework for game testing. Empir Software Eng 27, 140 (2022). https://doi.org/10.1007/s10664-022-10158-x

