Operationalizing validity of empirical software engineering studies

Empirical Software Engineering

Abstract

Empirical software engineering studies apply methods such as linear regression, statistical tests, or correlation analysis to better understand software engineering scenarios. Assuring the validity of such methods and their results is challenging but critical. This is reflected in the validity-related quality criteria that are part of the reviewing process for such research. However, these criteria are often hard to define operationally and thus hard for reviewers to judge. In this paper, we describe a new strategy to define and communicate the validity of methods and results. We conceptually decompose a study into an empirical scenario, the method used, and the results produced. Validity can only be described as a relationship between these three parts. To make the empirical scenario fully operational, we convert informal assumptions about it into executable simulation code that uses artificial data to replace (or complement) the real data. We can then run the method on the artificial data and examine the impact of our assumptions on the quality of the results. This may operationally i) support the validity of a method if it produces a valid result, ii) threaten the validity of a method if it produces an invalid result under controversial assumptions, or iii) invalidate a method if it produces an invalid result under plausible assumptions. We encourage researchers to submit simulations as additional artifacts to the reviewing process to make such statements explicit. Judging whether a simulated scenario is plausible or controversial is subjective and may benefit from involving a reviewer. We show that existing empirical software engineering studies can benefit from such additional validation artifacts.
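
The idea of turning assumptions into simulation code can be illustrated with a minimal sketch. It is not taken from the paper's artifacts (the notes refer to R; Python's standard library is used here instead), and the scenario, the variable names (`size`, `devs`, `defects`), and the coefficients are all hypothetical. The assumed scenario is that project size drives both the number of developers and the number of defects, while developers have no effect of their own. The sketch then checks whether a naive regression and a size-adjusted regression recover that known effect of zero.

```python
import random


def ols_slope(x, y):
    """Slope b of the least-squares line y ~ a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after regressing it on x (with intercept)."""
    n = len(x)
    b = ols_slope(x, y)
    a = sum(y) / n - b * sum(x) / n
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def simulate(n, true_effect, rng):
    """Artificial data for the assumed scenario (all values hypothetical).

    size    -> confounder, standard normal
    devs    -> depends on size plus noise
    defects -> depends on size, noise, and devs via `true_effect`
    """
    size = [rng.gauss(0, 1) for _ in range(n)]
    devs = [s + rng.gauss(0, 1) for s in size]
    defects = [
        true_effect * d + 2.0 * s + rng.gauss(0, 1)
        for d, s in zip(devs, size)
    ]
    return size, devs, defects


rng = random.Random(42)  # fixed seed so the run is repeatable
size, devs, defects = simulate(5000, true_effect=0.0, rng=rng)

# Method A: regress defects on developers, ignoring size.
naive = ols_slope(devs, defects)

# Method B: remove the linear influence of size from both variables
# first, then regress the residuals on each other.
adjusted = ols_slope(residuals(size, devs), residuals(size, defects))

print(f"true effect: 0.0, naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

Under these assumptions the naive estimate lands far from the true effect of zero, whereas the adjusted estimate stays close to it. Whether that counts as a threat to the naive method or as invalidating it depends on whether a reader finds the simulated confounding controversial or plausible, which is the judgment the abstract describes as subjective.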

Data Availability

All artifacts and data sets are provided online on GitHub (https://github.com/topleet/MSR2022).

Notes

  1. In R, random number generators are vectorized and start with the letter r followed by an abbreviation for the distribution family (we will see rbinom, rnorm, and rpois).

  2. All our reproductions of other papers are fully available online to guarantee the reproducibility of this paper.

Acknowledgements

We acknowledge the original work of the authors of the studies that are the subject of our illustrations. All studies were selected because of their originality. We believe that a meta-validation of simulation-based testing would not be credible on unpublished examples.

Author information

Corresponding author

Correspondence to Johannes Härtel.

Ethics declarations

Conflicts of interest

The authors have no conflict of interest.

Additional information

Communicated by: Nicole Novielli, Shane McIntosh, David Lo.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Mining Software Repositories (MSR)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Härtel, J., Lämmel, R. Operationalizing validity of empirical software engineering studies. Empir Software Eng 28, 153 (2023). https://doi.org/10.1007/s10664-023-10370-3

  • DOI: https://doi.org/10.1007/s10664-023-10370-3
