Operationalizing validity of empirical software engineering studies

Härtel, Johannes; Lämmel, Ralf

doi:10.1007/s10664-023-10370-3

Published: 13 November 2023

Empirical Software Engineering, volume 28, article number 153 (2023)

Abstract

Empirical software engineering studies apply methods such as linear regression, statistical tests, or correlation analysis to better understand software engineering scenarios. Assuring the validity of such methods and the corresponding results is challenging but critical. This is reflected in the validity criteria that are part of the reviewing process for such research results. However, these criteria are often hard to define operationally and thus hard for reviewers to judge. In this paper, we describe a new strategy to define and communicate the validity of methods and results. We conceptually decompose a study into an empirical scenario, a used method, and the produced results. Validity can only be described as the relationship between these three parts. To make the empirical scenario fully operational, we convert informal assumptions about it into executable simulation code that uses artificial data to replace (or complement) our real data. We can then run the method on the artificial data and examine the impact of our assumptions on the quality of the results. This may operationally i) support the validity of a method if the result is valid, ii) threaten the validity of a method if the result is invalid and the assumptions are controversial, or iii) invalidate a method if the result is invalid and the assumptions are plausible. We encourage researchers to submit simulations as additional artifacts to the reviewing process to make such statements explicit. Judging whether a simulated scenario is plausible or controversial is subjective and may benefit from involving a reviewer. We show that existing empirical software engineering studies can benefit from such additional validation artifacts.
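The strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' code (their artifacts are written in R), and every name in it is hypothetical: `TRUE_EFFECT`, the hidden variable `z`, and the `confounding` parameter stand in for assumptions about an empirical scenario, and a simple least-squares slope stands in for the method under test. Because the simulation fixes the true effect, the error of the method's result can be measured directly.

```python
import random

TRUE_EFFECT = 0.5  # assumed effect of x on y, fixed by the simulation (hypothetical value)


def simulate(n, confounding, seed):
    """Generate artificial (x, y) pairs; z is a hidden variable that affects both."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                      # hidden variable, never shown to the method
        x = confounding * z + rng.gauss(0.0, 1.0)    # predictor, partly driven by z
        y = TRUE_EFFECT * x + confounding * z + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    return xs, ys


def ols_slope(xs, ys):
    """Method under test: slope of a simple least-squares regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def estimation_error(confounding, n=20000, seed=1):
    """Run the method on artificial data and compare with the known true effect."""
    xs, ys = simulate(n, confounding, seed)
    return ols_slope(xs, ys) - TRUE_EFFECT


if __name__ == "__main__":
    for c in (0.0, 1.0):
        print(f"confounding={c}: error={estimation_error(c):+.3f}")
```

Under the assumption of no confounding, the error should be close to zero, which supports the method for this scenario. With `confounding=1.0`, the expected bias of the slope is c²/(c² + 1) = 0.5, so the result is invalid; whether this threatens or invalidates the method depends on whether a reader or reviewer rates the confounding assumption as controversial or plausible.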

Data Availability

All artifacts and data sets are provided online on GitHub (https://github.com/topleet/MSR2022).

Notes

In R, random number generators are vectorized and start with the letter r followed by an abbreviation for the distribution family (we will see rbinom, rnorm and rpois).
All our reproductions of other papers are fully available online to support the reproducibility of this paper.

Acknowledgements

We acknowledge the original work of the authors of the studies that are the subject of our illustrations. All studies were selected because of their originality. We believe that this meta-validation of simulation-based testing would not be credible on unpublished examples.

Author information

Authors and Affiliations

Vrije Universiteit Brussel, Ixelles, Belgium
Johannes Härtel
University of Koblenz, Koblenz, Germany
Ralf Lämmel

Corresponding author

Correspondence to Johannes Härtel.

Ethics declarations

Conflicts of interest

The authors have no conflict of interest.

Additional information

Communicated by: Nicole Novielli, Shane McIntosh, David Lo.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Mining Software Repositories (MSR)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Härtel, J., Lämmel, R. Operationalizing validity of empirical software engineering studies. Empir Software Eng 28, 153 (2023). https://doi.org/10.1007/s10664-023-10370-3

Accepted: 13 July 2023
Published: 13 November 2023
DOI: https://doi.org/10.1007/s10664-023-10370-3