Bayesian Data Analysis in Empirical Software Engineering: The Case of Missing Data

Torkar, Richard; Feldt, Robert; Furia, Carlo A.

doi:10.1007/978-3-030-32489-6_11

Abstract

Bayesian data analysis (BDA) is used today across many research disciplines. These disciplines use BDA to embrace uncertainty, both by using multilevel models and by drawing on all available information. In this chapter, we first introduce the reader to BDA and then provide an example from empirical software engineering, in which we also deal with a common issue in our field, namely missing data. The example presents the steps taken when conducting a state-of-the-art statistical analysis. First, we need to understand the problem we want to solve. Second, we conduct causal analysis. Third, we analyze non-identifiability. Fourth, we conduct missing data analysis. Finally, we do a sensitivity analysis of priors. All of this takes place before we design our statistical model. Once we have a model, we present several diagnostics that can be used as sanity checks. We hope that these examples show the reader the advantages of BDA, and that Bayesian statistics will thereby become more prevalent in our field, partly avoiding the reproducibility crisis seen in other disciplines.
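
To make the last step in that list concrete, the following is a minimal sketch of what a sensitivity analysis of priors can look like. It is not the chapter's own analysis: it swaps in a simple conjugate Beta-Binomial model so that the posterior can be computed by hand, and the data, the prior choices, and the names `beta_posterior` and `beta_mean` are all hypothetical and introduced only for illustration.

```python
# Prior sensitivity sketch using a conjugate Beta-Binomial model.
# ASSUMPTION: the data and the three priors below are invented for
# illustration; they do not come from the chapter.


def beta_posterior(alpha, beta, successes, failures):
    """Update a Beta(alpha, beta) prior with binomial data.

    Conjugacy gives the posterior Beta(alpha + successes, beta + failures).
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: 7 of 20 projects exceeded their effort estimate.
successes, failures = 7, 13

# Candidate priors, from uninformative to sceptical.
priors = {
    "flat Beta(1, 1)": (1.0, 1.0),
    "weakly informative Beta(2, 2)": (2.0, 2.0),
    "sceptical Beta(2, 8)": (2.0, 8.0),
}

# Refit the same model under each prior and record the posterior mean.
posterior_means = {}
for name, (a, b) in priors.items():
    post_a, post_b = beta_posterior(a, b, successes, failures)
    posterior_means[name] = beta_mean(post_a, post_b)
    print(f"{name}: posterior mean = {posterior_means[name]:.3f}")

# A small spread suggests the conclusions do not hinge on the prior.
spread = max(posterior_means.values()) - min(posterior_means.values())
print(f"spread across priors = {spread:.3f}")
```

The idea is to refit the same model under several defensible priors and compare the resulting posteriors. If the summaries differ only slightly, the data dominate; if they diverge, the choice of prior needs explicit justification. With a realistic multilevel model the posterior would be obtained by sampling rather than in closed form, but the comparison would follow the same pattern.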

Author information

Authors and Affiliations

Chalmers and University of Gothenburg, Gothenburg, Sweden
Richard Torkar & Robert Feldt
Università della Svizzera Italiana, Lugano, Switzerland
Carlo A. Furia

Corresponding author

Correspondence to Richard Torkar.

Editor information

Editors and Affiliations

Institute of Computer Science, University of Innsbruck, Innsbruck, Austria
Michael Felderer
Systems Engineering and Computer Science, Federal University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
Guilherme Horta Travassos

About this chapter

Cite this chapter

Torkar, R., Feldt, R., Furia, C.A. (2020). Bayesian Data Analysis in Empirical Software Engineering: The Case of Missing Data. In: Felderer, M., Travassos, G. (eds) Contemporary Empirical Methods in Software Engineering. Springer, Cham. https://doi.org/10.1007/978-3-030-32489-6_11

DOI: https://doi.org/10.1007/978-3-030-32489-6_11
Published: 28 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32488-9
Online ISBN: 978-3-030-32489-6
eBook Packages: Computer Science, Computer Science (R0)
