Abstract
Bayesian data analysis (BDA) is today used in a multitude of research disciplines. These disciplines use BDA to embrace uncertainty by employing multilevel models and making use of all available information. In this chapter, we first introduce the reader to BDA and then provide an example from empirical software engineering, where we also deal with a common issue in our field, i.e., missing data. The example presents the steps taken when conducting a state-of-the-art statistical analysis. First, we need to understand the problem we want to solve. Second, we conduct causal analysis. Third, we analyze non-identifiability. Fourth, we conduct missing data analysis. Finally, we do a sensitivity analysis of priors. All of this takes place before we design our statistical model. Once we have a model, we present several diagnostics one can use to conduct sanity checks. We hope that through these examples, the reader will see the advantages of using BDA. This way, we hope Bayesian statistics will become more prevalent in our field, thus partly avoiding the reproducibility crisis we have seen in other disciplines.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Torkar, R., Feldt, R., Furia, C.A. (2020). Bayesian Data Analysis in Empirical Software Engineering: The Case of Missing Data. In: Felderer, M., Travassos, G. (eds) Contemporary Empirical Methods in Software Engineering. Springer, Cham. https://doi.org/10.1007/978-3-030-32489-6_11
DOI: https://doi.org/10.1007/978-3-030-32489-6_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32488-9
Online ISBN: 978-3-030-32489-6
eBook Packages: Computer Science (R0)