Challenges of Evaluating the Quality of Software Engineering Experiments

  • Oscar Dieste
  • Natalia Juristo


Good-quality experiments are free of bias. Bias is considered to be related to internal validity (e.g., how well experiments are planned, designed, executed, and analysed). Quality scales and expert opinion are two approaches for assessing the quality of experiments. Aim: Identify whether there is a relationship between bias and quality scale and expert opinion predictions in SE experiments. Method: We used a quality scale to determine the quality of 35 experiments from three systematic literature reviews. We used two different procedures (effect size and response ratio) to calculate the bias in diverse response variables for the above experiments. Experienced researchers assessed the quality of these experiments. We analysed the correlations between the quality scores, bias and expert opinion. Results: The relationship between quality scales, expert opinion and bias depends on the technology exercised in the experiments. The correlation between quality scales, expert opinion and bias is only correct when the technologies can be subjected to acceptable experimental control. Both correct and incorrect expert ratings are more extreme than the quality scales. Conclusions: A quality scale based on formal internal quality criteria will predict bias satisfactorily provided that the technology can be properly controlled in the laboratory.
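The two measures named in the Method can be illustrated with a minimal sketch. It assumes the textbook definitions: a standardized mean difference with Hedges' small-sample correction for "effect size", and the natural log of the ratio of group means for "response ratio". The abstract does not say which variants were used or how bias was derived from them, so the function names and the toy scores below are hypothetical.

```python
import math
from statistics import mean, stdev


def hedges_g(treatment, control):
    """Standardized mean difference with Hedges' small-sample correction.

    Assumption: a two-group design with independent samples.
    """
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    # Pooled standard deviation across the two groups.
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (mean(treatment) - mean(control)) / pooled
    # Correction factor J = 1 - 3 / (4 * df - 1), with df = n1 + n2 - 2.
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)
    return j * d


def log_response_ratio(treatment, control):
    """Natural log of the ratio of group means (both means must be positive)."""
    return math.log(mean(treatment) / mean(control))


# Hypothetical response-variable scores for a single experiment.
treatment_scores = [12, 14, 16]
control_scores = [8, 10, 12]

g = hedges_g(treatment_scores, control_scores)
lnrr = log_response_ratio(treatment_scores, control_scores)
print(f"Hedges' g = {g:.3f}, log response ratio = {lnrr:.3f}")
```

The response ratio is unit-free and needs only the group means, so it can be computed for experiments that do not report variances; the standardized effect size needs the standard deviations as well.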



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Universidad Politécnica de Madrid, Madrid, Spain
