Abstract
This article introduces the application of equivalence hypothesis testing (EHT) to the Empirical Software Engineering field. Equivalence testing (also known as bioequivalence testing in pharmacological studies) is a statistical approach that answers the question “is product T equivalent to some reference product R within some range \(\Updelta\)?”. The “null hypothesis significance test” approach traditionally used in Empirical Software Engineering seeks to assess evidence for differences between T and R, not for equivalence. In this paper, we explain how EHT can be applied in Software Engineering, thereby extending it from its current application in pharmacological studies to Empirical Software Engineering. We illustrate the application of EHT to Empirical Software Engineering by re-examining the behavior of experts and novices when handling code with side effects compared to side-effect-free code, a study previously investigated using traditional statistical testing. We also re-analyze the published data of two other software engineering experiments: one dataset compared the comprehension of UML and OML specifications, and the other studied the differences between the specification methods UML-B and B. The application of EHT allows us to extract additional conclusions from the previous results. EHT has an important application in Empirical Software Engineering, which motivates its wider adoption and use: EHT can be used to assess the statistical confidence with which we can claim that two software engineering methods, algorithms, or techniques are equivalent.
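The standard EHT procedure referenced in the article is Schuirmann's (1987) two one-sided tests (TOST): equivalence within \(\pm\Updelta\) is claimed only if both one-sided null hypotheses (the true difference lies at or below \(-\Updelta\), or at or above \(+\Updelta\)) are rejected. The following is a minimal sketch of that idea, not the authors' analysis scripts; the function name, the example data, and the hard-coded critical value (1.943, approximately \(t_{0.95}\) with 6 degrees of freedom) are illustrative assumptions.

```python
import math

def tost_equivalent(sample_t, sample_r, delta, t_crit):
    """Two one-sided tests (TOST) for equivalence of means.

    Claim equivalence of sample_t and sample_r within +/- delta if BOTH
    one-sided null hypotheses (difference <= -delta, difference >= +delta)
    are rejected. t_crit is the one-sided critical value t_{1-alpha, df}
    supplied by the caller (e.g. from statistical tables).
    """
    n1, n2 = len(sample_t), len(sample_r)
    m1 = sum(sample_t) / n1
    m2 = sum(sample_r) / n2
    # pooled variance, as in the equal-variance two-sample t-test
    ss1 = sum((x - m1) ** 2 for x in sample_t)
    ss2 = sum((x - m2) ** 2 for x in sample_r)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    diff = m1 - m2
    # lower test: H0 is diff <= -delta, rejected when t_lower > t_crit
    t_lower = (diff + delta) / se
    # upper test: H0 is diff >= +delta, rejected when t_upper < -t_crit
    t_upper = (diff - delta) / se
    return t_lower > t_crit and t_upper < -t_crit

# Illustrative samples with identical means: a wide margin yields a
# claim of equivalence, a very narrow margin does not.
T = [10.0, 10.1, 9.9, 10.0]
R = [10.05, 9.95, 10.0, 10.0]
print(tost_equivalent(T, R, delta=1.0, t_crit=1.943))   # True
print(tost_equivalent(T, R, delta=0.01, t_crit=1.943))  # False
```

Note the asymmetry with the traditional significance test: here failing to reject either one-sided null means equivalence is *not* demonstrated, rather than that a difference exists.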
Notes
Although it is a rare situation, it is feasible to test for equivalence with the null hypothesis stated as \(H_0: |\mu_T-\mu_R| \le\Updelta\). However, this procedure is rarely used in practice, since it does not allow one to control the “consumer's risk” (Hauschke et al. 2007, pp. 45–46), thereby rendering the test for equivalence useless. See McBride (2005, section 5.3) and Cole and McBride (2004) for the practical consequences of this approach.
The value of a variable \(y\) scaled from the range \((y_{\rm min}, y_{\rm max})\) into the range \((x_{\rm min}, x_{\rm max})\) is given by the transformation
$$ \begin{array}{c} y_{\rm scaled}=y\frac{x_{\rm max}-x_{\rm min}}{y_{\rm max}-y_{\rm min}} +\frac{x_{\rm min}\cdot{y_{\rm max}}-x_{\rm max}\cdot{y_{\rm min}}}{y_{\rm max}-y_{\rm min}}. \end{array} $$
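The transformation above is the usual linear rescaling and can be written in the algebraically equivalent form \(x_{\rm min} + (y - y_{\rm min})(x_{\rm max}-x_{\rm min})/(y_{\rm max}-y_{\rm min})\). A minimal sketch, with an illustrative function name:

```python
def rescale(y, y_min, y_max, x_min, x_max):
    """Linearly map y from the range (y_min, y_max) onto (x_min, x_max).

    Expanding this expression recovers the form given in the text:
    y*(x_max - x_min)/(y_max - y_min)
      + (x_min*y_max - x_max*y_min)/(y_max - y_min).
    """
    return x_min + (y - y_min) * (x_max - x_min) / (y_max - y_min)

# Endpoints map onto endpoints, and the midpoint onto the midpoint:
print(rescale(0.0, 0.0, 10.0, 0.0, 100.0))   # 0.0
print(rescale(5.0, 0.0, 10.0, 0.0, 100.0))   # 50.0
print(rescale(10.0, 0.0, 10.0, 0.0, 100.0))  # 100.0
```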
References
Borg, M., & Pfahl, D. (2011). Do better IR tools improve the accuracy of engineers’ traceability recovery? In Proceedings of the international workshop on machine learning technologies in software engineering (MALETS ’11), (pp. 27–34).
Chen, D. G., & Peace, K. E. (2011). Clinical trial data analysis using R. Boca Raton, Florida, USA: Chapman & Hall.
Chow, S. C., & Liu, J. P. (2009). Design and analysis of bioavailability and bioequivalence studies. London: Chapman & Hall.
Chow, S. C., & Wang, H. (2001). On sample size calculation in bioequivalence trials. Journal of Pharmacokinetics and Pharmacodynamics 28(2), 155–169.
Chow, S. L. (1998). Precis of statistical significance: Rationale, validity and utility (with comments and reply). Behavioral and Brain Sciences 21, 169–239.
Cole, R., & McBride, G. (2004). Assessing impacts of dredge spoil disposal using equivalence tests: Implications of a precautionary (proof of safety) approach. Marine Ecology Progress Series 279, 63–72.
Cribbie, R. A., Gruman, J. A., & Arpin-Cribbie, C. A. (2004). Recommendations for applying tests of equivalence. Journal of Clinical Psychology 60(1), 1–10.
Dolado, J. J., Harman, M., Otero, M. C., & Hu, L. (2003). An empirical investigation of the influence of a type of side effects on program comprehension. IEEE Transactions on Software Engineering, 29(7), 665–670.
EMA. (2010). Guideline on the investigation of bioequivalence. Tech. Rep. CPMP/EWP/QWP/1401/98 Rev. 1, EMA, European Medicines Agency.
Ennis, D. M., & Ennis, J. M. (2009). Hypothesis testing for equivalence defined on symmetric open intervals. Communications in Statistics—Theory and Methods 38(11), 1792–1803.
Ennis, D. M., & Ennis, J. M. (2010). Equivalence hypothesis testing. Food Quality and Preference 21, 253–256.
Garrett, K. A. (1997). Use of statistical tests of equivalence (bioequivalence tests) in plant pathology. Phytopathology 87(4), 372–374.
Harman, M. (2010). Why source code analysis and manipulation will always be important. In 10th IEEE international working conference on source code analysis and manipulation, Timisoara, Romania (pp. 7–19).
Harman, M., Hu, L., Hierons, R., Munro, M., Zhang, X., Dolado, J., Otero, M., & Wegener, J. (2002). A post-placement side-effect removal algorithm. In IEEE proceedings of the international conference on software maintenance (ICSM 2002), (pp. 2–11).
Harman, M., Hu, L., Zhang, X., & Munro, M. (2001). Side-effect removal transformation. In IEEE international workshop on program comprehension (IWPC 2001), Toronto, Canada (pp. 309–319).
Hauschke, D., Steinijans, V., & Pigeot, I. (2007). Bioequivalence studies in drug development. Methods and applications. New York: Wiley.
Hintze, J. (2000). PASS 2000. Kaysville, Utah, USA: NCSS, LLC.
Hoenig, J., & Heisey, D. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician 55(1), 19.
Hyslop, T., & Iglewicz, B. (2001). Alternative cross-over designs for individual bioequivalence. In Proceedings of the annual meeting of the American statistical association.
Lakhotia, K., McMinn, P., & Harman, M. (2009). Automated test data generation for coverage: Haven’t we solved this problem yet? In 4th testing academia and industry conference—practice and research techniques (TAIC PART’09), Windsor, UK (pp. 95–104).
McBride, G. B. (2005). Using statistical methods for water quality management. Issues, problems and solutions. New York: Wiley.
Mecklin, C. (2003). A comparison of equivalence testing in combination with hypothesis testing and effect sizes. Journal of Modern Applied Statistical Methods 2(2), 329–340.
Meyners, M. (2012). Equivalence tests—a review. Food Quality and Preference 26(2), 231–245.
Miller, J., Daly, J., Wood, M., Roper, M., & Brooks, A. (1997). Statistical power and its subcomponents—missing and misunderstood concepts in empirical software engineering research. Information and Software Technology 39(4), 285–295.
Miranda, B., Sturtevant, B., Yang, J., & Gustafson, E. (2009). Comparing fire spread algorithms using equivalence testing and neutral landscape models. Landscape Ecology 24, 587–598.
Ngatia, M., Gonzalez, D., Julian, S. S., & Conner, A. (2010). Equivalence versus classical statistical tests in water quality assessments. Journal of Environmental Monitoring 12, 172–177.
Ogungbenro, K., & Aarons, L. (2008). How many subjects are necessary for population pharmacokinetic experiments? Confidence interval approach. European Journal of Clinical Pharmacology 64, 705–713.
Otero, M. C., & Dolado, J. J. (2005). An empirical comparison of the dynamic modeling in OML and UML. Journal of Systems and Software 77(2), 91–102.
Piaggio, G., & Pinol, A. P. Y. (2001). Use of the equivalence approach in reproductive health clinical trials. Statistics in Medicine 20(23), 3571–3578.
Piaggio, G., Elbourne, D. R., Altman, D. G., Pocock, S. J., & Evans, S. J. W. (2006). Reporting of noninferiority and equivalence randomized trials. An extension of the consort statement. The Journal of the American Medical Association 295(10), 1152–1160.
Pikounis, B., Bradstreet, T. E., & Millard, S. P. (2001). Graphical insight and data analysis for the 2,2,2 crossover design. In S. P. Millard & A. Krause (Eds.), Applied statistics in the pharmaceutical industry with case studies using S-plus (pp. 153–188). Berlin: Springer.
R Core Team. (2012). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Rani, S., & Pargal, A. (2004). Bioequivalence: An overview of statistical concepts. Indian Journal of Pharmacology 36(4), 209–216.
Razali, R. (2008). Usability of semi-formal and formal methods integration—empirical assessments. PhD thesis, School of Electronics and Computer Science, Faculty of Engineering, Science and Mathematics, University of Southampton.
Razali, R., & Garratt, P. W. (2006). Measuring the comprehensibility of a UML-B model and a B model. In International conference on computer and information science and engineering (CISE 2006) (pp. 338–343).
Razali, R., Snook, C. F., & Poppleton, M. R. (2007a). Comprehensibility of UML-based formal model: A series of controlled experiments. In Proceedings of the 1st ACM international workshop on empirical assessment of software engineering languages and technologies: Held in conjunction with the 22nd IEEE/ACM international conference on automated software engineering (ASE) 2007, ACM, New York, NY, USA, WEASELTech ’07 (pp. 25–30).
Razali, R., Snook, C. F., Poppleton, M. R., Garratt, P. W., & Walters, R. J. (2007b). Experimental comparison of the comprehensibility of a UML-based formal specification versus a textual one. In B. Kitchenham, P. Brereton, & M. Turner (Eds.), Proceedings of the 11th international conference on evaluation and assessment in software engineering (EASE ’07), British Computer Society (pp. 1–11).
Robinson, A. P., & Froese, R. E. (2004). Model validation using equivalence tests. Ecological Modelling 176(3–4), 349–358.
Robinson, A. P., Duursma, R. A., & Marshall, J. D. (2005). A regression-based equivalence test for model validation: Shifting the burden of proof. Tree Physiology 25, 903–913.
Rogers, J., Howard, K., & Vessey, J. (1993). Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin 113(3), 553–565.
Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics 15(6), 657–680.
Siqueira, A. L., Whitehead, A., Todd, S., & Lucini, M. M. (2005). Comparison of sample size formulae for 2 × 2 cross-over designs applied to bioequivalence studies. Pharmaceutical Statistics 4, 233–243.
Stegner, B. L., Bostrom, A. G., & Greenfield, T. K. (1996). Equivalence testing for use in psychosocial and services research: An introduction with examples. Evaluation and Program Planning 19(3), 193–198.
Stein, J., & Doganaksoy, N. (1999). Sample size considerations for assessing the equivalence of two process means. Quality Engineering 12(1), 105–110.
Tempelman, R. J. (2004). Experimental design and statistical methods for classical and bioequivalence hypothesis testing with an application to dairy nutrition studies. Journal of Animal Science 82(13 suppl), E162–E172.
Tryon, W. W. (2001). Evaluating statistical difference, equivalence, and indeterminacy using inferential confidence intervals: An integrated alternative method of conducting null hypothesis statistical tests. Psychological Methods 6(4), 371–386.
Van Peer, A. (2010). Variability and impact on design of bioequivalence studies. Basic & Clinical Pharmacology & Toxicology 106(3), 146–153.
Waldhoer, T., & Heinzl, H. (2011). Combining difference and equivalence test results in spatial maps. International Journal of Health Geographics 10(1), 3.
Wellek, S. (2010). Testing statistical hypotheses of equivalence and noninferiority, 2nd edn. Boca Raton, Florida, USA: Chapman & Hall.
Westlake, W. J. (1976). Symmetrical confidence intervals for bioequivalence trials. Biometrics 32(4), 741–744.
Yue, L., & Roach, P. (1998). A note on the sample size determination in two-period repeated measurements crossover design with application to clinical trials. Journal of Biopharmaceutical Statistics 8(4), 577–584.
Acknowledgments
The authors are grateful to the reviewers for their helpful comments.
Additional information
The data, R scripts, and other information are available at http://www.sc.ehu.es/jiwdocoj/eht/eht.htm.
Cite this article
Dolado, J.J., Otero, M.C. & Harman, M. Equivalence hypothesis testing in experimental software engineering. Software Qual J 22, 215–238 (2014). https://doi.org/10.1007/s11219-013-9196-0