
The secret life of test smells - an empirical study on test smell evolution and maintenance

Published in: Empirical Software Engineering

Abstract

In recent years, researchers and practitioners have been studying the impact of test smells on test maintenance. However, there is still limited empirical evidence on why developers remove test smells during software maintenance and the mechanisms they employ to address them. In this paper, we conduct an empirical study on 12 real-world open-source systems to study the evolution and maintenance of test smells, and how test smells relate to software quality. Our results show that: 1) Although the number of test smell instances increases, test smell density decreases as systems evolve. 2) However, our qualitative analysis of the removed test smells reveals that most test smell removal (83%) is a by-product of feature maintenance activities. 45% of the removed test smells relocate to other test cases due to refactoring, while developers deliberately address only 17% of the test smell instances, largely consisting of Exception Catch/Throw and Sleepy Test smells. 3) Our statistical model shows that test smell metrics provide additional explanatory power for post-release defects over traditional baseline metrics (an average increase of 8.25% in AUC). However, most types of test smells have a minimal effect on post-release defects. Our study provides insight into how developers resolve test smells and into current test maintenance practices. Future studies on test smells may consider focusing on the specific types of test smells that have a higher correlation with defect-proneness when helping developers with test code maintenance.
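To make the two smell types that developers most often address deliberately concrete, here is a minimal, self-contained JUnit 4 sketch. The OrderService class and all names are invented for illustration and do not come from the studied systems; the test exhibits both a Sleepy Test (a hard-coded Thread.sleep() used as a synchronization mechanism) and an Exception Catch/Throw (manually catching a checked exception inside the test instead of letting the framework handle it):

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

import org.junit.Test;

// Hypothetical system under test (invented for illustration): processes
// an order on a background thread and exposes its status afterwards.
class OrderService {
    private volatile String status = "PENDING";

    void submitOrderAsync(String orderId) {
        new Thread(() -> status = "PROCESSED").start();
    }

    String getStatus(String orderId) {
        return status;
    }
}

public class OrderServiceTest {

    @Test
    public void testAsyncOrderProcessing() {
        OrderService service = new OrderService();
        service.submitOrderAsync("order-42");

        try {
            // Sleepy Test smell: a fixed Thread.sleep() waits for the
            // asynchronous work, making the test slow and timing-dependent.
            Thread.sleep(5000);
        } catch (InterruptedException e) {
            // Exception Catch/Throw smell: the test catches the checked
            // exception and fails manually, instead of declaring
            // `throws InterruptedException` and letting JUnit report it.
            fail("Test was interrupted: " + e.getMessage());
        }

        assertEquals("PROCESSED", service.getStatus("order-42"));
    }
}
```

A smell-free revision would declare throws InterruptedException on the test method and replace the fixed sleep with an explicit synchronization point, e.g., a CountDownLatch or a polling wait with a timeout.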


Notes

  1. https://github.com/SPEAR-SE/TestSmellEmpirical_Data

  2. https://github.com/apache/flink/pull/4446

  3. Logistic regression via the lrm function from the rms R package.

  4. Redundancy analysis from the Hmisc R package.

  5. VIF analysis from the regclass R package.
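Notes 3 to 5 outline the statistical toolchain. As a sketch of the underlying setup (our paraphrase, based on the baseline metric names used in the appendix tables, not the paper's exact specification): a logistic regression model of post-release defect-proneness is fit on the baseline metrics, then augmented with test smell metrics, and the two models are compared by AUC.

$$
\Pr(\text{defect}_i = 1) = \frac{1}{1 + e^{-z_i}},
\qquad
z_i^{\text{base}} = \beta_0 + \beta_1\,\text{LOC}_i + \beta_2\,\text{CHURNS}_i + \beta_3\,\text{PRE}_i + \beta_4\,\text{COUPLING}_i
$$

$$
z_i^{\text{full}} = z_i^{\text{base}} + \sum_{k} \gamma_k\,\text{SMELL}_{k,i},
\qquad
\Delta\text{AUC} = \text{AUC}\big(z^{\text{full}}\big) - \text{AUC}\big(z^{\text{base}}\big)
$$

Redundancy analysis (note 4) and VIF analysis (note 5) are applied to drop highly correlated predictors before fitting, which keeps the coefficient estimates interpretable; the abstract's reported figure is an average AUC increase of 8.25% from the test smell metrics.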


Author information


Corresponding author

Correspondence to Dong Jae Kim.

Additional information

Communicated by: Andy Zaidman

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Table 9 The statistics of the regression models showing additive defect explainability of PD(TEST_PRODUCT) + PR(TEST_PROCESS) metrics over the BASE(LOC+CHURNS+PRE+COUPLING)
Table 10 The statistics of the regression models showing additive defect explainability of PD(TEST_PRODUCT) + PR(TEST_PROCESS) metrics over the BASE(LOC+CHURNS+PRE+COUPLING)
Table 11 The statistics of the regression models showing additive defect explainability of PD(TEST_PRODUCT) + PR(TEST_PROCESS) metrics over the BASE(LOC+CHURNS+PRE+COUPLING)
Table 12 The statistics of the regression models showing additive defect explainability of PD(TEST_PRODUCT) + PR(TEST_PROCESS) metrics over the BASE(LOC+CHURNS+PRE+COUPLING)
Table 13 The statistics of the regression models showing additive defect explainability of PD(TEST_PRODUCT) + PR(TEST_PROCESS) metrics over the BASE(LOC+CHURNS+PRE+COUPLING)
Table 14 The effect size of the test smell metrics on post-release defects
Table 15 The effect size of the test smell metrics on post-release defects


About this article


Cite this article

Kim, D.J., Chen, TH. & Yang, J. The secret life of test smells - an empirical study on test smell evolution and maintenance. Empir Software Eng 26, 100 (2021). https://doi.org/10.1007/s10664-021-09969-1

