Evaluating the agreement among technical debt measurement tools: building an empirical benchmark of technical debt liabilities

Abstract

Software teams are often asked to deliver new features within strict deadlines, leading developers to deliberately or inadvertently ship “not quite right code” that compromises software quality and maintainability. This non-ideal state of software is effectively captured by the Technical Debt (TD) metaphor, which reflects the additional effort that has to be spent to maintain software. Although several tools are available for assessing TD, each tool essentially checks software against a particular ruleset. The use of different rulesets can be beneficial, as it leads to the identification of a wider set of problems; however, for the common scenario where developers or researchers rely on a single tool, diverse estimates of TD and the identification of different mitigation actions limit the credibility and applicability of the findings. The objective of this study is two-fold. First, we evaluate the degree of agreement among leading TD assessment tools. Second, we propose a framework that captures the diversity of the examined tools with the aim of identifying a few “reference assessments” (or class/file profiles) representing characteristic cases of classes/files with respect to their level of TD. By extracting sets of classes/files exhibiting similarity to a selected profile (e.g., that of high TD levels in all employed tools), we establish a basis that can be used either for prioritizing maintenance activities or for training more sophisticated TD identification techniques. The proposed framework is illustrated through a case study on fifty (50) open source projects in two programming languages (Java and JavaScript), employing three leading TD tools.

Notes

  1. https://www.omg.org/spec/ATDM/About-ATDM

  2. https://www.castsoftware.com/

  3. https://www.vector.com/int/en/products/products-a-z/software/squore/squore-software-analytics-for-project-monitoring/

  4. https://www.sonarqube.org

  5. The term ‘class’ refers to the unit of analysis for Java projects, while the term ‘file’ refers to the unit of analysis for JavaScript projects. Throughout the paper we primarily use the term ‘class’ for simplicity, but both units of analysis are considered, accordingly.

  6. https://shiny.rstudio.com

  7. https://www.r-project.org

  8. http://stains.csd.auth.gr

  9. https://se.uom.gr/index.php/projects/technical-debt-benchmarking

  10. https://se.uom.gr

  11. https://ieeexplore.ieee.org

  12. https://dl.acm.org/

  13. https://scitools.com/

  14. https://www.kiuwan.com/

  15. https://www.omg.org/spec/ATDM/About-ATDM

  16. https://doi.org/10.5281/zenodo.3966202

  17. The idea of archetypes was developed by psychologist C. Jung in his studies about drivers of human behavior. Pearson suggested the use of 12 archetypes, among which the ‘Ruler’ denotes personalities whose goal is to create a prosperous, successful family or community, while for a ‘Rebel’ (also known as Outlaw) the motto is that rules are made to be broken. In our context, the ‘Ruler’ profile denotes a community of classes sharing the same assessment by all employed tools, while the ‘Rebel’ points to tools that in some sense break the rules and identify TD items in a different way than the rest.

  18. tool: https://se.uom.gr/index.php/projects/technical-debt-benchmarking

  19. https://doi.org/10.5281/zenodo.3966202

  20. The Partner archetype refers to personalities whose goal is being in a relationship with people and surroundings. In analogy, the Partner profile in our case denotes cases where two of the three tools exhibit high agreement.

  21. https://github.com/theoam/TDBenchmarker

  22. https://doi.org/10.5281/zenodo.3951041

  23. https://github.com/

  24. https://github.com/mauricioaniche/ck

  25. https://www.omg.org/spec/ATDM/About-ATDM

References

  1. Aitchison J (1982) The statistical analysis of compositional data. J R Stat Soc Ser B Methodol 44:139–177

  2. Alves NSR, Mendes TS, de Mendonça MG, Spínola RO, Shull F, Seaman C (2016) Identification and management of technical debt: a systematic mapping study. Inf Softw Technol 70:100–121. https://doi.org/10.1016/j.infsof.2015.10.008

  3. Alves TL, Ypma C, Visser J (2010) Deriving metric thresholds from benchmark data, in: 2010 IEEE international conference on software maintenance. Presented at the 2010 IEEE international conference on software maintenance, pp. 1–10. https://doi.org/10.1109/ICSM.2010.5609747

  4. Arvedahl S (2018) Introducing Debtgrep, a tool for fighting technical debt in Base Station software, in: proceedings of the 2018 international conference on technical debt, TechDebt ‘18. ACM, New York, NY, USA, pp. 51–52. https://doi.org/10.1145/3194164.3194183

  5. Baggen R, Correia JP, Schill K, Visser J (2012) Standardized code quality benchmarking for improving software maintainability. Softw Qual J 20:287–307. https://doi.org/10.1007/s11219-011-9144-9

  6. Baldassari B (2013) SQuORE: a new approach to software project assessment

  7. Campbell GA, Papapetrou PP (2013) SonarQube in action, 1st edn. Manning Publications Co., Greenwich

  8. Canhasi E, Kononenko I (2014) Weighted archetypal analysis of the multi-element graph for query-focused multi-document summarization. Expert Syst Appl 41:535–543. https://doi.org/10.1016/j.eswa.2013.07.079

  9. Chan BHP, Mitchell DA, Cram LE (2003) Archetypal analysis of galaxy spectra. Mon Not R Astron Soc 338:790–795. https://doi.org/10.1046/j.1365-8711.2003.06099.x

  10. Chatzipetrou P, Angelis L, Rovegård P, Wohlin C (2010) Prioritization of issues and requirements by cumulative voting: a compositional data analysis framework, in: 2010 36th EUROMICRO conference on software engineering and advanced applications. Presented at the 2010 36th EUROMICRO conference on software engineering and advanced applications, pp. 361–370. https://doi.org/10.1109/SEAA.2010.35

  11. Chatzipetrou P, Papatheocharous E, Angelis L, Andreou AS (2015) A multivariate statistical framework for the analysis of software effort phase distribution. Inf Softw Technol 59:149–169. https://doi.org/10.1016/j.infsof.2014.11.004

  12. Chopra K, Sachdeva M (2015) Evaluation of software metrics for software projects. Int. J. Comput. Technol. 14:5845–5853. https://doi.org/10.24297/ijct.v14i6.1915

  13. Cohen J (1968) Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull 70:213–220. https://doi.org/10.1037/h0026256

  14. Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20:37–46. https://doi.org/10.1177/001316446002000104

  15. Conejero JM, Rodríguez-Echeverría R, Hernández J, Clemente PJ, Ortiz-Caraballo C, Jurado E, Sánchez-Figueroa F (2018) Early evaluation of technical debt impact on maintainability. J Syst Softw 142:92–114. https://doi.org/10.1016/j.jss.2018.04.035

  16. Cunningham W (1992) The WyCash portfolio management system, in: addendum to the proceedings on object-oriented programming systems, languages, and applications (addendum), OOPSLA ‘92. ACM, New York, pp 29–30. https://doi.org/10.1145/157709.157715

  17. Curtis B, Sappidi J, Szynkarski A (2012) Estimating the principal of an Application’s technical debt. IEEE Softw 29:34–42. https://doi.org/10.1109/MS.2012.156

  18. Cutler A, Breiman L (1994) Archetypal analysis. Technometrics 36:338–347. https://doi.org/10.1080/00401706.1994.10485840

  19. DeMarco T (1986) Controlling software projects: management, measurement, and estimates, 1st edn. Prentice Hall, Englewood Cliffs, NJ

  20. Döhmen T, Bruntink M, Ceolin D, Visser J (2016) Towards a Benchmark for the Maintainability Evolution of Industrial Software Systems, in: 2016 Joint Conference of the International Workshop on Software Measurement and the International Conference on Software Process and Product Measurement (IWSM-MENSURA). Presented at the 2016 Joint Conference of the International Workshop on Software Measurement and the International Conference on Software Process and Product Measurement (IWSM-MENSURA), pp. 11–21. https://doi.org/10.1109/IWSM-Mensura.2016.014

  21. Elze T, Pasquale LR, Shen LQ, Chen TC, Wiggs JL, Bex PJ (2015) Patterns of functional vision loss in glaucoma determined with archetypal analysis. J R Soc Interface 12:20141118. https://doi.org/10.1098/rsif.2014.1118

  22. Ernst NA, Bellomo S, Ozkaya I, Nord RL (2017) What to fix? Distinguishing between design and non-design rules in automated tools, in: 2017 IEEE international conference on software architecture (ICSA). Presented at the 2017 IEEE international conference on software architecture (ICSA), pp. 165–168. https://doi.org/10.1109/ICSA.2017.25

  23. Eugster MJA (2012) Performance profiles based on archetypal athletes. Int J Perform Anal Sport 12:166–187. https://doi.org/10.1080/24748668.2012.11868592

  24. Fernández-Sánchez C, Humanes H, Garbajosa J, Díaz J (2017). An open tool for assisting in technical debt management, in: 2017 43rd Euromicro conference on software engineering and advanced applications (SEAA). Presented at the 2017 43rd Euromicro conference on software engineering and advanced applications (SEAA), pp. 400–403. https://doi.org/10.1109/SEAA.2017.60

  25. Ferreira KAM, Bigonha MAS, Bigonha RS, Mendes LFO, Almeida HC (2012) Identifying thresholds for object-oriented software metrics. J. Syst. Softw., special issue with selected papers from the 23rd Brazilian symposium on software engineering 85, 244–257. https://doi.org/10.1016/j.jss.2011.05.044

  26. Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76:378–382. https://doi.org/10.1037/h0031619

  27. Foganholi LB, Garcia RE, Eler DM, Correia RCM, Junior CO (2015) Supporting technical debt cataloging with TD-tracker tool. Adv soft Eng 2015, 4:4–4:4. https://doi.org/10.1155/2015/898514

  28. Fontana FA, Roveda R, Vittori S, Metelli A, Saldarini S, Mazzei F (2016a) On Evaluating the Impact of the Refactoring of Architectural Problems on Software Quality, in: Proceedings of the Scientific Workshop Proceedings of XP2016, XP ‘16 Workshops. ACM, New York, NY, USA, pp. 21:1–21:8. https://doi.org/10.1145/2962695.2962716

  29. Fontana FA, Roveda R, Zanoni M (2016b) Technical debt indexes provided by tools: a preliminary discussion, in: 2016 IEEE 8th international workshop on managing technical debt (MTD). Presented at the 2016 IEEE 8th international workshop on managing technical debt (MTD), pp. 28–31. https://doi.org/10.1109/MTD.2016.11

  30. Griffith I, Reimanis D, Izurieta C, Codabux Z, Deo A, Williams B (2014) The correspondence between software quality models and technical debt estimation approaches, in: 2014 sixth international workshop on managing technical debt. Presented at the 2014 sixth international workshop on managing technical debt, pp. 19–26. https://doi.org/10.1109/MTD.2014.13

  31. Gwet KL (2014) Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among raters, 4th edn. Advanced Analytics, LLC, Gaithersburg, MD

  32. Holvitie J, Leppänen V (2013) DebtFlag: technical debt management with a development environment integrated tool, in: proceedings of the 4th international workshop on managing technical debt, MTD ‘13. IEEE Press, Piscataway, pp 20–27

  33. Izurieta C, Vetrò A, Zazworka N, Cai Y, Seaman C, Shull F (2012) Organizing the technical debt landscape, in: 2012 third international workshop on managing technical debt (MTD). Presented at the 2012 third international workshop on managing technical debt (MTD), pp. 23–26. https://doi.org/10.1109/MTD.2012.6225995

  34. Kazman R, Cai Y, Mo R, Feng Q, Xiao L, Haziyev S, Fedak V, Shapochka A (2015) A case study in locating the architectural roots of technical debt, in: 2015 IEEE/ACM 37th IEEE international conference on software engineering. Presented at the 2015 IEEE/ACM 37th IEEE international conference on software engineering, pp. 179–188. https://doi.org/10.1109/ICSE.2015.146

  35. Kendall MG (1948) Rank correlation methods, Rank correlation methods. Griffin, Oxford, England

  36. Kosti MV, Feldt R, Angelis L (2016) Archetypal personalities of software engineers and their work preferences: a new perspective for empirical studies. Empir Softw Eng 21:1509–1532. https://doi.org/10.1007/s10664-015-9395-3

  37. Li S, Wang P, Louviere J, Carson R (2003) Archetypal analysis: a new way to segment markets based on extreme individuals

  38. Li Z, Avgeriou P, Liang P (2015) A systematic mapping study on technical debt and its management. J Syst Softw 101:193–220. https://doi.org/10.1016/j.jss.2014.12.027

  39. Marinescu R (2012) Assessing technical debt by identifying design flaws in software systems. IBM J. res. Dev. 56:9:1–9:13. https://doi.org/10.1147/JRD.2012.2204512

  40. Martín-Fernández JA, Barceló-Vidal C, Pawlowsky-Glahn V (2003) Dealing with Zeros and missing values in compositional data sets using nonparametric imputation. Math Geol 35:253–278. https://doi.org/10.1023/A:1023866030544

  41. Martini A, Bosch J (2016) An empirically developed method to aid decisions on architectural technical debt refactoring: AnaConDebt, in: 2016 IEEE/ACM 38th international conference on software engineering companion (ICSE-C). Presented at the 2016 IEEE/ACM 38th international conference on software engineering companion (ICSE-C), pp. 31–40

  42. Mayr A, Plösch R, Körner C (2014) A benchmarking-based model for technical debt calculation, in: 2014 14th international conference on quality software. Presented at the 2014 14th international conference on quality software, pp. 305–314. https://doi.org/10.1109/QSIC.2014.35

  43. Mendes TS, Gomes FGS, Gonçalves DP, Mendonça MG, Novais RL, Spínola RO (2019) VisminerTD: a tool for automatic identification and interactive monitoring of the evolution of technical debt items. J Braz Comput Soc 25:2. https://doi.org/10.1186/s13173-018-0083-1

  44. Mittas N, Angelis L (2020) Data-driven benchmarking in software development effort estimation: the few define the bulk. J Softw Evol Process e2258. https://doi.org/10.1002/smr.2258

  45. Mittas N, Karpenisi V, Angelis L (2014) Benchmarking effort estimation models using archetypal analysis, in: proceedings of the 10th international conference on predictive models in software engineering, PROMISE ‘14. ACM, New York, pp 62–71. https://doi.org/10.1145/2639490.2639502

  46. Moliner J, Epifanio I (2019) Robust multivariate and functional archetypal analysis with application to financial time series analysis. Phys Stat Mech Its Appl 519:195–208. https://doi.org/10.1016/j.physa.2018.12.036

  47. Mori A, Vale G, Viggiato M, Oliveira J, Figueiredo E, Cirilo E, Jamshidi P, Kastner C (2018) Evaluating domain-specific metric thresholds: an empirical study, in: 2018 IEEE/ACM international conference on technical debt (TechDebt). Presented at the 2018 IEEE/ACM international conference on technical debt (TechDebt), pp. 41–50

  48. Nayebi M, Cai Y, Kazman R, Ruhe G, Feng Q, Carlson C, Chew F (2019) A longitudinal study of identifying and paying down architecture debt, in: 2019 IEEE/ACM 41st international conference on software engineering: software engineering in practice (ICSE-SEIP). Presented at the 2019 IEEE/ACM 41st international conference on software engineering: software engineering in practice (ICSE-SEIP), pp. 171–180. https://doi.org/10.1109/ICSE-SEIP.2019.00026

  49. Nugroho A, Visser J, Kuipers T (2011) An empirical model of technical debt and interest, in: proceedings of the 2nd workshop on managing technical debt, MTD ‘11. Association for Computing Machinery, Waikiki, Honolulu, HI, USA, pp. 1–8. https://doi.org/10.1145/1985362.1985364

  50. Oliveira P, Lima FP, Valente MT, Serebrenik A (2014a) RTTool: a tool for extracting relative thresholds for source code metrics, in: 2014 IEEE international conference on software maintenance and evolution. Pp. 629–632. https://doi.org/10.1109/ICSME.2014.112

  51. Oliveira P, Valente MT, Lima FP (2014b) Extracting relative thresholds for source code metrics, in: 2014 software evolution week - IEEE conference on software maintenance, reengineering, and reverse engineering (CSMR-WCRE). Presented at the 2014 software evolution week - IEEE conference on software maintenance, reengineering, and reverse engineering (CSMR-WCRE), pp. 254–263. https://doi.org/10.1109/CSMR-WCRE.2014.6747177

  52. Pawlowsky-Glahn V, Egozcue JJ (2001) Geometric approach to statistical analysis on the simplex. Stoch Environ Res Risk Assess 15:384–398. https://doi.org/10.1007/s004770100077

  53. Pearson CS (2015) Awakening the heroes within: twelve archetypes to help us find ourselves and transform our world, 1st edn. HarperOne, San Francisco

  54. Pinheiro J, Bates D (2000) Mixed-effects models in S and S-PLUS, statistics and computing. Springer-Verlag, New York

  55. Porzio GC, Ragozini G, Vistocco D (2008) On the use of archetypes as benchmarks. Appl Stoch Models Bus Ind 24:419–437. https://doi.org/10.1002/asmb.727

  56. Runeson P, Höst M (2008) Guidelines for conducting and reporting case study research in software engineering. Empir Softw Eng 14:131. https://doi.org/10.1007/s10664-008-9102-8

  57. Sadowski C, Gogh J van, Jaspan C, Söderberg E, Winter C (2015) Tricorder: building a program analysis ecosystem, in: 2015 IEEE/ACM 37th IEEE international conference on software engineering. Presented at the 2015 IEEE/ACM 37th IEEE international conference on software engineering, pp. 598–608. https://doi.org/10.1109/ICSE.2015.76

  58. Salkind NJ (ed) (2010) Encyclopedia of research design, 1st edn. SAGE Publications Inc, Thousand Oaks

  59. Schmidt RC (1997) Managing Delphi surveys using nonparametric statistical techniques*. Decis Sci 28:763–774. https://doi.org/10.1111/j.1540-5915.1997.tb01330.x

  60. Scott WA (1955) Reliability of content analysis: the case of nominal scale coding. Public Opin Q 19:321–325

  61. Seiler C, Wohlrabe K (2013) Archetypal scientists. J Inf Secur 7:345–356. https://doi.org/10.1016/j.joi.2012.11.013

  62. van Solingen R, Basili V, Caldiera G, Rombach HD (2002) Goal question metric (GQM) approach, in: encyclopedia of software engineering. Am Cancer Soc. https://doi.org/10.1002/0471028959.sof142

  63. Thøgersen JC, Mørup M, Damkiær S, Molin S, Jelsbak L (2013) Archetypal analysis of diverse Pseudomonas aeruginosa transcriptomes reveals adaptation in cystic fibrosis airways. BMC Bioinformatics 14:279. https://doi.org/10.1186/1471-2105-14-279

  64. Tornhill A (2018) Assessing technical debt in automated tests with CodeScene, in: 2018 IEEE international conference on software testing, verification and validation workshops (ICSTW). Presented at the 2018 IEEE international conference on software testing, verification and validation workshops (ICSTW), pp. 122–125. https://doi.org/10.1109/ICSTW.2018.00039

  65. Tsanousa A, Laskaris N, Angelis L (2015) A novel single-trial methodology for studying brain response variability based on archetypal analysis. Expert Syst Appl 42:8454–8462. https://doi.org/10.1016/j.eswa.2015.06.058

  66. Veado L, Vale G, Fernandes E, Figueiredo E (2016) TDTool: threshold derivation tool, in: proceedings of the 20th international conference on evaluation and assessment in software engineering, EASE ‘16. ACM, New York, NY, USA, pp. 24:1–24:5. https://doi.org/10.1145/2915970.2916014

  67. Watson PF, Petrie A (2010) Method agreement analysis: a review of correct methodology. Theriogenology 73:1167–1179. https://doi.org/10.1016/j.theriogenology.2010.01.003

  68. Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction, International Series in Software Engineering. Springer US

  69. Xiao L, Cai Y, Kazman R (2014a) Titan: a toolset that connects software architecture with quality analysis, in: proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, FSE 2014. Association for Computing Machinery, Hong Kong, pp 763–766. https://doi.org/10.1145/2635868.2661677

  70. Xiao L, Cai Y, Kazman R (2014b) Design rule spaces: a new form of architecture insight, in: proceedings of the 36th international conference on software engineering, ICSE 2014. Association for Computing Machinery, Hyderabad, pp 967–977. https://doi.org/10.1145/2568225.2568241

  71. Yamashita A (2015) Experiences from performing software quality evaluations via combining benchmark-based metrics analysis, software visualization, and expert assessment, in: 2015 IEEE international conference on software maintenance and evolution (ICSME). Presented at the 2015 IEEE international conference on software maintenance and evolution (ICSME), pp. 421–428. https://doi.org/10.1109/ICSM.2015.7332493

  72. Zazworka N, Vetro’ A, Izurieta C, Wong S, Cai Y, Seaman C, Shull F (2014) Comparing four approaches for technical debt identification. Softw Qual J 22:403–426. https://doi.org/10.1007/s11219-013-9200-8

  73. Zuur A, Ieno EN, Walker N, Saveliev AA, Smith GM (2009) Mixed effects models and extensions in ecology with R, statistics for biology and health. Springer-Verlag, New York https://doi.org/10.1007/978-0-387-87458-6

Acknowledgements

This research is funded by the University of Macedonia Research Committee as part of the “Principal Research 2019” funding program.

Author information


Corresponding author

Correspondence to Nikolaos Mittas.

Additional information

This paper has been awarded the Empirical Software Engineering (EMSE) open science badge.

Communicated by: Forrest Shull

Appendix

In this Appendix we report a sensitivity analysis conducted to examine the variability of the classes belonging to the Max-Ruler archetype in terms of their a-coefficients. After finding the archetypes and expressing the original data points through their a-coefficients, the transformed data are of a special type. Specifically, the new variables are non-negative and their sum is fixed, equal to 1. Such data are called compositional or proportional data, and the new variables have an inherent correlation that raises new problems. Indeed, classical statistical methodologies should not be applied (Aitchison 1982), since their principles are violated. There is a whole class of theories, methods and tools for analyzing such data, belonging to a special branch of statistics, Compositional Data Analysis (CoDA), introduced by Aitchison (1982). These methods have also been used in the context of software engineering (Chatzipetrou et al. 2015; 2010).

Briefly, the sample space of the compositional data (the a-coefficients in our case) is a simplex satisfying \( {\sum}_{j=1}^k{a}_{ij}=1 \) with \( a_{ij}\ge 0 \) and i = 1, …, n, as already defined in Eq. (4) in the manuscript. In our study, we adopted certain suggested CoDA methodologies, specifically the centered log-ratio transformation (clr) and an imputation method for zeros, which cause problems in the analysis, known as the multiplicative replacement strategy (Martín-Fernández et al. 2003).
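To make these two preprocessing steps concrete, the following sketch (with hypothetical a-coefficient values, not data from the study) applies the multiplicative replacement of zeros and then the clr transformation:

```python
import numpy as np

def multiplicative_replacement(x, delta=1e-6):
    """Impute zero parts of a composition with a small delta,
    shrinking the non-zero parts so the vector still sums to 1
    (multiplicative replacement, Martin-Fernandez et al. 2003)."""
    x = np.asarray(x, dtype=float)
    zeros = (x == 0)
    if not zeros.any():
        return x
    out = x * (1.0 - delta * zeros.sum())  # shrink non-zero parts
    out[zeros] = delta                     # fill zeros with delta
    return out

def clr(x):
    """Centered log-ratio transform: log of each part relative to
    the geometric mean of the composition."""
    x = np.asarray(x, dtype=float)
    g = np.exp(np.log(x).mean())
    return np.log(x / g)

a = np.array([0.7, 0.3, 0.0])   # a-coefficients with one zero part
a_imp = multiplicative_replacement(a)
z = clr(a_imp)                  # clr coordinates sum to zero
```

The clr coordinates live in ordinary Euclidean space, which is what makes standard multivariate machinery applicable to compositional data.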

After this preprocessing, we computed, for each Max-Ruler archetype and its corresponding classes, a global measure of spread proposed by Pawlowsky-Glahn and Egozcue (2001), namely the metric standard deviation (MSD), which can be computed via the following formula

$$ MSD=\sqrt{\frac{1}{k-1} MVAR(X)} $$
(6)

where \( MVAR(X)=\frac{1}{m-1}{\sum}_{i=1}^m{d}^2\left({\mathbf{x}}_i,\overline{\mathbf{x}}\right) \) represents the metric variance, with d the Aitchison distance between compositions, \( \overline{\mathbf{x}} \) the center of the compositions, and m the number of classes for the Max-Ruler archetype.

Appendix Table 8 summarizes the MSD values for the set of experiments conducted through the sensitivity analysis, following an approach similar to the analysis of percentages of high-TD classes presented above. The general intuition from inspecting the estimated MSD values is that the spread of the classes increases as the threshold a increases only for JavaScript projects, whereas the spread appears to remain roughly constant for Java projects. Indeed, the comparison of the two LME models (the full model vs. the model without the interaction term) indicated a statistically significant difference (χ2 = 39.786, p < 0.001), and thus the interaction term should be retained in the final model. This practically means that the spread depends on the combination of the levels of the two examined factors (Threshold and Language). In this regard, the post-hoc analysis through Tukey’s HSD test did not reveal a statistically significant difference for any pair-wise comparison conducted on the levels of factor Threshold for Java projects. In contrast, statistically significant differences were noted for specific levels of factor Threshold for JavaScript projects, forming four overlapping homogeneous groups: A = {0.60, 0.65, 0.70}, B = {0.65, 0.70, 0.75}, C = {0.70, 0.75, 0.80} and D = {0.80, 0.85, 0.90}.
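The model comparison above is a likelihood-ratio test: the χ2 statistic is referred to a chi-square distribution whose degrees of freedom equal the number of parameters dropped with the interaction term. The sketch below recomputes the p-value in pure Python; the df value (6) is an assumption for illustration only, since the text reports just the statistic:

```python
import math

def chi2_sf_even_df(x, df):
    """Chi-square survival function for even df, via the closed-form
    Poisson series: P(X > x) = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    assert df % 2 == 0 and df > 0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# chi-square statistic from the LME model comparison; df = 6 is
# an assumed value, not taken from the text
p = chi2_sf_even_df(39.786, 6)
```

For this assumed df the resulting p-value is far below 0.001, consistent with retaining the interaction term.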

Table 8 Estimated mean MSD with 95% CI for each threshold value a (sensitivity analysis)

About this article

Cite this article

Amanatidis, T., Mittas, N., Moschou, A. et al. Evaluating the agreement among technical debt measurement tools: building an empirical benchmark of technical debt liabilities. Empir Software Eng 25, 4161–4204 (2020). https://doi.org/10.1007/s10664-020-09869-w

Keywords

  • Technical debt
  • Software quality
  • Archetypal analysis
  • Inter-rater agreement
  • Empirical benchmark