
Standardized code quality benchmarking for improving software maintainability


We provide an overview of the approach developed by the Software Improvement Group for code analysis and quality consulting focused on software maintainability. The approach uses a standardized measurement model based on the ISO/IEC 9126 definition of maintainability and source code metrics. Procedural standardization in evaluation projects further enhances the comparability of results. Individual assessments are stored in a repository that allows any system at hand to be compared to the industry-wide state of the art in code quality and maintainability. When a minimum level of software maintainability is reached, the certification body of TÜV Informationstechnik GmbH issues a Trusted Product Maintainability certificate for the software product.




  1. Alves, T. L., Ypma, C., & Visser, J. (2010). Deriving metric thresholds from benchmark data. In 26th IEEE international conference on software maintenance (ICSM 2010), September 12–18, 2010, Timisoara, Romania.

  2. Atos Origin. (2006). Method for qualification and selection of open source software (QSOS), version 1.6.

  3. Bijlsma, D. (2010). Indicators of issue handling efficiency and their relation to software maintainability. Master’s thesis, University of Amsterdam.

  4. Bouwers, E., & Vis, R. (2008). Multidimensional software monitoring applied to ERP. In C. Makris & J. Visser (Eds.), Proceedings of 2nd international workshop on software quality and maintainability. Elsevier, ENTCS, to appear.

  5. Bouwers, E., Visser, J., & van Deursen, A. (2009). Criteria for the evaluation of implemented architectures. In 25th IEEE international conference on software maintenance (ICSM 2009) (pp. 73–82). IEEE, Edmonton, Alberta, Canada, September 20–26, 2009.

  6. Correia, J., & Visser, J. (2008a). Benchmarking technical quality of software products. In WCRE ’08: Proceedings of the 2008 15th working conference on reverse engineering (pp. 297–300). IEEE Computer Society, Washington, DC, USA.

  7. Correia, J. P., & Visser, J. (2008b). Certification of technical quality of software products. In L. Barbosa, P. Breuer, A. Cerone, & S. Pickin (Eds.), International workshop on foundations and techniques bringing together free/libre open source software and formal methods (FLOSS-FM 2008) & 2nd international workshop on foundations and techniques for open source certification (OpenCert 2008) (pp. 35–51). United Nations University, International Institute for Software Technology (UNU-IIST), Research Report 398.

  8. Correia, J. P., Kanellopoulos, Y., & Visser, J. (2009). A survey-based study of the mapping of system properties to ISO/IEC 9126 maintainability characteristics. In 25th IEEE international conference on software maintenance (ICSM 2009) (pp. 61–70), September 20–26, 2009. Edmonton, Alberta, Canada: IEEE.

  9. Deprez, J. C., & Alexandre, S. (2008). Comparing assessment methodologies for free/open source software: OpenBRR and QSOS. In PROFES.

  10. Golden, B. (2005). Making open source ready for the enterprise: The open source maturity model, whitepaper available from,

  11. Heck, P., & van Eekelen, M. (2008). The LaQuSo software product certification model: (LSPCM). Tech. Rep. 08-03, Tech. Univ. Eindhoven.

  12. Heitlager, I., Kuipers, T., & Visser, J. (2007). A practical model for measuring maintainability. In 6th international conference on the quality of information and communications technology (QUATIC 2007) (pp. 30–39). IEEE Computer Society.

  13. International Organization for Standardization. (1996). ISO/IEC Guide 65: General requirements for bodies operating product certification systems.

  14. International Organization for Standardization. (1999). ISO/IEC 14598-1: Information technology - software product evaluation - part 1: General overview.

  15. International Organization for Standardization. (2001). ISO/IEC 9126-1: Software engineering—product quality—part 1: Quality model.

  16. International Organization for Standardization. (2004). ISO/IEC 15504: Information technology—process assessment.

  17. International Organization for Standardization. (2005a). ISO/IEC 15408: Information technology—security techniques—evaluation criteria for IT security.

  18. International Organization for Standardization. (2005b). ISO/IEC 17025: General requirements for the competence of testing and calibration laboratories.

  19. International Organization for Standardization. (2006). ISO/IEC 25051: Software engineering—software product quality requirements and evaluation (SQuaRE)—requirements for quality of commercial off-the-shelf (COTS) software product and instructions for testing.

  20. International Organization for Standardization. (2008). ISO/IEC 9241: Ergonomics of human-system interaction.

  21. Izquierdo-Cortazar, D., Gonzalez-Barahona, J. M., Robles, G., Deprez, J. C., & Auvray, V. (2010). FLOSS communities: Analyzing evolvability and robustness from an industrial perspective. In Proceedings of the 6th international conference on open source systems (OSS 2010).

  22. Jones, C. (2000). Software assessments, benchmarks, and best practices. Reading: Addison-Wesley.

  23. Kuipers, T., & Visser, J. (2004a). A tool-based methodology for software portfolio monitoring. In M. Piattini & M. Serrano (Eds.), Proceedings of 1st international workshop on software audit and metrics (SAM 2004) (pp. 118–128). INSTICC Press.

  24. Kuipers, T., & Visser, J. (2004b). A tool-based methodology for software portfolio monitoring. In M. Piattini et al. (Eds.), Proceedings of 1st international workshop on software audit and metrics (SAM 2004) (pp. 118–128). INSTICC Press.

  25. Kuipers, T., Visser, J., & de Vries, G. (2007). Monitoring the quality of outsourced software. In J. van Hillegersberg et al. (Eds.), Proceedings of international workshop on tools for managing globally distributed software development (TOMAG 2007), Center for Telematics and Information Technology, Netherlands.

  26. Lokan, C. (2008). The Benchmark Release 10—project planning edition. Tech. rep., International Software Benchmarking Standards Group Ltd.

  27. Luijten, B., & Visser, J. (2010). Faster defect resolution with higher technical quality of software. In 4th international workshop on software quality and maintainability (SQM 2010), March 15, 2010, Madrid, Spain.

  28. McCabe, T. J. (1976). A complexity measure. In ICSE ’76: Proceedings of the 2nd international conference on software engineering (p. 407). Los Alamitos, CA, USA: IEEE Computer Society Press.

  29. OpenBRR.org. (2005). Business readiness rating for open source, request for comment 1.

  30. Simon, F., Seng, O., & Mohaupt, T. (2006). Code quality management: Technische Qualität industrieller Softwaresysteme transparent und vergleichbar gemacht. Heidelberg, Germany: Dpunkt-Verlag.

  31. Software Improvement Group (SIG) and TÜV Informationstechnik GmbH (TÜViT). (2009). SIG/TÜViT evaluation criteria—Trusted Product Maintainability, version 1.0.

  32. Software Productivity Research. (2007). Programming languages table (version 2007d).

  33. van Deursen, A., & Kuipers, T. (2003). Source-based software risk assessment. In ICSM ’03: Proceedings of international conference on software maintenance (p. 385). IEEE Computer Society.

  34. van Hooren, M. (2009). KAS BANK and SIG - from legacy to software certified by TÜViT. Banking and Finance.


Author information

Correspondence to José Pedro Correia.

Appendix: The quality model


The SIG has developed a layered model for measuring and rating the technical quality of a software system in terms of the quality characteristics of ISO/IEC 9126 (Heitlager et al. 2007). The layered structure of the model is illustrated in Fig. 6.

Fig. 6

Relation between source code metrics and system subcharacteristics of maintainability (image taken from Luijten and Visser 2010)

This appendix describes the current state of the quality model, which has been improved and further operationalized since Heitlager et al. (2007).

Source code metrics are used to collect facts about a software system. The measured values are combined and aggregated to provide information on properties at the level of the entire system, which are then mapped into higher level ratings that directly relate to the ISO/IEC 9126 standard. These ratings are presented using a five star system (from \(\star\) to \(\star \star \star \star \star\)), where more stars mean better quality.

Source code measurements

In order to make the product properties measurable, the following metrics are calculated:

  • Estimated rebuild value: The software product’s rebuild value is estimated from the number of lines of code. This value is calculated in man-years using the Programming Languages Table of Software Productivity Research (2007). This metric is used to evaluate the volume property;

  • Percentage of redundant code: A line of code is considered redundant if it is part of a code fragment (larger than 6 lines of code) that is repeated literally (modulo white-space) in at least one other location in the source code. The percentage of redundant lines of code is used to evaluate the duplication property;

  • Lines of code per unit: The number of lines of code in each unit. A unit is defined as the smallest piece of invokable code, excluding labels (for example a function or procedure). This metric is used to evaluate the unit size property;

  • Cyclomatic complexity per unit: The cyclomatic complexity (McCabe 1976) of each unit. This metric is used to evaluate the unit complexity property;

  • Number of parameters per unit: The number of parameters declared in the interface of each unit. This metric is used to evaluate the unit interfacing property;

  • Number of incoming calls per module: The number of incoming invocations for each module. A module is defined as a delimited group of units (for example a class or file). This metric is used to evaluate the module coupling property.

From source code measurements to source code property ratings

To evaluate measurements at the source code level as property ratings at the system level, we make use of just a few simple techniques. In case the metric is more relevant as a single value for the whole system, we use thresholding to calculate the rating. For example, for duplication we use the amount of duplicated code in the system, as a percentage, and perform thresholding according to the following values:

| Rating | Maximum duplication |
|---|---|
| \(\star \star \star \star \star\) | 3% |
| \(\star \star \star \star\) | 5% |
| \(\star \star \star\) | 10% |
| \(\star \star\) | 20% |

The interpretation of this table is that the values on the right are the maximum values the metric can have that still warrant the rating on the left. Thus, to be rated as \(\star \star \star \star \star\) a system can have no more than 3% duplication, and so forth.
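The thresholding just described can be sketched in a few lines of Python. This is only an illustration: the names `DUPLICATION_THRESHOLDS` and `rate_by_threshold` are hypothetical, and the one-star fallback for values above 20% is inferred from the table rather than stated in it.

```python
# Maximum duplication percentage that still warrants each star rating,
# copied from the table above. Values above 20% are assumed to get one star.
DUPLICATION_THRESHOLDS = [(5, 3.0), (4, 5.0), (3, 10.0), (2, 20.0)]


def rate_by_threshold(value, thresholds):
    """Return the highest star rating whose maximum value is not exceeded."""
    for stars, maximum in thresholds:
        if value <= maximum:
            return stars
    return 1  # below every threshold row: lowest rating


print(rate_by_threshold(2.4, DUPLICATION_THRESHOLDS))   # 5
print(rate_by_threshold(7.5, DUPLICATION_THRESHOLDS))   # 3
print(rate_by_threshold(35.0, DUPLICATION_THRESHOLDS))  # 1
```

The thresholds are passed as a parameter so that the same function could serve any other system-level metric rated in this way.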

In case the metric is more relevant at the unit level, we make use of so-called quality profiles. As an example, let us take a look at how the rating for unit complexity is calculated. First the cyclomatic complexity index (McCabe 1976) is calculated for each code unit (where a unit is the smallest piece of code that can be executed and tested individually, for example a Java method or a C function). The values for individual units are then aggregated into four risk categories (following a similar categorization of the Software Engineering Institute), as indicated in the following table:

| Cyclomatic complexity | Risk category |
|---|---|
| 1–10 | Low risk |
| 11–20 | Moderate risk |
| 21–50 | High risk |
| >50 | Very high risk |

For each category, the relative volumes are computed by summing the lines of code of the units that fit in that category, and dividing by the total lines of code in all units. These percentages are finally rated using a set of thresholds, defined as in the following example:

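| Rating | Maximum relative volume, moderate risk | Maximum relative volume, high risk | Maximum relative volume, very high risk |
|---|---|---|---|
| \(\star \star \star \star \star\) | 25% | 0% | 0% |
| \(\star \star \star \star\) | 30% | 5% | 0% |
| \(\star \star \star\) | 40% | 10% | 0% |
| \(\star \star\) | 50% | 15% | 5% |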

Note that this rating scheme is designed to progressively give more importance to categories with more risk. The first category (‘low risk’) is not shown in the table since it is the complement of the sum of the other three, adding up to 100%. Other properties have similar evaluation schemes relying on different categorization and thresholds. The particular thresholds are calibrated per property, against a benchmark of systems.
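As a rough illustration of the quality-profile calculation for unit complexity, the following Python sketch builds a risk profile and rates it against the example thresholds. The `Unit` type and all function names are hypothetical, the sample units are made up, and the code assumes a non-empty list of units and a one-star rating when no row of the table is satisfied.

```python
from collections import namedtuple

# A unit is described here only by its size and its cyclomatic complexity.
Unit = namedtuple("Unit", ["lines_of_code", "complexity"])

# Maximum relative volume (percent of lines of code) per risk category,
# copied from the example table: (stars, moderate, high, very high).
PROFILE_THRESHOLDS = [
    (5, 25.0, 0.0, 0.0),
    (4, 30.0, 5.0, 0.0),
    (3, 40.0, 10.0, 0.0),
    (2, 50.0, 15.0, 5.0),
]


def risk_category(complexity):
    """Map a cyclomatic complexity value to its risk category."""
    if complexity <= 10:
        return "low"
    if complexity <= 20:
        return "moderate"
    if complexity <= 50:
        return "high"
    return "very high"


def risk_profile(units):
    """Return the percentage of lines of code that falls in each category."""
    total = sum(u.lines_of_code for u in units)
    profile = {"low": 0.0, "moderate": 0.0, "high": 0.0, "very high": 0.0}
    for u in units:
        profile[risk_category(u.complexity)] += 100.0 * u.lines_of_code / total
    return profile


def rate_profile(profile):
    """Return the highest rating whose three maxima are all respected."""
    for stars, moderate, high, very_high in PROFILE_THRESHOLDS:
        if (profile["moderate"] <= moderate
                and profile["high"] <= high
                and profile["very high"] <= very_high):
            return stars
    return 1  # assumed fallback when even the two-star row is exceeded


# Made-up system: 100 lines of code spread over four units.
units = [Unit(60, 3), Unit(25, 12), Unit(10, 25), Unit(5, 8)]
profile = risk_profile(units)
print(profile)                # low 65%, moderate 25%, high 10%, very high 0%
print(rate_profile(profile))  # 3
```

In this example the 10% of code in high-risk units rules out four and five stars, even though the moderate-risk volume alone would allow five.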

Compared with other kinds of aggregation (such as summary statistics like the mean or median), such quality profiles have the advantage that sufficient information is retained to make significant quality differences between systems detectable (see Alves et al. (2010) for a more detailed discussion).

The evaluation of source code properties is first done separately for each different programming language, and subsequently aggregated into a single property rating by weighted average, according to the relative volume of each programming language in the system.
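A minimal sketch of this aggregation step follows. The function name and the sample figures are hypothetical, and the sketch assumes that the volume of each language is expressed in lines of code, which the text does not specify.

```python
def aggregate_by_language(ratings, volumes):
    """Volume-weighted average of per-language ratings for one property.

    ratings: language -> property rating for the code in that language
    volumes: language -> volume of that language (assumed: lines of code)
    """
    total = sum(volumes[lang] for lang in ratings)
    return sum(ratings[lang] * volumes[lang] for lang in ratings) / total


# Made-up system: three quarters Java, one quarter SQL.
ratings = {"Java": 4.0, "SQL": 2.0}
volumes = {"Java": 75000, "SQL": 25000}
print(aggregate_by_language(ratings, volumes))  # 3.5
```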

The specific thresholds used are calculated and calibrated on a periodic basis based on a large set of software systems, as described in Section 2.4.

Continuous scale

The calculation of ratings from source code metrics has so far been described in terms of discrete quality levels. These values need to be further combined and aggregated, and a discrete scale is not adequate for that purpose. We thus use the discrete scale for describing the evaluation schemes, but use interpolation to obtain ratings on a continuous scale in the half-open interval [0.5, 5.5[ (0.5 included, 5.5 excluded). An equivalence between the two scales is established so that the behavior described in terms of the discrete scale is preserved.

Let us consider a correspondence of the discrete scale to a continuous one where \(\star\) corresponds to 1, \(\star \star\) to 2 and so forth. Thresholding as it was described can then be seen as a step function, defined, for the example of duplication (d), as:

$$ rating(d) = \left\{ \begin{array}{ll} 5 & \hbox {if } d \leq 3\%\\ 4 & \hbox {if } 3\% < d \leq 5\%\\ 3 & \hbox {if } 5\% < d \leq 10\%\\ 2 & \hbox {if } 10\% < d \leq 20\%\\ 1 & \hbox {if } d > 20\%\\ \end{array} \right. $$

This step function can be converted into a continuous piecewise linear function as follows:

  1. In order for the function to be continuous, the value at the limit between two steps (for example, the point 3%, which lies between the steps with values 4 and 5) should be midway between the two steps’ values (for the point 3%, this is (4 + 5)/2 = 4.5). Thus, for example, rating(5%) = 3.5 and rating(10%) = 2.5;

  2. Values between limits are computed by linear interpolation using the limit values. For example, rating(5.1%) = 3.5 - 0.1/5 = 3.48 and rating(7.5%) = 3.

The equivalence to the discrete scale is established by arithmetic rounding (round half up).

This approach has the advantage of providing more precise ratings. With the first approach we have, for example, rating(5.1%) = rating(10%) = 3, whereas with the second approach we have rating(5.1%) = 3.48, which rounds to 3, and rating(10%) = 2.5, which also rounds to 3. Thus, one can distinguish a system with 5.1% duplication from another one with 10%, while still preserving the originally described behavior.
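The conversion from the step function to a piecewise linear one can be sketched as follows for the duplication example. The function names are hypothetical. The text does not say how the scale is anchored below 3% or above 20%, so the end points used here (5.5 at 0% and 0.5 at 100%) are assumptions made only so that the sketch covers the whole range.

```python
import math

# (duplication %, rating) at each limit between two steps: the midpoint of
# the two adjacent discrete ratings, as described in the text.
LIMITS = [(3.0, 4.5), (5.0, 3.5), (10.0, 2.5), (20.0, 1.5)]

# ASSUMPTION: end points of the scale, not given in the text.
ASSUMED_ENDS = [(0.0, 5.5), (100.0, 0.5)]


def continuous_rating(d):
    """Piecewise linear rating for a duplication percentage d in [0, 100]."""
    knots = [ASSUMED_ENDS[0]] + LIMITS + [ASSUMED_ENDS[1]]
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if x0 <= d <= x1:
            return y0 + (y1 - y0) * (d - x0) / (x1 - x0)
    raise ValueError("duplication must be between 0 and 100 percent")


def to_stars(rating):
    """Round half up, clamped to the one-to-five star range."""
    return min(5, max(1, math.floor(rating + 0.5)))


for d in (5.1, 7.5, 10.0):
    r = continuous_rating(d)
    print(d, round(r, 2), to_stars(r))
```

For 5.1% the sketch yields about 3.48 (3.4 when truncated to one decimal), for 7.5% it yields 3.0 and for 10% it yields 2.5; all three round to three stars, matching the discrete scheme.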

The technique is also applied to the evaluation schemes for quality profiles of a certain property. Namely, interpolation is performed per risk category, resulting in three provisional ratings of which the minimum is taken as the final rating for that property.

From source code property ratings to ISO/IEC 9126 ratings

Property ratings are mapped to ratings for ISO/IEC 9126 subcharacteristics of maintainability following dependencies summarized in a matrix (see Table 3).

Table 3 Mapping of source code properties to ISO/IEC 9126 subcharacteristics

In this matrix, a × is placed whenever a property is deemed to have an important impact on a certain subcharacteristic. These impacts were decided upon by a group of experts and have further been studied in Correia et al. (2009).

The subcharacteristic rating is obtained by averaging the ratings of the properties where a × is present in the subcharacteristic’s line in the matrix. For example, changeability is represented in the model as affected by duplication, unit complexity and module coupling, thus its rating will be computed by averaging the ratings obtained for those properties.

Finally, all subcharacteristic ratings are averaged to provide the overall maintainability rating.
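The two averaging steps can be illustrated with a short Python sketch. Only the changeability row of the mapping is spelled out in the text, so that is the only row included; the remaining rows of Table 3 would be added to the dictionary in the same way. The function names and the property ratings in the example are hypothetical.

```python
# Rows of the mapping matrix: subcharacteristic -> properties marked with a x.
# Only changeability is listed because it is the only row given in the text.
MAPPING = {
    "changeability": ["duplication", "unit complexity", "module coupling"],
}


def mean(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def subcharacteristic_ratings(property_ratings, mapping):
    """Average the ratings of the properties mapped to each subcharacteristic."""
    return {
        sub: mean([property_ratings[p] for p in props])
        for sub, props in mapping.items()
    }


def maintainability_rating(sub_ratings):
    """Average all subcharacteristic ratings into the overall rating."""
    return mean(list(sub_ratings.values()))


# Made-up property ratings on the continuous scale.
property_ratings = {
    "duplication": 3.0,
    "unit complexity": 2.5,
    "module coupling": 3.5,
}
subs = subcharacteristic_ratings(property_ratings, MAPPING)
print(subs)                          # {'changeability': 3.0}
print(maintainability_rating(subs))  # 3.0 (with the full matrix, all rows count)
```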


About this article

Cite this article

Baggen, R., Correia, J.P., Schill, K. et al. Standardized code quality benchmarking for improving software maintainability. Software Qual J 20, 287–307 (2012).



  • Software product quality
  • Benchmarking
  • Certification
  • Standardization