1 Introduction

Code quality is a central factor in software development and there is an abundance of advice on how to write “good code,” e.g. (Green and Ledgard 2011; Martin 2008; Sutter and Alexandrescu 2004; Vermeulen et al. 2000). Despite a large body of work on software quality measurement (Fenton and Bieman 2014; Nuñez-Varela et al. 2017), there is little conclusive evidence on the relationship between software quality measurements and common software quality attributes (Jabangwe et al. 2015; Jørgensen 1999). Furthermore, many common software measures lose their predictive power when controlled for program size (Gil and Lalouche 2017; Tahir et al. 2018) and do not capture quality improvements as perceived by software developers (Al Dallal and Abdin 2018; Pantiuchina et al. 2018).

Research also shows that quality issues at the code level increase the cognitive load of developers (Fakhoury et al. 2018; Müller and Fritz 2016), which affects their performance negatively (Ko and Myers 2005; Sweller et al. 2011). Since a considerable amount of effort is dedicated to software quality during its development (Menzies et al. 2006), a better understanding of developers’ perceptions of software quality on code level might help to utilize this effort more effectively.

In this study, we investigate how software developers perceive code quality, which quality aspects they perceive as more or less important and what they recommend to achieve high code quality. The study uses interview data from 34 professional software developers, programming teachers and students with an average of 5.8 years of experience as professional software developers, whom we categorized into three groups based on their number of years of experience as professional programmers. The interviews were carried out in the context of an earlier study (Börstler et al. 2018), but the majority of the data used here has neither been coded nor analyzed before (see Section 3.2 for details).

This study makes the following contributions.

  1. A qualitative study providing insights on aspects of code quality that are considered important by developers.

  2. A study design package to facilitate the design and execution of similar studies.

  3. An anonymized data-set for further analysis.

The organisation of the paper is as follows. Section 2 discusses related work on defining software quality. The background and experimental methodology are presented in Section 3. Empirical results are presented and analyzed in Section 4. A discussion of the findings follows in Section 5. Section 6 summarizes and concludes the paper. Information about the study design package and the anonymized data-set can be found in Appendix B.

2 Related Work

Over the years, several software quality models and standards have been developed to provide frameworks for defining and evaluating software quality (Nistala et al. 2019; Sadeghzadeh Hemayati and Rashidi 2018). The most recent ones are ISO/IEC 25010 (ISO/IEC 2011) and ISO/IEC 5055:2021 (Curtis et al. 2022). Whereas ISO/IEC 25010 defines quality in terms of eight quality characteristics, e.g., maintainability, ISO/IEC 5055 defines quality in terms of “weaknesses” that threaten the software’s reliability, security, performance efficiency, and maintainability. Although ISO/IEC 25010 (and its predecessors) provide definitions for the quality characteristics (and their sub-characteristics), Bakota et al. (2011) argue that the quality models do not deal with ambiguity, which makes it hard to measure these characteristics.

A systematic mapping study on source code metrics (Nuñez-Varela et al. 2017) covering 2010–2015 identified 226 studies using more than 300 source code metrics in total. Although many of those metrics have been around for a long time, there still is little empirical evidence that connects source code metrics to software quality attributes. Pantiuchina et al. (2018), for example, mined commits with messages that clearly showed that the developers aimed to improve the cohesion, coupling, complexity and readability of the underlying source code. Their analysis shows that most of the considered code quality metrics were not able to capture the quality improvements as perceived by the developers.

There is evidence, though, showing that there is a relationship between low-level code features and code quality. Butler et al. (2010), for example, showed that flawed identifier names are associated with low code quality. Recent research in program comprehension, moreover, shows that misleading names are more problematic than meaningless names (Avidan and Feitelson 2017), but that specific one-letter variables still can convey meaning (Beniamini et al. 2017). It has also been shown that structural differences at code level affect program understanding, for example, that for-loops are significantly harder to understand than if-statements (Ajami et al. 2017) and that “maintaining undisciplined annotations is more time consuming and error prone” than maintaining disciplined ones (Malaquias et al. 2017).

Several studies have focused on how refactoring improves code quality (Alshayeb 2009; Chatzigeorgiou and Manakos 2010; Peters and Zaidman 2012; Kim et al. 2014; Tufano et al. 2017; Al Dallal and Abdin 2018). Alshayeb (2009) quantitatively assessed the effect of refactoring activities on the external quality attributes adaptability, maintainability, understandability, reusability and testability and concludes that refactoring does not necessarily improve these quality attributes. Chatzigeorgiou and Manakos (2010), Peters and Zaidman (2012), and Kim et al. (2014) examine historical software data to evaluate the lifespan and evolution of code smells and the usage of refactoring to remove these code smells. A large-scale study comprising a survey, interviews and quantitative analysis of historical versions of Windows 7 (Kim et al. 2014) revealed that only 5% of the studied modules were improved (low inter-dependencies and complexities) when they were refactored. Similarly, Tufano et al. (2017) studied the change history of over 200 open source projects and observed that only 9% of the code smells were removed as a result of refactoring operations. Al Dallal and Abdin (2018) corroborate the findings of Kim et al. and Tufano et al. that refactoring does not always improve code quality. This indicates that refactoring alone is not effective for “fixing” bad code.

Research on inspections shows a positive impact on quality (Bavota and Russo 2015; Kollanus and Koskinen 2009). Several studies have focused on code reviews to check and improve the quality of code submitted into a master code repository (Hatton 2008; Kononenko et al. 2016; Rong et al. 2014). Kononenko et al. (2016) investigated factors that influence the review time, review decision and review quality as perceived by Mozilla core developers. Their open coding of the survey questions identified “code quality” and “understanding code change/base” as the overall most important themes. An exploratory study by Sadowski et al. (2018) on modern code reviews at Google reveals that “the foremost reason for introducing code review at Google was to improve code understandability and maintainability.” Google’s mandatory code reviews involve manual readability approval of developers having readability certification (Wintus et al. 2020). This entails an expensive infrastructure. Reliable metrics for measuring or predicting readability would therefore have a significant positive impact on code quality and productivity. A recent review on modern code reviews (Davila and Nunes 2021) corroborates that “code comprehension has been the main challenge faced by developers when reviewing a code change.”

The literature lacks studies exploring practitioners’ opinions and there are few studies on lower-level aspects of source code quality, like the one presented here. Bakota et al. (2011) suggest a probabilistic approach for computing maintainability (based on the definition in ISO/IEC 9126, the predecessor of ISO/IEC 25010) with the help of input from 28 experts. An evaluation on two software systems showed that their results aligned with the expectations of experts. Correia et al. (2009) interviewed 22 experts consisting of software engineers, software quality consultants and researchers to define a mapping from 9 system properties (e.g., system size and unit complexity) to the maintainability sub-characteristics of ISO/IEC 9126 and observed that there was no consensus on the mapping. Gorla and Lin (2010) explored organizational, individual and technological factors that impact IS project managers’ perception of five software quality attributes (reliability, ease-of-use, maintainability, usefulness, and relevance). Based on the questionnaire results from 112 IS project managers, the authors conclude “that organizational factors are more important than technological factors in impacting software quality in IS projects.” Antinyan et al. (2017) carried out an online survey with 100 respondents from industry and academia to investigate how much certain code characteristics affect 4 predefined internal quality attributes as well as maintainability time. They conclude that lack of structure and nesting depth substantially increase code complexity, which subsequently influences the maintenance time of code.

Taken together, existing research may benefit from a better alignment between software developers’ perceptions of quality and the measures used to assess software quality. In this work, we therefore want to investigate software developers’ discourse about code quality and their perceptions of code quality. Our study is unique in asking for practitioners’ opinions unconditionally, i.e., without a preconceived definition or opinion about code quality.

All studies discussed in the paragraph above are mainly confirmatory and most are focused on maintainability. Except for Bakota et al., they all suggest predefined models or definitions of quality characteristics and use closed questions. In Bakota et al., the experts are involved in defining weights for the model. Of the four studies, two are interview-based studies with fewer respondents than our study. Furthermore, none of the four studies uses open questions and open coding.

3 Method

Our overall goal was to investigate the perceptions of code quality held by software developers and in which ways these perceptions vary with respect to the developers’ background and experience.

3.1 Research Questions

In this research, we follow an exploratory approach to elicit software developers’ perceptions of code quality unconditionally, i.e., without preconceived definitions or hypotheses about quality. We are interested in developers’ “abstract” or general views about code quality (when they talk about code quality in general terms) as well as their “concrete” views when they reason about the properties of specific code examples.

In our earlier work (Börstler et al. 2018), we focused on participants’ “abstract” perceptions of code quality. Here, we investigate in more detail how participants talk and reason about code quality when referring to actual code examples, and in which ways this “concrete discourse” differs from their “abstract perceptions”.

  • RQ1: Which abstract perceptions of code quality do software developers have?

  • RQ1.1: Which properties do they focus on when defining code quality?

  • RQ1.2: Which properties and practices do they focus on when suggesting quality improvements?

  • RQ2: Which properties related to code quality do software developers focus on in actual example code?

  • RQ3: In which ways do developers’ “abstract” perceptions of code quality (see RQ1) differ from the specific properties they find relevant in actual code examples (see RQ2)?

3.2 Interview Questions

When designing the interview questions, we took an exploratory approach to get a better understanding of the perceptions of code quality. Our goal was to explore perceptions of code quality, not to test preconceived hypotheses. The interview questions were therefore framed very carefully, so that they did not introduce bias or suggest certain answers.

The interview contained three parts comprising 11 questions in total, most of them with sub-questions: Part 1 (Q1–Q3) focused on demographics and the participants’ background and experience; part 2 (Q4) focused on quality aspects of the example(s) of code that the participants were asked to bring to the interview; and part 3 (Q5–Q11) covered further questions on code quality that were independent of Q4 and the example(s) discussed there. All questions in part 2 and part 3 (except Q6) were completely open.

Except for Q4, the questions were short free-text, numeric, or Likert-type questions. All Likert-type questions used a 7-point scale where only the end values were named explicitly. For Q4, we asked the participants to bring along example code or code snippets from their personal experience that they considered to exhibit high or low quality. The rationale for this was to elicit how abstract properties or qualities of code manifest in actual first-hand code examples.

The full list of questions can be found in Appendix A. An overview of the study process and how this study relates to our earlier work (Börstler et al. 2018) can be found in Fig. 1. It should be noted that Q4 (the main part of the interview) and Q7 were neither coded nor analyzed in our previous work.

Fig. 1 Overview of the study process. The lower part of the figure shows the contributions of the work presented here in relation to our earlier work

3.3 Participant Recruiting/Sampling

Ten of the eleven authors participated in interviewing. The interviewers invited candidates to participate in the study by circulating an information sheet (see the “study design package” in Appendix B.1) in the channels available to them.

We invited students, educators, and professional software developers to participate in the study. Students were required to have completed several programming courses. Educators were required to have several years of teaching experience with courses covering programming, software design, or software quality. Professional software developers were required to deal with software development for a living, i.e., regularly read, write, test or review source code or low-level designs as a significant part of their work.

Initially, we collected basic information about potential participants (current occupation, education, level of professional programming experience, gender, and country) in a shared spreadsheet. The information in the spreadsheet was used by the interviewers to facilitate the recruiting of participants with varying backgrounds and experience.

According to our study guide (see Appendix B.1), we defined developers as “people who actually deal with software development for a living, i.e., people who regularly read, write, test or review source code or low-level designs.” We targeted 3–5 interviews per interviewer to ensure a good geographical spread and minimize the risk for a common selection bias. The interviewers came from Sweden, Germany, USA, The Netherlands, and Finland. Other than the coordination described above, the individual interviewers applied convenience sampling.

3.4 Data Collection

We used a detailed interview guide with predefined and scripted questions. The interviews took 45–60 minutes and were either conducted in person or through video calls (using Google Hangouts, Skype, or Zoom).

The first 3 questions (Q1–Q3) were filled in by the interviewers. Question Q4, the main part of the interview, was recorded and transcribed. The remaining 7 questions (Q5–Q11) were filled in directly by the interviewees.

If the interviewees brought code examples in electronic form, we captured the screen to be able to connect the discussion to particular areas of the code. Otherwise, we made annotations on the print-outs to connect areas of code to statements in the interview.

For the transcription, we developed transcription guidelines to ensure a uniform transcription of the interviews. The guidelines were based on existing guidelines (Humble 2015) that we adapted to ensure a fluent transcript style. If possible, the interviews were held in the language the interviewees felt most comfortable in; either in a language where the interviewer was a native speaker or a language where the interviewer was highly fluent. All interview transcripts were translated to English before the analysis.

One of the co-authors of the interview script tested the questions using a pilot interview with a university teacher with professional programming experience. The pilot interview took 56 minutes and was recorded (both audio and computer screen). A part of Q4 was transcribed to test the transcription guide. After a discussion of the pilot, the co-authors of the interview guide revised the interview and transcription guides slightly. The data of the pilot study was discarded and not used for the analysis presented here. The interview script as well as the transcription guide are available for download and part of the study design package (see Appendix B.1).

3.5 Coding of Open Questions

We used open coding (Creswell and Creswell 2017) to categorize the data from the open questions.

In our earlier work (Börstler et al. 2018), two groups of co-authors independently extracted labels inductively from Q5 (definitions of code quality) and Q8 (factors/indicators of high quality), respectively. As a next step, the initial labels from both groups were merged and generic categories were defined to encompass the specific data. When new categories emerged or their definition or labels changed, the whole dataset was re-categorized using the revised categories. Each label was connected to one category. This process was repeated until all data had been categorized and no new categories emerged or changed.

For the present work, we used these labels and categories as a starting point for the coding of Q4 and Q7 described below. For Q7 (recommendations for increasing code quality), we also categorized the answers into analytical and constructive approaches for quality improvement. Analytical approaches are based on assessing/observing the level of quality (e.g., static analysis), whereas constructive approaches deal with ensuring quality “by construction” (e.g., design patterns). Approaches that could not be categorized as either analytical or constructive were categorized as Other.

Question Q7 (recommendations for increasing code quality) was independently coded by four co-authors. Since Q7 added an activity aspect, new labels emerged and the original coding scheme was extended to accommodate those with as few changes as possible to existing labels and categories (see Section 4.3). The independent codings of Q7 were then merged at a common meeting together with the extended coding scheme. Differences and inconsistencies were discussed and resolved at the meeting. Inter-rater agreement (Krippendorff’s alpha, computed via ReCal Freelon 2013) for the sub-questions of Q7 was between 0.633 and 0.815, which is considered substantial agreement.
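
To give a concrete idea of how such an agreement check works, the sketch below computes Krippendorff’s alpha with the open-source Python package krippendorff (the study itself used ReCal); the coder matrix and the numeric label codes are purely hypothetical and only illustrate the nominal-data case.

```python
# Minimal sketch of an inter-rater agreement check (hypothetical data).
import numpy as np
import krippendorff

# One row per coder, one column per coded answer; np.nan marks a missing code.
# Hypothetical label codes: 1=Structure, 2=Readability, 3=Documentation, 4=Other.
reliability_data = np.array([
    [1, 2, 3, 1, 4, 2],        # coder A
    [1, 2, 3, 1, 4, 1],        # coder B
    [1, 2, 2, 1, 4, 2],        # coder C
    [1, 2, 3, 1, np.nan, 2],   # coder D (one answer not coded)
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```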

The extended coding scheme was then used for coding Q4. The coding was done in several steps to facilitate further analysis as described below.

  1. Mark text passages relating code properties or coding activities to quality attributes.

  2. Summarize each such passage by “assertions” of the following form:

     <quality indicator | activity> => <effect>; <artifact>

Detailed coding guidelines were described and then piloted independently by two of the co-authors. Two examples of such “assertions” and their corresponding text passages are shown in Fig. 2.

Fig. 2 Excerpts from the interview transcripts of interviewee CK2 (top) and DT2 (bottom) with marked passages and corresponding assertions to the right

All interviews were then coded independently by two co-authors each, and the coding step was discussed at a common meeting. After the meeting the original coders discussed their independent codings and merged them into an agreed-upon version that was used for the analysis described in Section 4.6.

After this validation step, all assertions were imported into a spreadsheet for further analysis. Composite assertions, i.e., rows listing several quality indicators/activities and/or effects, were broken down into multiple rows such that each combination of quality indicator/activity and effect was placed on a separate row, i.e., a1, a2 => e1, e2 was replaced by a1 => e1, a1 => e2, a2 => e1, a2 => e2. Unclear or missing effects (e.g., “?” or “–”) were replaced by the generic effect “higher (lower) quality,” since this was the agreed-upon default for effects that were not made sufficiently specific by an interviewee. Ambiguities and questions regarding assertions were resolved with the help of the original interviewer.
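
The expansion rule and the handling of unclear effects can be summarized in a few lines of code. The sketch below is an illustration with hypothetical input, not the spreadsheet processing actually used in the study.

```python
# Minimal sketch of expanding composite assertions (hypothetical helper).
from itertools import product

DEFAULT_EFFECT = "higher (lower) quality"

def expand_assertion(assertion: str) -> list[tuple[str, str]]:
    """Expand 'a1, a2 => e1, e2' into one (source, effect) pair per combination."""
    sources_part, effects_part = assertion.split("=>")
    sources = [s.strip() for s in sources_part.split(",") if s.strip()]
    effects = [e.strip() for e in effects_part.split(",")]
    # Unclear or missing effects ("?" or "-") fall back to the agreed-upon default.
    effects = [DEFAULT_EFFECT if e in ("", "?", "-", "–") else e for e in effects]
    return list(product(sources, effects))

print(expand_assertion("a1, a2 => e1, e2"))
# [('a1', 'e1'), ('a1', 'e2'), ('a2', 'e1'), ('a2', 'e2')]
print(expand_assertion("commented-out code => ?"))
# [('commented-out code', 'higher (lower) quality')]
```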

The quality indicators/activities (sources) and effects (targets) of all assertions were then categorized according to Table 3 by the first author and validated as follows:

  1. Four interviews (of 34) were selected at random and categorized independently by four co-authors such that each interview’s quality indicators/activities and effects were independently categorized by three co-authors each (including the first author). These four interviews corresponded to 86 of the 721 assertions (12%). Inter-rater agreement was moderate for the source categories (Krippendorff’s alpha 0.434) and almost perfect for the target categories (Krippendorff’s alpha 0.848).

  2. The independent categorizations were then compiled into one document.

  3. Inconsistencies were discussed and resolved by “majority vote” if necessary.

  4. The first author revised the coding of the remaining interviews accordingly.

When coding Q4, the wording of an assertion was kept close to the wording of the interviewee, as shown in the examples in Fig. 2. As described above, the assertions were then categorized according to Table 3 to facilitate further analysis. This categorization could lead to assertions where source and target belong to the same category. An interview statement like “concise code leads to better readability” would, for example, be coded as concise code => readability and then categorized as readability => readability since both conciseness and readability belong to the category Readability.
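
The following sketch illustrates this lifting from label level to category level; the label-to-category mapping shown is a small hypothetical excerpt and does not reproduce Table 3.

```python
# Minimal sketch of lifting coded assertions from labels to categories
# (the mapping below is a hypothetical excerpt, not the full Table 3).
LABEL_TO_CATEGORY = {
    "concise code": "Readability",
    "readability": "Readability",
    "comments": "Documentation",
    "comprehensibility": "Comprehensibility",
}

def categorize(source_label: str, target_label: str) -> tuple[str, str]:
    # Labels not covered by the excerpt fall back to a catch-all category here.
    return (LABEL_TO_CATEGORY.get(source_label, "Miscellaneous"),
            LABEL_TO_CATEGORY.get(target_label, "Miscellaneous"))

# "concise code leads to better readability" was coded as
# concise code => readability and is categorized as Readability => Readability:
print(categorize("concise code", "readability"))
```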

3.6 Threats to Validity

Internal validity is concerned with the study design, in particular whether the constructs used to answer the research questions are suitable and sufficient to answer the questions.

Two of the co-authors developed a thorough study guide including information about the sampling and recruitment of participants, a detailed interview script with lists of probing questions, as well as guidelines for transcribing the recorded interviews. This guide was discussed with the remaining authors to make sure that the questions are interpreted uniformly. Templates for the consent form and an information sheet for recruiting interviewees were also prepared.

The interview script contained closed and open questions. Question Q4 was intentionally left open and the interview guide encouraged interviewers to let the interviewees talk freely. Since interviewees brought code from their personal experience to the interview, we made sure that they could relate their personal perceptions about code quality to actual code examples. Respondents’ discussions might have been different had they brought and talked about other examples in other programming languages. However, we explicitly asked them “to bring along example code from your own personal experience” and to share “personal opinions regarding source code quality and to point out pieces of code that you consider have high or low quality.” Since the study is exploratory with open questions, we do not consider this a threat to validity.

Interviewees without professional programming experience (students) were mostly self-selected, whereas study participants with professional programming experience were directly approached by the researchers. We targeted 3–5 interviews per interviewer and coordinated the recruitment of interviewees via a shared spreadsheet. Besides this coordination, the individual interviewers applied convenience sampling. Since the interviewers come from five different countries and each interviewer carried out only 3–5 interviews, we consider the risk for a common selection bias low.

During the coding of the transcripts, chances are that data is misinterpreted or coded wrongly. To mitigate this threat, we developed and piloted detailed coding guidelines (as described in Section 3.5). Question Q4 was coded in a two-step process. Step one was carried out independently by two co-authors and then validated. Step two was carried out by one author and then cross-validated independently by three co-authors with high inter-rater agreement. Question Q7 was coded independently by four co-authors with high inter-rater agreement. We therefore consider the coding reliable. Details about the coding and validation processes can be found in Section 3.5.

External validity is concerned with the generalizability of the results.

Since the sample size in this study is small and interviewees were limited to Europe and the U.S., we cannot generalize the results to software developers globally. Also, since the authors are primarily from academia, even most interviewees from professional software development have some relationship to academia. So, developers without any academic background might be underrepresented. However, within these limits we achieved a good spread of interviewees in terms of relevant experience. Our study design aimed for “theoretical saturation,” i.e., a sample size where adding more interviews is unlikely to uncover further concepts or aspects. In a review on sample sizes for saturation in qualitative research, Hennink and Kaiser (2022) conclude that “saturation can be achieved in a narrow range of interviews (9–17).” With 34 interviews in our study, we believe that the risk for not having reached saturation is low. We have not validated, though, whether we reached saturation.

There is also a risk that the examples the interviewees brought along and discussed during the interviews are not representative of what developers actually care about in practice. To mitigate that risk, we used a standardized “interviewee information sheet” to ensure that all interviewees receive the same information (see the study design package in Appendix B.1). As discussed in Section 4.2 (Example Demographics), the interviewees brought many and varied examples. We have no indications of interviewees discussing topics that did not matter to them. We therefore consider this risk low.

4 Results and Analysis

4.1 Interviewee Demographics

In total, we analyzed data from 34 semi-structured interviews with students, educators, and professional software developers from 6 countries. Of the 34 interviewees, 22 (65%) have experience as a professional software developer (9.6 years on average, median 7.3 years) and 11 (32%) have software quality assurance as one of their responsibilities. The developers with software quality assurance as one of their responsibilities have 13.8 years (median 11 years) of professional experience on average. All students had completed at least 5 courses with significant programming components (8.9 on average). A summary of the basic demographics of the participants can be found in Table 1.

For our analysis, we categorized the interviewees according to their experience as professional software developers (see column Cat-Exp in Table 1): interviewees without experience as professional software developers (N), interviewees with some experience (S), i.e., less than the median years of experience (7.3 years), and interviewees with extensive professional experience (E), i.e., more than the median years of experience.

4.2 Example Demographics

In total, 31 of the 34 interviewees brought along 130 code examples, ranging from short code snippets on paper, through links to guideline documents (containing example code snippets), to actual running code (from single files to whole non-trivial projects). The remaining three did not bring code examples but referred to specific guidelines from a popular trade-book on code quality (Martin 2008). Fifteen interviewees discussed (among others) code from professional code bases, i.e., code that was written by professional software developers or code available from open source code bases. Twenty interviewees discussed bad as well as good examples. The examples were written in 14 different programming languages and 12 interviewees discussed examples from at least two languages, see Table 2.

Table 1 Basic demographics of the participants in the study

4.3 Coding Scheme

In our previous work, we analyzed questions Q5 and Q8 by using codes based on key terms used in the answers and grouping them into related categories. These labels and categories were used as a starting point for the thematic coding of questions Q4 and Q7. While Q5 and Q8 focus on properties of code, Q7 asks for recommendations to improve code quality. The answers to Q7 therefore also refer to activities and not just specific code properties. The activities were eventually categorized as Analysis and Generic, respectively, as shown in the lower part of Table 3. The discussions around the open question Q4 also provided more varied answers than the answers to the more specific questions Q5 and Q8. This led to many extensions but also a few changes to our original coding scheme. We therefore also re-analyzed Q5 (see Section 4.4).

Table 2 Overview over the programming languages of the example programs discussed by interviewees (some interviewees discussed examples in several languages)
Table 3 Categories and labels used for coding

Some labels in category Miscellaneous are not properties of the code itself. However, since they have been mentioned in this context by the interviewees, we decided to keep them for reference. Similarly, labels in the Non-specific category are too general to gain additional insight. As before, we still kept them to provide a comprehensive picture of the responses. The total numbers in both categories are so small that they do not affect our results.

4.4 Definitions of Code Quality (RQ1.1)

In question Q5, we asked interviewees (in writing) about their definition of code quality.

The answers ranged from short lists of characteristics to more elaborate answers. The shortest answer was 8 words and the longest 79 words with a median of 22 words. We categorized the answers into 116 references to defining properties in total. On average, the answers referred to 3.4 defining properties (1–6; median: 3).

Readability and structure were the most commonly used defining properties for code quality across our sample (see Fig. 3). Of the 34 developers, 28 (82%) referred to readability and 22 (65%) to structure. All other properties were referred to by 27%–38% of our sample developers. When looking at subgroups of developers based on gender, level of experience as professional developer, or experience with responsibility for quality assurance, the rankings of properties are very similar (see Fig. 4). A notable difference is that experienced developers and those with QA responsibility rank correctness higher than the other groups. Readability and structure are consistently ranked as the topmost and second most important properties, respectively, by all subgroups.

The Spearman rank correlations between groups are insignificant, except between the groups without (N) and with some (S) professional experience in software development, where we found a high positive relationship (\(r_s\)=0.74, p=0.022).
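
For reference, correlations of this kind can be computed as sketched below with SciPy; the rankings used are made up and do not reproduce the study’s data.

```python
# Minimal sketch of a Spearman rank correlation between two subgroup rankings
# over the same set of quality categories (hypothetical ranks).
from scipy.stats import spearmanr

ranks_group_n = [1, 2, 3, 4, 5, 6, 7, 8]   # ranking of categories by group N
ranks_group_s = [1, 2, 4, 3, 6, 5, 7, 8]   # ranking of the same categories by group S

rho, p_value = spearmanr(ranks_group_n, ranks_group_s)
print(f"r_s = {rho:.2f}, p = {p_value:.3f}")
```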

4.5 Properties Targeted by Quality Improvements (RQ1.2)

In question Q7, interviewees were asked to provide suggestions for improving code quality.

Besides their top-3 recommendations, interviewees could also provide “further important recommendations”. Interviewees provided 116 recommendations in total (2–4; average: 3.4).

When looking at the code properties respondents focused on within their recommendations for increasing code quality (Q7), we can see that structure and readability are again most commonly referred to (see Fig. 5). Due to the addition of the categories generic and analysis, the answers to Q7 were distributed over more categories than the answers to Q5, so the percentages of the “old” categories generally decreased. Only category documentation increased its percentage. The group without experience from professional software development (N) accounted for most of the increase in category documentation.

Fig. 3 Categories of terms used to define code quality (Q5). N=116, multiple mentions were possible

Fig. 4 Ranking of categories for defining properties of code quality (Q5) by groups of developers based on gender (Female, Male), professional experience (No, Some, Extensive) and responsibility for quality assurance (Yes, No)

When looking at subgroups of developers based on gender, level of professional experience, or experience with responsibility for quality assurance, the rankings of properties show few differences (see Fig. 6). Most notable is that developers without professional experience in software development (N) rank correctness higher than the other groups (S and E). Structure and readability are ranked among the top-3 most important properties by all subgroups. The Spearman rank correlations between groups are insignificant, except between the groups without (N) and with extensive (E) professional experience, which have a very high positive correlation (\(r_s\)=0.92, p<0.0001).

Fig. 5 Categories of terms used for suggestions to improve code quality (Q7). N=116, multiple mentions were possible. (Note: Having N=116 for Q5 and for Q7 is a coincidence)

Fig. 6 Ranking of categories for suggestions to improve code quality (Q7) by groups of developers based on gender (Female, Male), professional experience (No, Some, Extensive) and responsibility for quality assurance (No, Yes)

We also grouped the suggestions for quality improvement into analytical and constructive approaches as described in Section 3.5, see Fig. 7. Suggestions that could not be categorized as either analytical or constructive were categorized as Other. Of the 116 recommendations, 69% are constructive, 21.6% analytical and the remaining 8.6% other. The results show a significant difference between developers with and without responsibility for quality assurance (\({\chi }^2=11.76, p=.002792\)), indicating that developers with responsibility for quality assurance have a stronger focus on analytical approaches.
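
A test of this kind can be run as sketched below with SciPy; the contingency table uses invented counts, not the study’s actual data.

```python
# Minimal sketch of the chi-square test on recommendation types (hypothetical counts).
from scipy.stats import chi2_contingency

# Rows: developers without / with QA responsibility
# Columns: analytical / constructive recommendations
contingency = [[10, 60],
               [15, 20]]

chi2, p, dof, expected = chi2_contingency(contingency)
print(f"chi^2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```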

Fig. 7 Suggestions for type of quality improvement approach (Q7, analytical (bottom of column), constructive (middle) and other (top)) by groups of developers based on (from left to right) gender (Female, Male), professional experience (No, Some, Extensive) and responsibility for quality assurance (No, Yes)

4.6 Properties Focused on in Code Examples (RQ2)

In interview question Q4, interviewees discussed their personal experiences using code they were asked to bring to the interview. This provided interviewees with opportunities for discussing aspects of code quality in detail from a personal point of view.

During the coding phase, the 34 interviews were summarized into 721 “assertions” as described in Section 3.5. The overall results of this step are summarized in the heat map in Fig. 8, where the number in a cell (row, column) corresponds to the number of assertions row => column. For example, the relationship Documentation => Comprehensibility appeared 61 times, i.e., 8.4% of all relationships.

Fig. 8 Heat map for all 721 “assertions” about quality relationships in the 34 interviews (Q4). The number in cell (row, column) corresponds to the number of assertions row => column. E.g., assertion documentation => comprehensibility occurs 61 times, whereas comprehensibility => documentation occurs 0 times

Since the number of assertions varies substantially from interview to interview (min: 5, max: 54, avg: 21.2, median: 19), we normalized the results for each interview by dividing the number of occurrences of an assertion by the number of total assertions for the particular interview. As can be seen from Fig. 9, the normalized results are very similar to the non-normalized results in Fig. 8. The top-3- and the bottom-3-ranked categories are identical in both figures (see row/column RANK).
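
The normalization amounts to weighting each assertion by the reciprocal of its interview’s assertion count before aggregating. The sketch below shows one way to do this with pandas; the data frame contents are invented for illustration.

```python
# Minimal sketch of per-interview normalization of assertion counts (hypothetical data).
import pandas as pd

# One row per categorized assertion: interview id, source category, target category.
assertions = pd.DataFrame({
    "interview": ["CK2", "CK2", "CK2", "DT2", "DT2"],
    "source":    ["Documentation", "Structure", "Documentation",
                  "Readability", "Structure"],
    "target":    ["Comprehensibility", "Maintainability", "Comprehensibility",
                  "Maintainability", "Comprehensibility"],
})

# Weight each assertion by 1 / (number of assertions in its interview) ...
weights = 1 / assertions.groupby("interview")["source"].transform("size")

# ... and sum the weights per (source, target) cell of the heat map.
heatmap = (assertions.assign(weight=weights)
           .pivot_table(index="source", columns="target",
                        values="weight", aggfunc="sum", fill_value=0))
print(heatmap)
```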

Fig. 9 Heat map for “assertions” about quality relationships after normalization (Q4)

Fig. 10 Graphical overview of top-ranked categories for all interviewees (ALL), and by gender (MALE, n=28; FEMALE, n=6) for source- and target-properties and their relationships (Q4)

Figures 10, 11 and 12 show graphical overviews of the categories of the most frequently named properties in interviewees’ assertions (normalized). The graphs show that structure, comprehensibility, readability, documentation and maintainability are consistently among the top-3 for all subgroups. The normalized heat-maps that are the basis for Figs. 10–12 can be found in Appendix C (Figs. 13, 14 and 15).

The sources of the assertions for both gender subgroups look very similar. However, the female subgroup focuses more on maintainability and structure as targets of the assertions than the male group and less on comprehensibility and readability. Since our sample only comprised six females, it is difficult to draw general conclusions, though.

Regarding professional experience, we can see that the top-3 source properties are the same but decrease in weight successively with experience. For the subgroup with no experience, the top-3 categories account for 82% of the targets of all assertions, whereas they account for only 70.5% and 57.9%, respectively, for the subgroups with some and extensive experience. This indicates that more experienced developers have a broader picture of code quality. An interesting observation is that the subgroup with extensive experience ranks readability higher than comprehensibility. In this subgroup, we can also find a clear relationship between readability and maintainability, whereas we identified only a single assertion readability => maintainability in the other subgroups. Another observation is that the assertion documentation => comprehensibility occurs much less frequently for developers with extensive experience. For this subgroup, documentation => comprehensibility accounts for 6% of all assertions, whereas it accounts for 10.6%–11.5% for the other experience subgroups and is the most frequent assertion for both of these groups (ignoring assertions where source and target belong to the same category).

A similar pattern can be observed between the subgroups with and without responsibility for quality assurance (see Fig. 12). Since 8 of the 11 interviewees with QA responsibility are also in the subgroup with extensive experience (see Table 1), this observation could be related to overall experience.

Fig. 11 Graphical overview of top-ranked categories grouped by interviewee experience (NO, n=11; SOME, n=11; EXTENSIVE, n=12) (Q4)

Fig. 12 Graphical overview of top-ranked categories grouped by responsibility for quality assurance (NO QA, n=23; QA, n=11) (Q4)

When looking at individual labels (see Table 3), we found that 11.1% (80 of 721) of all labels for quality indicators dealt with comments/commenting, 7.9% (57 of 721) with naming and 6.8% (49 of 721) with formatting, indentation and whitespace. All other labels amounted to at most 4.3%.

4.7 “Abstract” Perceptions of Code Quality Versus “Concrete” Manifestations in Code (RQ3)

In RQ3, we investigated in which ways developers’ perceptions and discourses differ (or not) when talking about code quality in general (i.e., in an “abstract” way) and when discussing the quality of specific code examples and the trade-offs that might manifest in those. To answer RQ3, we compared the results regarding code quality categories for RQ1 (based on the interviewees’ abstract views; Q5–Q8) with the results for RQ2 (based on the specific code examples provided by the interviewees; Q4). The results are not directly comparable as the answers to RQ1 are based on frequencies of occurrences of code quality aspects, whereas the answers to RQ2 also consider relationships between aspects of quality. Still, the ranks of the categories in each of the results can be used as importance indicators for the code quality categories. Thus, a qualitative comparison of the results is possible.

When defining code quality, all groups of developers consistently use readability and structure as the first and second most frequently named property (see RQ1.1, Section 4.4). Interestingly, developers with extensive experience and those with QA responsibilities rated correctness much higher than all other groups while they considered documentation much less relevant. The suggestions for quality improvements overall are less consistent but focus mainly on structure, readability and documentation (see RQ1.2, Section 4.5). Again, we can see that less experienced developers and those without QA responsibilities consider documentation more important than developers with extensive experience.

When discussing the specific code examples the interviewees brought along, they focused on structure, readability and documentation, primarily as sources in the assertions. However, the discussions showed that these properties were not seen as the ultimate quality goals but rather as properties that lead to comprehensibility and maintainability (as well as readability and structure, see Figs. 10–12). Interestingly, not a single developer without professional experience mentioned comprehensibility as a source in an assertion (see Fig. 14, at top, first row), whereas it ranks fourth overall (see Fig. 9). Rather, they considered it a consequence of other properties (see Fig. 14, at top, first column).

Comparing these results to answer RQ3, there are quality categories that are considered important on both the abstract and the concrete level, such as structure, readability and documentation. However, the developers with extensive experience and those with QA responsibility ranked documentation much lower on the abstract level than they did on the concrete examples.

Due to their high experience, they might require less documentation and therefore consider it less important in the abstract view than less experienced developers.

Less experienced developers, on the other hand, did not mention correctness as often on the abstract level as they did on the concrete examples. The code examples might have helped them recognize the importance of code correctness, whereas experienced developers immediately considered this an important aspect.

The previous findings are also reflected in quotations from the interviews, e.g., reasoning about readability:

“Readability! Team projects! Basically, is the only reason. If you are writing code for yourself doing something only you need and you can do it in a day and you don’t need to return to the code, which is usually [what you] tell to yourself, but you do need to return.” (Interviewee RD1)

“You can often further simplify code, optimize, and that can sometimes go on until you reach a point where it compiles undoubtedly very quickly, but where, as a human being, you cannot read it easily anymore. So I think it is important that it still is a bit readable.” (Interviewee JJ3)

Interviewee HK2, for example, explicitly emphasized readability over other quality attributes when comparing code:

“I think I would go for ... because it’s such a simple piece of code, with such low cyclomatic complexity, so few different paths, that I think readability is more important in this case than robustness and testability.” (Interviewee HK2)

Several interviewees use terms like simplicity or elegance to describe comprehensibility:

“Describing things as plain, as simple, as well structured as possible. Focus on the real content as far as possible. ... Understandability doesn’t mean that is necessarily terse, but it means, you look at it, and you understand very quickly what it is doing.” (Interviewee HS3)

“There are many things that can be done with very elegant one-liners, that are still understandable.” (Interviewee RD4)

5 Discussion

The most important and obvious results from the RQs examined above are that developers strive for comprehensible and maintainable code. These qualities are achieved by structured, readable and documented code, especially when looking at specific code examples. These findings are in line with some of the results in the related work (see Section 2). This means that professionally developed code should be well structured, readable and appropriately documented. These aspects should be emphasized in programming and software education to prepare students for their professional careers.

Regarding what appropriately documented code means, responses were diverse. Although comments/commenting was the most frequent label for quality indicators, interviewees did not generally call for more documentation. On the one hand, comments are seen as important, e.g.,

“The other thing is comments. They are important. I find that I actually write code better if I am commenting it, so if I were writing a SQL statement for someone else that I have explain to, and they need to understand what I am doing, and I actually write out the comments for it, then I actually write the code better.” (Interviewee SH1)

On the other hand, they can also impede comprehension. For simple code

“...you actually get [a] better idea without reading the comments because when you read a comment it breaks your idea of what is happening actually in the code.” (Interviewee RD2)

Some interviewees even called for self-documenting code by using appropriate method names instead of comments:

“And instead of adding comments here... I would consider it more reasonable to extract methods here that get meaningful names, so that you could read the code based on the method names...” (Interviewee CK2)

Finally, commented-out code is described as an issue by several interviewees, e.g.,

“It makes the whole a lot less clear, also there are all these comments, which distracts. ... it is all commented out code. It is just code from which they thought they would later reuse it, but never removed it. I think this creates quite a lot of confusion.” (Interviewee JA3)

Summarizing the findings on documentation of code, we can say that while documented code is considered important, that may not necessarily mean that adding more comments is always good. Better comprehensibility might also be achieved by meaningful naming and adequate structuring.

This corroborates the results of Plösch et al.’s (2014) survey on software documentation, which showed that people working in software development projects consider accuracy, clarity, readability, structuredness, and understandability as especially important for software documentation.

The issue with documentation is complemented by the results across the different experience groups. While developers without professional experience rated documentation rather high to improve code quality, experienced and QA-responsible developers rated it much lower. A possible conclusion might be that while explicit documentation seems to improve aspects like comprehensibility and maintainability initially, this effect may decrease with experience. The highly experienced group rated correctness of code more important for high quality. However, focusing only on correctness might make it more difficult to onboard less experienced developers.

Another aspect that emerged from the interviews concerns the level at which code quality is measured. While some interviewees were concerned with micro-level quality aspects such as lines of comments, variable naming or code formatting (among others), others focused more on higher-level quality aspects, like overall structure. These range from encapsulation and providing a specific, single functionality in one part of the code to the overall structure of a software project such as low coupling and decomposition.

“But, of course, you see if you look into it more deeply, that the structure is completely different...I basically do not have a real application layer there.” (Interviewee CK1)

We can conclude that code quality has to be assessed and maintained on several levels. Assuring high quality on one level does not guarantee high quality on others, e.g., nicely formatted and named code which is correct may still be arranged inappropriately within the whole software system. Similarly, a nicely decomposed and layered software architecture does not assure comprehensible and maintainable code within a single unit. Thus code quality is a complex and diverse property to achieve for a software project.

Another interesting observation is that testability of code was not rated as high as one might have expected. More experienced developers and those with QA responsibility mentioned it slightly more than others. Yet, it still remains in the middle to lower ranges in all groups. This also holds true when looking at the abstract perception level.

Miguel et al. (2014) reviewed common software quality models and provide a list of 50+ terms used in those models to describe qualities together with their definitions. This list only captures maintainability with a definition that is in line with our results. Furthermore, comprehensibility is captured as “appropriateness recognizability” (with the synonym “understandability”). Structure, readability and documentation are not covered. One reason might be that common software quality standards/models target software systems as products rather than the code these systems comprise. This may make it difficult for developers to relate their work to those standards in practice (Al-Kilidar et al. 2005).

6 Summary and Conclusions

In this study, we investigated how software developers perceive or understand source code quality. Overall, we analyzed and discussed data from 34 semi-structured interviews with software developers from Europe and the US. To facilitate the discussion of first-hand experiences of code quality, study participants brought along their own code examples.

Perceptions of code quality were grouped into eleven categories or themes: readability, structure, comprehensibility, maintainability, correctness, documentation, dynamic behavior, testability, analysis, generic and miscellaneous. Since there are no empirically based categories for (developers’ perceptions of) code quality, our categories could be a good starting point for other researchers who want to study perceptions of code quality.

Our results show that developers consistently focus on structure, readability and documentation as important qualities of code to make it comprehensible and maintainable for other developers. In doing so, developers establish themselves as a group of stakeholders that might need to be considered more thoroughly in common software quality models or standards. Common quality models or standards focus on “product quality” and “quality in use” (e.g., ISO/IEC 25010 (ISO/IEC 2011)), targeting the customers or end users of software. This issue has also been noted by a study group on the ISO/IEC 25000 family of software quality standards (also known as SQuaRE) (Nakajima 2019, p. 1): “Most of the measures in ISO/IEC 25023 are quality-in-use measures since the specified measures are about external properties at runtime. There are several coding standards such as MISRA, AUTOSAR, and CISQ, which provide the checklists or rules for code to entail quality measures. For SQuaRE to be considered a strong guide for measurement of software and systems product quality, it must improve how it guides for quality measurement of internal properties.”

More research is therefore necessary to (a) define indicators that capture code properties that matter for developers, to (b) develop software quality models that are relevant for developers and to (c) define metrics that reliably measure the quality attributes that are relevant for developers. Metrics for, e.g., readability (Buse and Weimer 2010) and comprehensibility (Scalabrino et al. 2019) have been defined and studied before. However, we see a need for further empirical research in professional software development contexts.