1 Introduction

Code comments are one of the primary sources supporting software maintenance. They are the most used documentary artifact after the code itself (de Souza et al. 2005). Commenting is generally recognized as a good practice in software engineering (Tenny 1988; Jiang and Hassan 2006), and developers usually write comments to explain a particular code snippet. Code comments can generally be categorized into two types: documentation comments and inline comments (Nielebock et al. 2019). Documentation comments are written above methods or classes to describe their functionality. They help developers easily understand how the method or class should be used, what the parameter types are, what the return type is, etc. Inline comments, on the other hand, are written inside a method or class body; thus, their scope is usually smaller than that of documentation comments. Inline comments can be written for a wide range of reasons, including explaining the code, documenting the reason behind a specific decision, etc.

Several studies in the literature analyzed the effect of code comments on program comprehension and software maintainability. Woodfield et al. (1981) conducted an experiment on 48 experienced developers by showing them different types of modularized code with and without comments. The follow-up 20-question quiz revealed that the subjects whose code contained comments were able to answer more questions correctly. In agreement with this finding, 66% of the software practitioners who responded to our survey in this study agreed that comments help them understand the code better. Similar to the findings of the studies conducted by Ko et al. (2006) and LaToza et al. (2006), our survey results (see Section 4) reveal that developers spend as much time reading code as writing it. Thus, having high-quality comments is crucial for software as they improve the legibility, readability, and understandability of code. In their case study on a large software system, Hartzman and Austin (1993) concluded that comments are a key factor in maintaining software in the long term. In their empirical study on 625 GitHub repositories, Misra et al. (2020) found that a lack of relevant comments in a project leads to a higher average issue resolution time.

Since code comments contain abundant information about the software, researchers have also incorporated them into various tools for solving software engineering tasks, such as automated testing (Goffi et al. 2016; Wong et al. 2015), specification inference (Blasi et al. 2018; Pandita et al. 2012; Zhong et al. 2009), and code synthesis (Allamanis et al. 2015; Gvero and Kuncak 2015; Zhai et al. 2016). Having high-quality comments may improve the performance of such tools.

Because compilers ignore comments and developers can write them in various ways, researchers have observed quality issues (smells) in comments (Louis et al. 2018; Khamis et al. 2010; Tan et al. 2007; Misra et al. 2020; Rani et al. 2021; Wang et al. 2019), such as being redundant or inconsistent with the code. Our study focuses on the analysis of inline code comment smells. In this context, a smell refers to a comment that can degrade software quality or that does not actually help readers much in terms of code comprehension. Although there have been studies on various categorizations of code comments and quality issues in them (discussed in Section 2), to the best of our knowledge, there is no previous work on the taxonomy and thorough analysis of inline code comment smells. This study extends our prior work on the analysis of inline code comment smells (Jabrayilzade et al. 2021), where we defined the following research questions:

  • RQ1. What types of inline comment smells are there?

  • RQ2. How often does each smell type occur in practice?

To answer RQ1, we conducted a multivocal literature review (MLR) on the academic (white) and gray literature and collected a set of inline comment smells discussed by developers. We proposed a taxonomy of 11 inline comment smells. To address RQ2, we manually categorized a subset of inline comments from three open-source Java projects and found that smells exist in the projects to varying degrees.

In this study, in order to make the results of RQ2 in the previous study (Jabrayilzade et al. 2021) more generalizable, we added one more Java project and four Python open-source projects to our analysis, increasing the total number of projects to eight. We found that the smells exist in Python projects as well. Moreover, to gather practitioners’ views on our defined taxonomy of inline comment smells as well as their impact on code comprehension and software maintainability, we added the following research question to this study:

  • RQ3. What is the perception of software practitioners about inline code comment smells?

To answer this research question, we conducted an online survey with 41 software practitioners. The respondents generally agreed with our taxonomy; however, for some smells, respondents indicated that they might not be considered smells in specific situations.

Finally, to investigate the perception of open-source developers of inline comment smells and whether or not they find removing them worthwhile, we included a new research question in our study:

  • RQ4. Do developers of open-source projects accept pull requests or issues to remove inline comment smells?

To answer this question, we submitted 53 pull requests and/or issues to the projects we labeled and obtained an acceptance rate of 27%. Although removing inline comment smells is intended to benefit the code base, open-source project developers tend to reject pull requests that do not directly fix bugs and resist altering functional source code.

The extensions in this study over our previous work (Jabrayilzade et al. 2021) are as follows:

  • Updated the methodology and scope of MLR results.

  • Enhanced our evaluation by adding five new projects.

  • Added a new research question (RQ3) to gather practitioners’ views on our taxonomy of inline comment smells.

  • Added a new research question (RQ4) to investigate the developers’ acceptance of inline comment smells in eight OSS projects.

The main contributions of the study can be summarized as follows:

  • Proposed a novel taxonomy of inline code comment smells based on a multivocal literature review and manual analysis of sample Java and Python projects.

  • Manually analyzed four Java and four Python projects to determine how commonly the smells occur in open-source projects.

  • Published the labeled dataset of inline comments consisting of 2447 comments based on the manual analysis process on the selected projects.

  • Gathered practitioners’ opinions about the inline comment smells through an online survey.

  • Investigated developers’ opinions about removing the inline comment smells from open-source projects with pull request submissions.

The rest of the paper is organized as follows. Section 2 presents the prior work on the analysis of code comments from different perspectives. Section 3 details our methodology for answering the research questions. Section 4 presents the results of the multivocal literature review, manual analysis process, and the practitioners’ survey. Section 5 revisits the research questions, compares the results of the manual analysis process and survey responses, and provides implications for researchers, practitioners, and educators. Section 6 discusses the threats to the validity of the study. Finally, Section 7 concludes the work and presents the future directions of the study.

2 Related Work

We discuss the related work under three categories: general comment quality analysis, taxonomies of comments, and studies focusing on the detection of a single type of code comment smell.

2.1 Quality Analysis of Code Comments

In this section, we discuss prior work on analyzing the quality of code comments from different perspectives.

Khamis et al. (2010) developed the tool JavadocMiner, which can analyze the quality of Javadoc comments. The tool uses various heuristic metrics to assess the quality of the language used in comments and the consistency between source code and comments. The authors evaluated the tool on two Java projects (ArgoUML and Eclipse) by correlating the number of bugs in the projects’ modules with the metric scores. They report that the modules with the highest quality had the lowest number of reported defects. Similarly, Tan et al. (2007) explored the feasibility and benefits of automatic analysis of comments to detect bugs and bad comments in source code. Their heuristic analysis of lock-related comments in the Linux kernel detected 12 bugs, two of which were confirmed by the kernel developers. The authors also examined various open-source bug datasets and found that bad or inconsistent comments introduced various types of bugs.

Rani et al. (2021) conducted a study on whether developers follow commenting guidelines in open-source projects. Based on a manual analysis of 700 class comments from 13 Java and Python projects, they concluded that, in both languages, developers follow writing-style (e.g., grammar, punctuation) and content-related comment guidelines more often than syntax-related (denoting a specific type of comment) and structure-related (e.g., subsections in comments) guidelines.

Misra et al. (2020) explored the correlation between code comments and issues in open-source Python projects from GitHub. They first extracted 20,000 comments from 625 repositories and then semi-automatically classified 600 of those comments into relevant and auxiliary comments. Relevant comments included header, method, inline, task, and section comments. On the other hand, auxiliary comments consisted of comments such as copyright, license, or information about authors. The authors also extracted the closed and open issues from the repositories as well as calculated the average time taken (in days) to resolve the issues. Statistical analysis showed that there exists a correlation between the number of relevant comments and the number of closed issues in a repository. Moreover, repositories with a higher number of relevant comments had a lower issue resolution time.

2.2 Taxonomies of Code Comments

This section describes studies on the categorization of code comments.

Steidl et al. (2013) used machine learning techniques on Java and C/C++ projects to automatically categorize comments into seven different categories, namely copyright, header, member, inline, section, code, and task comments. Their classification algorithms were nearly 95% precise. Then they assessed four quality criteria, namely, coherence, usefulness, completeness, and consistency, to determine the quality of the comments. They first proposed two metrics for the coherence of comments and then evaluated both metrics by a survey of experienced developers.

Haouari et al. (2011) aimed to categorize comments by their position and frequency in the code with an automated approach. Then, they asked 49 programmer subjects to classify the comments by their content and relevance. The subjects classified comments by their object, type, style, and quality. The study did not emphasize code comment smells and only included a quality metric indicating whether a comment can increase code comprehension.

Pascarella and Bacchelli (2017) created a taxonomy of Java comments consisting of 6 top and 16 inner categories. Then, they manually analyzed six Java projects to determine how often the categories occur in practice. They found that metadata and purpose comments were the most prominent ones. They trained supervised machine learning models for the automatic classification of comment types and achieved an average true positive rate of 0.95 for top-level categories and 0.85 for inner categories. However, this study only categorized comments according to their purposes and did not include a qualitative analysis.

Zhang et al. (2018) investigated seven Python open-source projects and created a taxonomy for code comments consisting of 11 categories. The authors manually labeled 330 comments randomly sampled from the selected projects to create a dataset of Python comments. The usage type was the predominant category, covering almost 80% of the labeled comments. They also applied machine learning classifiers to the dataset for classifying the code comments automatically. Their results showed that the Decision Tree classifier was superior to the Naïve Bayes classifier in terms of accuracy (87% compared to 81%) and runtime efficiency.

Shinyama et al. (2018) focused on identifying and classifying local comments, which led to 11 distinct categories for code comments. With a manual annotation experiment, they verified the relevance of their comment categories and found that preconditional and postconditional comments were dominant in their selected projects. They also used a decision-tree-based classifier to automatically analyze the code comments, which achieved 60% precision and 80% recall.

Khan et al. (2021) focused on the classification and automatic detection of API documentation smells. First, they created five smell categories by consulting the literature, and then they built a benchmark of 1000 API documentation units, at least 778 of which belong to one of the smell categories. A survey with 21 professional software developers validated the catalog of API documentation smells, with 95% of them reporting that those smells negatively impact their productivity. Lastly, they used machine learning models to detect the five smells automatically, where the BERT (Devlin et al. 2019) classifier had the highest F1 scores.

Wen et al. (2019) conducted a study on code-comment inconsistencies to investigate how different code changes trigger comment updates and what types of code-comment inconsistencies are fixed by developers. The authors manually labeled 500 commits from Java projects and created a taxonomy composed of six root categories related to code-comment inconsistencies: Application Logic, Code Design/Quality, Maintenance, Formatting/Readability, Copyright/License, and Others.

Padioleau et al. (2009) manually labeled 1050 comments from three operating system projects (Linux, FreeBSD, OpenSolaris) and analyzed the comments along four dimensions: what is inside the comment, and for whom, where, and when it is written. They created a taxonomy for each dimension.

Zhai et al. (2020) studied automatic propagation and association of comments to code entities using program analysis. To effectively apply propagation, they created a taxonomy of comments under four entities, namely, class, method, statement, and variable. For each of these entities, they analyzed the comments from five perspectives: what, why, how-it-is-done, property (e.g., pre/post-conditions), and how-to-use. The authors also experimented with various machine learning algorithms for automatically classifying the comments into the defined categories, where random forest and decision tree algorithms had the best performance.

Table 1 Prior work on the taxonomy of code comments

Table 1 summarizes the categories defined for code comments in the studies discussed above. The studies conducted by Pascarella and Bacchelli (2017), Zhang et al. (2018), and Khan et al. (2021) are the most relevant ones to our study. The difference between our study and the former two is that we focus on classifying comment smells, i.e., bad practices in commenting, whereas they focus on the general classification of comments. The latter study focuses on documentation smells, in contrast to the inline code comment smells that we investigate in this study.

2.3 Detection of Code Comment Smells

Although there is no previous study that considers inline code comment smells from a general perspective, there are various studies on analyzing and detecting smelly comments.

Louis et al. (2018) developed a deep learning model based on a long short-term memory network (Hochreiter and Schmidhuber 1997) to detect redundant (restating the code) method-level comments. The model was trained in an unsupervised manner by measuring the predictability of the given comment sentence under the language model. In an experiment on Java comments, the authors found that Javadoc tags had a predictability score twice as high as other comments.

Blasi et al. (2021) created a tool called RepliComment for identifying comment clones in code. The tool can report instances of copy-and-paste errors in comments as well as poorly written comments. The tool achieved 79% precision in finding critical clones according to the manual analysis done by the authors on 412 issues reported by the tool from various Java projects.

There are also studies on detecting which parts of the code need comments. Louis et al. (2020) developed a deep learning-based model that achieved 74% precision in finding comment-worthy locations in code. The authors created the evaluation dataset by extracting code that was already commented. However, the approach assumed that the existing comments had high quality, which may bias the trained model in the case of smelly comments. A similar study was conducted by Huang et al. (2019), where they applied machine learning techniques to code context-related features. The effectiveness of the approach was evaluated via a survey of practitioners. The overlap rate between the comment locations predicted by the model and those indicated by the practitioners was 0.72.

There have been multiple studies on identifying code-comment inconsistencies (misleading comments) in the literature. The models/tools either predict such comments statically by analyzing the relations between code and the corresponding comment (Sridhara 2016; Rabbi and Siddik 2020; Iammarino et al. 2020; Corazza et al. 2018; Wang et al. 2019; Ratol and Robillard 2017) or during a change (i.e., commit) (Panthaplackel et al. 2021; Liu et al. 2020, 2021; Stulova et al. 2020), which can also take change-related features into account.

3 Methodology

In this section, we describe the mixed-methods research design for answering the research questions. Specifically, we present the MLR stages and elaborate on the quantitative analysis performed on sample projects.

3.1 RQ1. What Types Of Inline Comment Smells Are There?

To answer RQ1, we first reviewed two books written by expert software practitioners, Clean Code by Martin (2009) and The Art of Readable Code by Boswell and Foucher (2011), to get an initial idea about the bad practices developers follow when writing code comments. We encountered various types of comments that affect the readability and maintainability of source code. Due to the lack of research in academia and the abundance of gray literature materials (e.g., blog posts) published by software practitioners on this topic, we decided to carry out an MLR according to the guidelines provided by Garousi et al. (2019). The activities conducted during the MLR are given in Fig. 1. The execution of the search queries, the application of the inclusion/exclusion criteria, and the extraction of relevant smell data were conducted by two authors independently. Conflicting cases were resolved in an additional online meeting between the two authors. The MLR was conducted between 4 and 18 March 2021.

Search Strategy Google Scholar and Scopus were used for searching the academic literature as they include major publication databases (Neuhaus et al. 2006). For gray literature, the authors utilized the Google Search engine according to the suggestions of Godin et al. (2015). We defined the following search query string to retrieve materials relevant to the research question.

“Code Comment” AND (“Guideline” OR “Smell” OR “Bad Practice” OR “Anti-pattern”)

The search query was defined according to previous MLR studies on identifying various anti-patterns in software engineering and we selected keywords that can capture relevant sources (Garousi and Küçük 2018; Qamar et al. 2022; Doğan and Tüzün 2022). All the results provided by the search engines were analyzed thoroughly. The authors applied inclusion/exclusion criteria to only include sources related to the code comment smells. After creating the initial pool of sources, we also manually conducted backward and forward snowballing (checking references and citations of papers) on the academic papers not to miss any related studies, as suggested by Keele et al. (2007). Backward snowballing is the application of inclusion/exclusion criteria to the references of a paper that is already in the pool of sources. In forward snowballing, we applied the inclusion/exclusion criteria for the studies citing a paper that is already in the pool of sources. We iteratively applied backward and forward snowballing until there were no more primary studies to be added.

Inclusion/Exclusion Criteria We carefully designed the inclusion/exclusion criteria to incorporate relevant and high-quality sources. We included the sources (both white and gray) that answered “yes” to all the following questions:

  • Does the source discuss bad or good practices about code comments?

  • Is the source in English, and is it fully accessible?

  • Does the source contain author information?

  • Does the source contain non-duplicate information? (not restating another source)

  • Is the source published in a peer-reviewed conference or journal? (only white literature)

Fig. 1 Flow diagram of the MLR stages

Final Pool of Sources Applying the search query returned 383 sources from Google Scholar, 16 from Scopus, and 144 from Google Search. To avoid missing any studies, the two authors performed separate searches on the selected databases and applied the inclusion/exclusion criteria to the search results. When there was a disagreement on the selection of primary studies, the third author was consulted, and the problems were discussed until we reached a consensus. Additionally, we found nine sources from the white literature as a result of backward and forward snowballing. The final list consisted of 84 relevant sources, of which 55 were retrieved from the gray literature and 29 from the white literature.

Data Extraction After compiling the list of 84 sources identified in the previous phase, two authors independently analyzed them. We used a shared spreadsheet to encode each specific smell type encountered in the sources by giving it a name. Upon adding a smell type to the spreadsheet, each author could either use one of the previously defined categories or add a new one. After obtaining the initial set of categories, we merged the related ones. The categories were finalized after a few rounds of discussion among the authors. The table of the sources and the smell types discussed in them is available online.Footnote 1

3.2 RQ2. How Often Does Each Smell Type Occur In Practice?

To answer how frequently code comment smells appear in software engineering practice, we manually analyzed inline comments of four Java and four Python projects from GitHub. The following sections describe the project selection, comment sampling, and manual labeling processes.

Project Selection Because there was a large number of projects to choose from, we filtered them first. We selected the top 1000 Java and top 1000 Python projects according to the number of GitHub stars to include popular projects used by the community. Next, in order to remove personal or small projects, we filtered out projects that had fewer than ten contributors. We then randomly selected projects from heterogeneous domains (e.g., mobile applications) and of different sizes (in terms of lines of code). While choosing the projects, we also checked whether the projects were in English (according to their README files) and whether they were real-life projects (e.g., not education-related). Moreover, we only included active projects, i.e., projects that had at least one commit in the last month. A sketch of the filtering step is given below. In the end, we selected the Anki-Android (a flashcard memorization app),Footnote 2 Jitsi (video conferencing service),Footnote 3 Moshi (JSON library),Footnote 4 and Light-4j (microservices framework)Footnote 5 projects for Java, and the Requests (HTTP library),Footnote 6 Scrapy (web crawling framework),Footnote 7 Kivy (UI framework),Footnote 8 and Scikit-learn (machine learning framework)Footnote 9 projects for Python. The selected projects with their GitHub stars, number of contributors, and lines of code (LOC) are shown in Table 2. LOC counts only code lines (non-empty, non-comment) written in the project’s main language (either Java or Python).
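The following is a minimal sketch of the selection filters described above, assuming project metadata (stars, contributor count, last commit date, etc.) has already been retrieved; the Project fields and their names are illustrative assumptions, not an actual GitHub API schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative metadata record; field names are assumptions for this sketch.
@dataclass
class Project:
    name: str
    language: str            # "Java" or "Python"
    stars: int
    contributors: int
    last_commit: datetime
    readme_in_english: bool
    is_real_life: bool       # e.g., not an educational or sample project

def passes_selection_criteria(p: Project) -> bool:
    """Apply the filters described in the project selection step."""
    active = datetime.utcnow() - p.last_commit <= timedelta(days=30)
    return (
        p.contributors >= 10     # drop personal or small projects
        and p.readme_in_english
        and p.is_real_life
        and active               # at least one commit in the last month
    )

# candidates would be the top-1000 starred projects per language:
# selected = [p for p in candidates if passes_selection_criteria(p)]
```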

Table 2 Details of the selected projects

Comment Sampling We used a comment parser toolFootnote 10 to find the comment locations in the Java source files of the projects. We filtered out documentation comments (beginning with "/**") to include only inline comments. For Python, we located the comments by searching for hash symbols in the source code. The number of inline comments extracted from each project is presented in Table 2. Due to the large number of inline comments in the selected projects, we found it infeasible to analyze all of them manually. Instead, we randomly sampled a statistically significant subset of the comments from the selected eight projects and analyzed them. The sample size was determined via Cochran’s sample size formula (Cochran 2007):

$$ n = \frac{z^{2}\,p\,(1-p)/e^{2}}{1+\frac{z^{2}\,p\,(1-p)}{e^{2}N}} $$

The sampling was done with a \(95\%\) confidence level and a \(5\%\) margin of error. We chose \(p=0.5\) since the proportion of smelly comments was unknown. The number of sampled files and comments for each project is shown in Table 2. For each project, we randomly sampled code files (ending with “.java” or “.py”) and selected all the inline comments from those files. Files were sampled until the required sample size was reached. We used file granularity to make it easier for the authors to label the comments with high confidence, as the context of a comment can otherwise be difficult to understand. The sample size calculation is sketched below.
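A minimal sketch of the sample-size computation using the formula above; the population of 10,000 comments in the usage example is an invented illustrative number, not one of the projects’ actual comment counts.

```python
import math

def cochran_sample_size(population: int, z: float = 1.96, p: float = 0.5, e: float = 0.05) -> int:
    """Cochran's sample size with the finite population correction shown above.

    z=1.96 corresponds to a 95% confidence level, e is the margin of error,
    and p=0.5 is the most conservative proportion assumption.
    """
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)   # infinite-population sample size
    n = n0 / (1 + n0 / population)           # finite population correction
    return math.ceil(n)

print(cochran_sample_size(10_000))  # -> 370 comments to sample
```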

Fig. 2 Snapshot of the tool developed for labeling the comments

Comment Labeling Process We developed a web-based tool to label the sampled comments manually. A sample snapshot of the tool is shown in Fig. 2. The tool is user-authenticated so that the labelers cannot see each other’s labels. The tool presents one comment at a time and requires the labeler to categorize it into one of the smell types. Labelers are able to scroll through the code snippet, see the number of labelings they have done, edit previous labels, and see the available smell categories. The tool also has a section where labelers can add their opinions about the given comment if there is something that needs to be taken into consideration. Additionally, users are required to indicate their confidence in the categorization as well as the related code part that the comment points to.

The labeling process was executed by two authors independently. The authors then conducted a meeting to resolve the conflicts. In the case of disagreement, the third author was asked to provide categorization, and the final labels were determined by voting.

Before labeling the selected projects, we sampled a hundred comments from the HiveFootnote 11 project to test the application and determine the average labeling time for a single comment (which was about a minute). During this process, we found an additional smell type (vague comments) that was not encountered during the MLR. The details of this smell type are discussed in Section 4.

3.3 RQ3. What is the Perception of Software Practitioners About Inline Comment Smells?

We have conducted an online survey through Google Forms to answer this research question. The survey’s goal was to determine whether software practitioners agree with our taxonomy of inline comment smells and to gather their opinions about the effect of the smells on code maintainability or comprehension.

Survey Protocol Our survey consisted of two parts. In the first part, we asked demographic questions as well as the proportion of their daily work time spent on code writing, reading, and reviewing activities. Specifically, we asked the following questions:

  1. What is your current job title? [Text]

  2. For how many years have you been working in the software industry? [Likert scale]

  3. What percentage of your daily work time do you spend on writing code? [Likert scale]

  4. What percentage of your daily work time do you spend on reading code? [Likert scale]

  5. What percentage of your daily work time do you spend on reviewing code? [Likert scale]

In the second part, we first asked whether they agreed that inline code comments help them understand the code better (Likert scale). Afterward, for each smell type, we showed its description with an example and asked whether they agreed that the given type of inline comment is a smell, alongside their reasoning. We also asked how often they see this kind of comment and their opinion about its impact on software maintainability and code comprehension. The questions asked in the second part of the survey for each smell type are listed below:

  1. Do you agree that the given type of comment is a smell and should not be written in the source code? [Likert scale]

  2. Can you please explain your reasoning? [Text]

  3. How often do you encounter this type of comment? [Likert scale]

  4. In your opinion, how does this comment type impact the software maintainability/code comprehension? [Likert scale]

At the end of the survey, we had an open-ended section asking whether they had anything else to say about the smells or a new smell category they would like to add.

Recruitment We recruited participants through the networks of the authors and posted the survey on the Twitter and LinkedIn platforms. Respondents participated in a raffle for a $100 Amazon Gift Card. Before publishing the actual survey, we conducted a pilot test with five engineers from the authors’ network to ensure that the survey was clear and understandable. During the pilot test, we received feedback on the completion time (\(\approx \)12 minutes on average) and on some survey parts being ambiguous. We addressed the ambiguous parts in the actual survey. We did not include the responses to the pilot test in the survey results.

3.4 RQ4. Do Developers Of Open-Source Projects Accept Pull Requests Or Issues To Remove Inline Comment Smells?

We manually submitted a total of 53 pull requests and issues to the eight projects that we used during labeling. Before submitting any pull requests, we first checked the contribution guidelines of each chosen project to make sure we did not violate the expected pull request structure.

For each chosen project, we opened pull requests and issues to fix comment smells. The smells were chosen from the comments we had manually labeled such that we had at least one smell from each category (except for the attribution smell type, since we did not encounter it in the labeled projects). Within each category, the comments were sampled randomly. By randomly choosing comments from different projects, we aimed to ensure that the comment smells represent a diverse set of scenarios found in real-world software projects. This approach helped us capture various types of inline comment smells, including both straightforward and complex cases, which may require different degrees of understanding and effort to fix.

In each pull request, we suggested removing or shortening a comment smell. We explained the smell type and our reasoning for such a change. The pull requests we submitted consisted of different types of comment smells. However, we had different approaches while removing each type of comment. For obvious, beautification, commented-out code, and irrelevant comments, we proposed to remove the whole comment whereas comments with too much information were shortened considerably. Misleading comments were handled by creating a pull request that fixed the comment to be aligned with the code. For task smells, we opened a pull request to remove the comment from the code and created an issue in GitHub instead regarding that task. For vague comments, we opened pull requests if we understood the issue and the fix was trivial. Otherwise, we opened issues by asking the developers to elaborate and explain the comment. For non-local information smells, we opened pull requests where we moved the comment to its correct place in the code. The pull requests and issues that we opened are available in our replication package.

Table 3 Taxonomy of inline code comment smells

It is unconventional to modify unrelated parts of the codebase at the same time; hence, we submitted a separate pull request for each smell. Generally, modifying unrelated parts of the codebase in a single pull request is also not welcomed by developers. Additionally, the developers of open-source projects are likely to reject pull requests that do not fix a certain issue, so we avoided submitting too many pull requests to one project [4]. After the pull request submissions, we counted the number of accepted and rejected pull requests, as well as those that did not receive any feedback, to report the results.

4 Results

This section discusses our findings from the multivocal literature review, the manual labeling process of the eight projects, and the practitioners’ survey.

Fig. 3 Example of misleading smell from the Light-4j project

4.1 Final Taxonomy (RQ1)

The multivocal literature review led to ten types of source code commenting practices that are considered smells by practitioners. We also found an additional smell type, called vague comments, during the labeling process of the Java projects and observed this type of smell in the Python projects as well. We did not receive any other smell type suggestions from the practitioners’ survey. Table 3 shows the final taxonomy of these 11 smell types alongside their descriptions, an example, and the number of sources in which they were mentioned. Misleading and non-local smell types were mentioned the most and the least in our MLR sources, respectively. We present each smell type below, along with example smells from the projects we labeled.

Misleading Comments that do not accurately represent what the code does are considered misleading comments. This type of smell usually arises after a developer makes a code change but forgets to update the related comment, which creates code-comment inconsistencies and could mislead readers. It can also occur when the author forgets to update the comments of copy-pasted code. Figure 3 shows an example of a misleading comment (the comment mentions the removal of a file named test.json; however, the file is named differently in the code that follows).
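As an illustration (a constructed example, not taken from the labeled projects), the sketch below shows a comment that was not updated after the code it describes changed:

```python
import os

def cleanup(tmp_dir: str) -> None:
    # Remove the temporary test.json file created during the run.
    # Misleading smell: the code actually deletes results.json, not test.json.
    os.remove(os.path.join(tmp_dir, "results.json"))
```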

Obvious Comments that restate what the code does without giving additional information are considered obvious comments. Such comments decrease readability and clutter the source code. An example of the obvious smell is given in Fig. 4, where the comment restates the error message in the line below.
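A constructed example (not from the labeled projects) of an obvious comment that merely restates the code beneath it:

```python
import os

def load_config(config_path: str) -> str:
    # Raise an error if the config file does not exist.
    # Obvious smell: the comment restates the two lines that follow.
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"Config file not found: {config_path}")
    with open(config_path) as f:
        return f.read()
```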

No Comment on Non-Obvious Code According to the practitioners, it is generally a good practice to write comments to state the reasoning behind the code instead of what it does. Lack of comments on complex parts of the source code decreases understandability and affects software maintenance. We do not present an example for this smell type since we ignored such smells during the labeling process.

Commented-out code A code piece that is commented out is referred to as commented-out code. Developers usually comment out code snippets for debugging purposes. However, such comments should not be pushed to the codebase as they clutter the source code and are generally ignored by readers. An example of this smell from the Light-4j project is given in Fig. 5.
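A constructed illustration of leftover debugging code pushed to the codebase as comments:

```python
def apply_discount(price: float, rate: float) -> float:
    # print(f"applying discount: {rate}")   # commented-out debug statement
    # old_price = price                     # leftover from a previous version
    return price * (1 - rate)
```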

Irrelevant Irrelevant comments are unnecessary comments that do not intend to, and do not, provide any information about the code or project. Figure 6 shows a sample irrelevant comment from the Requests project.

Task Task comments explain potential future work or work that was already completed. These are often “TODO” or “FIXME” types of comments. Task comments can also be written without such keywords (e.g., “should we optimize this?”). Although most modern code editors highlight task comments with keywords, developers sometimes forget to remove them even when the task is completed (Storey et al. 2008). They often pile up and litter the source code. It is generally suggested to track such tasks in a project/issue management tool instead. Figure 7 shows an example task comment about optimizing a piece of code.
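A constructed example of task comments left in the code:

```python
def parse_records(lines: list[str]) -> list[dict]:
    # TODO: optimize this loop for very large inputs
    # FIXME: handle malformed lines instead of skipping them silently
    records = []
    for line in lines:
        if line.strip():
            records.append({"raw": line})
    return records
```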

Too much Information This type of comment smell happens when a comment is overly verbose. Such comments often contain historical information or are very detailed but do not provide meaningful insights, which decreases readability. An example of this smell type is given in Fig. 8.
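A constructed example of a comment carrying far more history than the code requires:

```python
def normalize(text: str) -> str:
    # Too much information smell: the history below is not needed to understand
    # the single line of code that follows. (Constructed example.)
    # This function originally also stripped HTML tags; that behavior was moved
    # to a separate helper after a long performance discussion, partially
    # reverted, and then moved again in a later refactoring.
    return text.strip().lower()
```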

Attribution Comments that give information about who wrote the code are referred to as attribution comments. Such comments are unnecessary as this information can be retrieved from most version control systems such as Git.Footnote 12 We did not find an occurrence of attribution smell in our labeled projects.

Fig. 4 Example of obvious smell from the Scrapy project

Beautification Sometimes, developers want to distinguish parts of the source code. They usually use a sequence of arbitrary characters, often called separators, to indicate those parts. Such comments may create distractions and are generally considered unnecessary. An example of the beautification smell from the Anki-Android project is presented in Fig. 9, where the comment is used to indicate the location of class methods.
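A constructed example of a separator comment used to mark a section of a class:

```python
class Cache:
    def __init__(self) -> None:
        self._store: dict[str, object] = {}

    # ============================================================
    # ====================  PUBLIC METHODS  =====================
    # ============================================================
    def get(self, key: str):
        return self._store.get(key)
```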

Non-Local Comments should describe the code near them so that readers can easily understand the source code. We refer to comments that give system-wide details or mention a code part far away from the comment location as non-local comments. Figure 10 shows an example of this smell, where the comment mentions a call to a procedure during a specific event, which is not related to the nearby code.

Vague One of the reasons behind writing code comments is to make readers easily understand the code. Thus, comments should also be intelligible. We label comments that are difficult to comprehend as vague comments. An example of a vague comment from the Kivy project is presented in Fig. 11.
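A constructed example of a vague comment that cannot be interpreted from its surroundings:

```python
def merge(a: dict, b: dict) -> dict:
    # careful here, see the issue from before
    # Vague smell: neither "here" nor "the issue" can be identified from the comment.
    return {**a, **b}
```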


4.2 Empirical Results (RQ2)

We discuss the results of the manual labeling process on the eight Java and Python projects in this section. In the first stage of the process, two authors independently labeled the sampled comments, followed by a conflict resolution meeting. The initial agreement between the two authors was 0.53 according to Cohen’s kappa score (Cohen 1960) and increased to 0.94 after the meeting. The remaining conflicts were resolved in an additional meeting with the third author. Table 4 shows the number of comments we found for each smell type. We excluded the smell type no comment on non-obvious code from the labeling process as it requires analyzing the code files line by line, and determining whether a given code piece needs a comment is not a trivial task. The distribution of comments and smells is shown in Figs. 12 and 13, respectively. The labeled inline comment smell dataset is available online.Footnote 13

Fig. 5 Example of commented-out code smell from the Light-4j project

The manual analysis of the eight projects revealed a considerable number of comment smells, and the most observed smell type was obvious comments. This smell type occurred the most out of all smells in all eight projects. Task comments had the second highest number of occurrences in seven out of eight projects. They only ranked third in Jitsi, a Java project, where they were surpassed by commented-out code comments.

We observed all types of smells, except for attribution, which had no occurrences at all, in at least two projects. After this smell type, the irrelevant and non-local comments had the fewest occurrences, appearing in only two out of eight projects. Comments with too much information occurred in only four projects. Hence, these four types of comment smells were observed considerably less often than the other types. The remaining smell types were observed occasionally, although their ratios differ from project to project.

The projects with a higher contributor count (>400) have a higher ratio of proper comments (comments that are not smells) than the other projects. These projects also have higher star counts (>40,000) than the others. We also observed that all three of these projects (Requests, Scrapy, and Scikit-learn) were developed in Python. Hence, the Python projects that we analyzed had lower percentages of comment smells than the Java projects.

There are some smell types whose ratios differ significantly across projects (e.g., task). To investigate this, we analyzed the GitHub repositories of the projects to see if there were guidelines about code commenting. We found instructions about writing task and beautification comments in the Anki-Android guidelines,Footnote 14 as shown in Fig. 14. Moreover, the Jitsi guidelinesFootnote 15 point to the old Java code conventions that reference commented-out code. On the other hand, Scikit-learn takes a different approach in its code guidelines,Footnote 16 encouraging contributors to avoid obvious, vague, and irrelevant comments and to add comments where necessary. According to its code review guidelines, the questions that should be asked during code review are as follows: “Is the code easy to read and low on redundancy? Should variable names be improved for clarity or consistency? Should comments be added? Should comments be removed as unhelpful or extraneous?”. Also, all the Python projects refer to the PEP 8 code style guideFootnote 17 in their contribution guidelines. Figure 15 shows a snapshot from the PEP 8 guidelines mentioning that comments should not contradict the code (misleading smell), should be easily understandable (vague smell), and should not state the obvious (obvious smell). Although there is a resource on Java code conventions published by Oracle,Footnote 18 it was not mentioned in the contribution guidelines of the selected Java projects.

Fig. 6 Example of irrelevant smell from the Requests project

4.3 Survey Results (RQ3)

Initially, we received 66 responses to our survey. Since this was an online survey with anonymous respondents, we filtered out potentially unrelated or low-quality responses and kept only the responses whose job titles were related to software engineering. We also analyzed the responses of those with non-software-engineering titles and observed that they consistently did not answer the open-ended questions. Figure 16 shows the roles of the respondents. The majority of them were software engineers (we combined the engineer and developer roles). We also received responses from practitioners with positions such as head of product, intern, postdoc, etc. The distribution of software engineering experience of the respondents can be seen in Fig. 17, where the majority had mid- to senior-level engineering experience. Respondents’ activities related to writing, reading, or reviewing code are shown in Fig. 18. They mostly spend their daily work time writing and reading code rather than conducting code reviews.

Fig. 7 Example of task smell from the Moshi project

Fig. 8 Example of too much information smell from the Scrapy project

Fig. 9 Example of beautification smell from the Anki-Android project

Fig. 10 Example of non-local smell from the Jitsi project

We asked a general question in the survey to practitioners on whether inline code comments help them understand the code better. The distribution of the responses is given in Fig. 19, where 66% of the respondents agreed that they are helpful for code comprehension.

We further evaluate the perceptions of the practitioners in three categories: smell taxonomy, smell occurrences, and impacts of smells.

4.3.1 Perceptions on Smell Taxonomy

In our survey, for every smell type, we asked practitioners “Do you agree that the given type of comment is a smell and should not be written in the source code?” and asked them to explain the reasoning behind their answer. They could choose an answer from the following options: strongly disagree, disagree, neutral, agree, and strongly agree (Fig. 20).

We saw that practitioners mostly disagreed with task comments (51.2%) being considered smells. Part of the reasoning was that this type of comment can be useful during development, as it reminds the engineer what task to do, but should not remain in the final product. Others who agreed that it is a smell mentioned that task comments should be kept in the issue-tracking software instead of the code. A couple of respondents made the following remarks about task comments:

“In big codebases, these todos never get done. It is better to create a ticket etc. But this can help the reader see incapabilities of the code which I think helps comprehensibility.”

“It’s usually used to grab the attention of the developer as a reminder to complete this part of the project code.”

“TODO comments are helpful if they’re used properly and maintained (handled). They can help other developers understand how the code will be evolving, or they can be helpful as reminders.”

Fig. 11 Example of vague smell from the Kivy project

A number of respondents also disagreed that beautification comments should be considered a smell. One reason given is that they help organize the code by making it visually more comprehensible. Most practitioners agreed that such comments make the code more readable in cases where it is too long or complex. However, one participant mentioned that these comments were necessary when IDEs were not good at syntax highlighting, but since IDEs have improved in recent years, beautification comments are no longer needed. The following comments were made by the respondents about the beautification smell:

“This is a nice thing to do when your code is seen by starters who just started coding, but in industry, there should not be a need to see this comment to understand variables are declared there, etc.”

“If there are styling rules (i.e., variables first, then constructors, functions, helpers, etc.), then it is not needed. But it may be helpful if the file is too long.”

“When the class or respective file is too large, which could also be another code smell (God class), then it might be okay to use such comments.”

Table 4 Occurrences of the smell types in the selected projects

Respondents who disagreed with the non-local information type of comment being a smell mentioned that such comments may help unfamiliar developers understand how the code works. On the other hand, we received responses pointing out that this type of comment should be written in the documentation, not in the source code, and may contain sensitive information.

For other types of smells, such as commented-out code and too much information, some practitioners argued that these comments might be helpful for developers. For commented-out code comments, some argued that they are useful during debugging but should not remain in the final product, whereas others said they are unnecessary because removed code can be retrieved from version control systems. For comments with too much information, some argued that they can be beneficial by making the explanation of the code very clear. One response pointed out that such comments can be useful in open-source applications where there is no other place to store documentation and historical information. However, many practitioners agreed that these comments are hard to maintain and should be put in the documentation.

Fig. 12 Distribution of comment types (all projects combined)

Fig. 13 Distribution of smells (all projects combined)

Fig. 14 A segment of Anki-Android code guidelines

Fig. 15 PEP 8 guidelines on code comments

Fig. 16 Roles of survey respondents

Another smell type that received interesting responses was irrelevant. While 73.1% of the practitioners agreed that this is a smell, there were responses indicating that such comments can sometimes be entertaining. We received the following remarks about the irrelevant smell:

“Does not contain any information regarding development and is distracting.”

“I think it is a code smell but I personally would like to see these kinds of uplifting comments if they occur only in a few places.”

The respondents generally agreed with the other types of smells according to the number of agree and strongly agree responses. The statistics of the respondents’ agreement on the comment smells are shown in Table 5.

Fig. 17 Respondents’ years of software engineering experience

Fig. 18 Percentages of respondents’ daily work activities related to coding

Fig. 19 Distribution of responses to the survey question: Do you agree that inline comments help you understand code better?

Fig. 20 Distribution of responses to the survey question: Do you agree that the given type of comment is a smell and should not be written in the source code?

Table 5 Statistics of the occurrences of the smell types observed by the survey respondents

4.3.2 Perceptions on Smell Occurrences

For every smell type in our taxonomy, practitioners were asked “How often do you encounter this type of comment?” in the survey. They were expected to choose an answer from the given options: never, rarely, sometimes, often, and nearly always. The distribution of responses can be seen in Fig. 21. From their answers, we computed a weighted average by assigning the following weights to the options, respectively: 0.0, 0.25, 0.5, 0.75, and 1.0. We then ranked the smell types (see Table 6) by their weighted average, starting with the most frequently encountered one. We omitted the no comment on non-obvious code smell type from the table as we do not have data from our labeling process to compare it to. The ranking step is sketched below.
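A minimal sketch of the weighted-average ranking; the response counts below are invented placeholders, not the actual survey data.

```python
# Weights assigned to the frequency options, as described above.
WEIGHTS = {"never": 0.0, "rarely": 0.25, "sometimes": 0.5, "often": 0.75, "nearly always": 1.0}

def weighted_average(counts: dict[str, int]) -> float:
    """Weighted average of the frequency responses for one smell type."""
    total = sum(counts.values())
    return sum(WEIGHTS[option] * n for option, n in counts.items()) / total

# Placeholder response counts for two smell types (illustrative only).
responses = {
    "task": {"never": 1, "rarely": 5, "sometimes": 10, "often": 15, "nearly always": 10},
    "obvious": {"never": 10, "rarely": 20, "sometimes": 8, "often": 2, "nearly always": 1},
}

ranking = sorted(responses, key=lambda smell: weighted_average(responses[smell]), reverse=True)
print(ranking)  # most frequently encountered smell type first
```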

Fig. 21 Distribution of responses to the survey question: How often do you encounter this type of comment?

When we compared the results of our labeling process to the survey results, we saw some differences and similarities. We encountered obvious comments the most, but according to practitioners this smell type ranked lower; the majority (73.1%) said they encounter such comments rarely or never. On the other hand, task comments were the most frequently encountered smell type according to practitioners, with the highest weighted average. This smell type ranked as the \(2^{nd}\) most encountered during labeling.

While labeling the projects, we never encountered an attribution comment. Similarly, this type was the \(2^{nd}\) least encountered smell type according to the practitioners. In addition, irrelevant comments ranked the lowest as the least encountered smell type by practitioners.

The comparison of labeling and survey results suggests that task, vague, and beautification comments occur more frequently than other smell types, whereas irrelevant and attribution comments occur less frequently.

Table 6 Comparison of smell ranks obtained from the labeling process and practitioners’ survey

4.3.3 Perceptions on Impacts of Smells

Figure 22 shows the distribution of responses to the survey question about the effects of different smell types on software. In general, survey respondents found misleading comments to negatively affect software maintainability or code comprehension the most, as indicated by 90% of the respondents. The practitioners also found irrelevant, vague, and no comment on non-obvious code smells to have negative effects on software. On the other hand, task, beautification, too much information, and non-local information types of comments could positively impact software, as argued by the practitioners. The other types of smells were found to have negligible effects.


4.4 Acceptance of Pull Requests and Issues (RQ4)

In total, we opened 40 pull requests and 13 issues for the projects. Some changes were accepted as they were, some were rejected because the developers preferred to keep task comments in the code or did not accept minor changes, and some did not receive any feedback. Since we opened both a pull request and an issue for task smells, we counted a smell as fixed if either the pull request or the issue was approved. Thus, in total, we had 45 fix attempts (by combining the eight task smell fixes). Out of these 45, 11 were approved, 4 were ignored, and 30 were rejected. Our acceptance rate was 27%, computed as approved fixes over approved plus rejected fixes (excluding ignored cases), i.e., 11 / (11 + 30) ≈ 26.8%.

Fig. 22 Distribution of responses to the survey question: In your opinion, how does this comment type impact the software maintainability/code comprehension?

The average response time for the project maintainers’ feedback was around 4.5 days. In the Kivy project, three of the requests were rejected almost two months after we created them, which strongly affects this average. Without those three requests, the average response time is around 9 hours. The responses we received from the project maintainers can be summarized as follows:

  • Scikit-learn rejected all PRs and issues without providing a reason.

  • The Requests project accepted only one issue, for which they provided an explanation regarding a vague smell. They rejected the other PRs, with developers replying that they do not accept minor changes.

  • The eight requests we submitted for the Light-4j project were all accepted. In one of the issues for this project, we pointed out a vague comment and asked them to provide more detail. They explained that the comment was in fact misleading and removed it later on.

  • The Moshi project accepted the removal of an obvious comment but rejected the issues on task comments, as they preferred to keep their tasks in the code.

  • The Jitsi project accepted the removal of a misleading comment. However, they also rejected the task comment changes, as they preferred to keep them in the code.

  • The Anki-Android project either ignored or rejected the PRs we submitted, with no explanation.

  • The Kivy project ignored three and rejected three of the six PRs we submitted.

  • The Scrapy project rejected all of the PRs and issues without providing a reason, except for one, where they found a related issue in our PR suggesting the removal of a task comment.

Overall, our PR acceptance rate was 27%. The number of accepted, ignored, and rejected pull requests/issues is given in Fig. 23.

Fig. 23 Acceptance of pull requests and issues for each smell type

5 Discussion

This section discusses the potential reasons behind the results we obtained from labeling comment smells and from the survey, and provides implications for researchers, practitioners, and educators.

5.1 Revisiting Research Questions

We revisit each of the research questions that we defined in the Introduction section to further discuss our findings and their possible explanations.

5.1.1 RQ1. What Types Of Inline Comment Smells Are There?

We have identified 11 inline comment smell types as a result of a multivocal literature review and manual analysis of open-source Java and Python projects. We also gathered practitioners’ points of view on these smells through a survey. Although practitioners generally agreed with the taxonomy, there could be cases where comments we defined as smells might not be considered smells in certain circumstances. Practitioners indicated that task comments might be helpful to grab developers’ attention. For the beautification smell, they pointed out that such comments can ease reading the code. We also encountered guidelines encouraging task and beautification comments in the Anki-Android project. Survey respondents also mentioned that comments with too much information and comments giving non-local information might help newcomers understand the project better.

Since most of the survey respondents agreed with our smell taxonomy (except for task comments), we included those types in our final taxonomy. For task comments, because we received a considerable number of responses saying they are generally forgotten, we also considered them a smell and kept them in the final taxonomy.

Although we have conducted an MLR, analyzed Java and Python projects, and carried out a practitioner survey, there might be additional smell types we are unaware of, both in these languages and in others. More specifically, different languages might require different commenting practices depending on their expressiveness (Koornhof 2015). Considering that the languages we manually analyzed are relatively easy to read, the possibility of different smell types occurring in less expressive languages should not be overlooked. We should also mention that our taxonomy is based on the semantics of comments. Structural smells may also exist in projects with guidelines on comment structure. For instance, the PEP 8 guidelines state that the first letter of a comment should be capitalized, that there should be two spaces after sentence-ending periods in multi-sentence comments, and so on.
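As a small hypothetical illustration (the function and its comments below are ours, not taken from any of the labeled projects), the first comment violates the PEP 8 conventions mentioned above, while the second follows them:

```python
def clamp(value, low, high):
    # keeps the value within the given range. assumes low <= high
    # (structurally non-compliant: not capitalized, one space after the period)
    return max(low, min(value, high))


def clamp_checked(value, low, high):
    # Keep the value within the given range.  Assumes low <= high.
    # (structurally compliant: capitalized, two spaces after the sentence-ending period)
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))
```

A purely structural checker could flag the first comment even though its content is otherwise unproblematic, which is why we treat such issues separately from the semantic smells in our taxonomy.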

We encountered various types of comments that could be considered smells but labeled them as not a smell due to our labeling procedure. One such case is an inline comment restating the documentation comment (duplication). Because we excluded documentation comments from labeling, such comments were labeled as not a smell. Duplicate comments may lead to code-comment inconsistencies, as indicated in the study of Blasi et al. (2021). In Python, some comments stated which type of object should be assigned to a variable. Although we labeled them as not a smell, they could be considered a smell if a project requires type hinting. Comments referring to common language keywords (e.g., __name__ in Python) could be considered obvious smells, since developers with enough experience already know them and they can easily be found through an online search. However, we labeled them as not a smell since they might be helpful to novices.
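The two borderline cases above can be illustrated with small hypothetical Python snippets (ours, not drawn from the labeled projects):

```python
# Case 1: a comment stating the expected type of a variable. We labeled such
# comments as not a smell, but a project that mandates type hints could view
# the comment as redundant compared to the annotated form below it.
timeout = 30        # timeout is an int, in seconds
timeout: int = 30   # the same information expressed as a type hint

# Case 2: a comment restating a common language construct. Experienced Python
# developers already know this idiom, so the comment could be seen as an
# obvious smell, yet it may still help novices.
if __name__ == "__main__":  # run only when the file is executed as a script
    print("executed as a script")
```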

5.1.2 RQ2. How Often Does Each Smell Type Occur In Practice?

There can be various reasons why a smell type occurs frequently or rarely in practice. We came up with several possible explanations:

  • During labeling, we did not encounter any attribution comments in any of the projects, whereas we encountered every other smell type in at least two projects. This might be because version control platforms such as GitHub are now widely used, so there is no longer a need for attribution comments.

  • We inspected the comment guidelines of the projects, which might have an impact on which smell types occur more or less frequently. We observed that Anki-Android has the highest percentage of task and beautification comments, which might be caused by its comment guidelines encouraging contributors to use these types of comments. Jitsi’s code conventions regarding commented-out code might have led to Jitsi having the highest percentage for this smell type. Scikit-learn’s guidelines imply that obvious or vague comments should be pointed out during code review, which might be why it has lower percentages of these smell types. Although the Python projects included guidelines on avoiding specific types of comments, we observed that developers did not comply with them fully. The reason may be that developers disregard comment guidelines, as also observed in the study of Rani et al. (2021) on comment styles.

  • We observed that all four Python projects have lower percentages of comment smells than all four Java projects. This might be due to the number of stars and contributors of the projects: every Python project has more stars and more contributors than every Java project, which might be why we encountered fewer smells in the Python projects. Additionally, the guidelines of all four Python projects include statements about avoiding obvious, vague, and misleading comments, whereas the guidelines of the four Java projects do not (see Table 7). Having guidelines that encourage avoiding these types of comment smells might be another reason why we encountered fewer smells when labeling the Python projects. Since Python and Java are two different programming languages, language pragmatics might also have played a role in these results. However, further research comparing the two languages would be needed to confirm this.

5.1.3 RQ3. What is the Perception Of Software Practitioners About Inline Comment Smells?

Our survey results indicate that opinions differ on whether a comment type should be considered a smell and on how often each smell type is encountered. This can be due to several reasons:

  • Practitioners who completed our survey had varying levels of software engineering experience, which might have affected their views on the smells. For example, a practitioner with 13+ years of experience is more likely to have encountered a certain type of smell than a practitioner with one year of experience.

  • Practitioners work in different sectors and workplaces. Considering that each workplace and sector can have a different work culture, this might affect their answers in the survey. For example, task and beautification comments received an almost equal number of agreements and disagreements regarding whether they should be counted as smells. Some workplaces might encourage these types of comments whereas others might not; hence, practitioners hold different views on these smell types.

  • All eight projects we labeled are open-source projects. The practitioners who participated in our survey do not necessarily work on open-source projects, which might have caused differences between the survey and the empirical results. For example, according to the practitioners, the most frequently encountered smell type is commented-out code, whereas our empirical analysis showed that obvious comments are the most common smell type in the labeled projects.

Table 7 Projects with contribution guidelines mentioning which types of smells should be avoided (asterisks refer to indirect references)

5.1.4 RQ4. Do Developers Of Open-Source Projects Accept Pull Requests Or Issues To Remove Inline Comments Smells?

Due to the manual submission of pull requests and issues, we could not submit hundreds of them. In addition, we observed that developers respond negatively to a high number of requests dealing with similar issues, as they see them as spam. Hence, we chose to attempt fixing at most one comment from each smell type for each project. We attained an acceptance rate of 27%. On average, the acceptance rate of pull requests to open-source projects is 29% (Medeiros et al. 2019), so although our rate is low, it is very close to the average.

Our results indicate that the acceptance of removing inline comment smells heavily relies on the preferences of individuals and teams. In total, 11 of our pull requests or issues were accepted, seven of them by the developers of Light-4j. This suggests that some individuals and teams are more welcoming toward requests that remove an inline comment smell. Thirty of the requests we opened were rejected. The fact that four of them (13.3%) were rejected because developers preferred to keep task comments in the code base highlights that some teams may prioritize keeping tasks in the code base over the potential threats of inline comment smells. Furthermore, eight requests (26.7%) were rejected because the developers do not accept minor changes. This indicates that some teams may be resistant to changing code that is already working, even if removing the comment smell would be a beneficial, albeit minor, adjustment. Lastly, 18 requests (60%) were rejected without the developers providing any information about their decision. It is likely that they saw our PRs as spam or decided not to consider them since the PRs were neither fixing a bug nor adding new functionality.

Two of the projects to which we submitted requests explained that they prefer keeping task comments in the code base. We observed a similar case in our survey, where some developers mentioned that they do not consider task comments as smells. The results indicate that programmers’ definition of comment smells may not align with that of the literature, owing to the preferences of individuals and teams. However, according to the feedback on our requests and the survey results, the smell types most likely to receive inconsistent opinions are task and beautification comments. Thus, how well programmers’ definitions of comment smells align with the literature can vary depending on the smell type.

In general, the acceptance of removing inline comment smells is largely determined by the preferences of individuals and teams, which may explain the low acceptance rate we obtained. Although removing comment smells is inherently valuable for the code base, developers of open-source projects generally do not favor pull requests that propose such a change. Thus, our acceptance rate indicates that removing comment smells can be beneficial for the code base, though it is not a priority for developers.

5.2 Implications for Researchers

Researchers can utilize our findings in various ways. In this subsection, we explain how and what types of research can be conducted on comment smells in relation to our findings.

  • We established a taxonomy of comment smells and labeled 2447 comments. Researchers can use the published dataset to develop tools that detect comment smells and identify their types (a minimal sketch is given after this list), which contributes to increasing the readability and writability of code. The taxonomy and the labeled comments can also serve as a reference point for identifying comment smells in practice. Understanding which types of comment smells occur most often, and why, can enhance the comment quality of code.

  • The eight manually labeled projects in two different languages, Java and Python, suggest that the number of smells differs between languages. We recommend that researchers conduct similar labeling processes for other languages to see whether the number of smells correlates with the language. This can improve our understanding of why comment smells occur and how their numbers can be decreased. Additionally, by labeling more projects, the taxonomy can be extended to include more types.

  • How exposure to this taxonomy of comment smells affects comment quality is a question that is still unanswered. Case studies can be conducted to answer it: software engineers who have and have not been exposed to the taxonomy can be compared in terms of the number of comment smells they can identify or the number present in their own code.

  • We conducted a survey with practitioners to validate our taxonomy and observed that some participants did not view some of its types as smells. For example, according to some practitioners, task and beautification comments improve the quality of the code. Hence, the perspectives of people with different backgrounds or experience levels in the software industry can be further investigated.

  • Comment smells can occur at different stages of the software development cycle. For example, smells can be overlooked during code reviews. These stages can be inspected to identify which smells occur at which stages. Acquiring this knowledge can lead to finding new ways of avoiding comment smells.
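As a minimal illustration of the kind of detection tool mentioned in the first item above, the sketch below trains a simple text classifier on a labeled comment dataset. The file name and column names are our own assumptions for illustration, not the exact schema of our published dataset:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical input: one comment per row with its labeled smell type.
data = pd.read_csv("labeled_comments.csv")  # assumed columns: comment, smell_type

X_train, X_test, y_train, y_test = train_test_split(
    data["comment"], data["smell_type"], test_size=0.2, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word and bigram features
    LogisticRegression(max_iter=1000),              # simple multi-class baseline
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

Such a baseline would only be a starting point; richer approaches could add code context around each comment or use pretrained language models.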

5.3 Implications for Practitioners

We observed that only three of the eight projects had guidelines on how comments should be written. These guidelines only hinted at one to three of the smell types identified in this paper and did not cover comment writing in as much detail as other topics in the guidelines. This suggests that projects tend to have few comment guidelines.

Although these comment guidelines are relatively limited in terms of the smell types they mention, our observations imply that guidelines on comments can have an impact on comment quality, as we discussed in the previous section. Hence, we recommend that practitioners include comment guidelines in their projects to increase comment readability and writability.

Furthermore, practitioners can implement a model to detect inline comment smells automatically in their projects, as sketched below. This would facilitate the reviewing process and help code reviewers detect comment smells, ultimately increasing the quality of the comments in the projects.
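As a rough illustration of what such tooling could look like, the sketch below flags two smell types from our taxonomy, task comments and (possibly) commented-out code, in Python source files. The regular expressions and markers are our own simplifying assumptions, not part of the study or of any existing tool:

```python
import re
import sys
import tokenize

# Heuristic markers for task comments (TODO/FIXME/XXX).
TASK_MARKERS = re.compile(r"\b(TODO|FIXME|XXX)\b", re.IGNORECASE)
# Very rough heuristic for commented-out code: the comment looks like an
# assignment, a control-flow statement, or a function call.
CODE_LIKE = re.compile(r"^#\s*(\w+\s*=[^=]|if |for |while |return |\w+\(.*\))")


def find_candidate_smells(path):
    """Yield (line, smell_type, text) for comments matching a heuristic."""
    with open(path, "rb") as handle:
        for tok in tokenize.tokenize(handle.readline):
            if tok.type != tokenize.COMMENT:
                continue
            if TASK_MARKERS.search(tok.string):
                yield tok.start[0], "task comment", tok.string
            elif CODE_LIKE.match(tok.string):
                yield tok.start[0], "commented-out code (possible)", tok.string


if __name__ == "__main__":
    for line, smell, text in find_candidate_smells(sys.argv[1]):
        print(f"{line}: {smell}: {text.strip()}")
```

Heuristics like these could run as a pre-commit hook or a code-review bot, with the flagged comments left for a human reviewer to confirm.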

5.4 Implications for Educators

The software engineering education curriculum lacks training in code comment writing, and the proper use of inline comments can be overlooked in lectures.

We tried to find lecture notes on code comments via Google Search, which yielded few results. Even though we found sources that provide guidelines for comments, most of them lacked detailed information on distinct comment and smell types. These sources include lecture notes from Stanford University (Ousterhout 2015), the University of Utah (Commenting 2023), and the University of Puget Sound (Chun et al. 2023), which encourage commenting on non-obvious code and avoiding obvious comments.

In introductory courses, instructors often encourage students to write many comments, to the extent that their projects include many obvious comments explaining basic code. For example, a source from the University of Washington (University of Washington 2023), unlike the above-mentioned sources, specifically defines inline comments. However, it does not give direct guidelines for writing inline comments and even gives an example that could be labeled as an obvious comment. This example and the lack of guidelines can lead students to explain obvious code, which increases the number of smells in their code.

This claim is further supported by the blog entries of two senior software engineers (Spertus 2021; Kunk 2011). Both blog entries agree that students feel they must write many comments to get a good grade. Without proper training in comment writing, this can lead many students to write comments for the sake of writing them. While this might benefit absolute beginners learning to code, it can create a habit of explaining obvious code.

The taxonomy of comment smells can be introduced in lectures to software engineering candidates. Educators can use our study to organize a session that focuses on comment smells and their importance for code comprehension. This can raise awareness of code readability and writability, which can lead to projects having comments of better quality.

6 Threats to Validity

In the following, we discuss potential threats to the internal and external validity of the study.

Internal Validity: There are various threats to the validity of our study design. First, the search query we defined for the MLR may not capture some works on comment smells, and we may have missed important sources. We mitigated this threat by choosing the keywords based on previous similar MLR studies, and two authors executed the query in Google Search (for grey literature) to mitigate the filter bubble effect. Second, we analyzed the sources returned by Google Search and Google Scholar; there could be academic or grey literature related to the study that is not indexed by these engines. To minimize the risk of subjectivity in the MLR process, we followed the MLR guidelines of Garousi et al. (2019). During the primary study selection and data extraction, all steps were completed by two of the authors independently; when there was a conflict, the third author was consulted.

The documentation comments in Java were filtered out by checking whether a given comment starts with “/**”. However, in some cases there were method-level comments starting with “//” or “/*”; these were filtered out manually during the labeling process and replaced with other inline comments. We also ignored license and linter-disabling comments (e.g., #noqa in Python) during the labeling process. Thus, the number of inline comments shown in Table 2 is not exact.
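For illustration only, the prefix-based filtering described above could look roughly like the following sketch (our own simplification, not the exact scripts used in the study; the helper names are hypothetical):

```python
def is_documentation_comment(comment: str) -> bool:
    """Treat Javadoc-style comments (starting with '/**') as documentation."""
    return comment.lstrip().startswith("/**")


def is_ignored(comment: str) -> bool:
    """Skip license headers and linter-disabling comments such as '# noqa'."""
    lowered = comment.lower()
    return "license" in lowered or "noqa" in lowered


# Example: '/** Returns the size. */' is filtered out as documentation,
# whereas '// reset the retry counter' would be kept as an inline comment.
```

As noted above, such simple checks miss method-level comments written with “//” or “/*”, which is why a manual pass was still needed.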

We sampled the comments at file granularity, which led to 266 comments being duplicates because the same piece of code appeared in multiple places in a source file (especially in test code). This could potentially bias the analysis. However, we believe that file-level sampling helped us label the comments with higher confidence.

Although two authors labeled the comments, mislabeled comments are still possible. For instance, annotating a comment as misleading requires a complete understanding of the related code. Moreover, distinguishing vague and non-local comments from not a smell was sometimes difficult, as it required a fair amount of domain knowledge.

Furthermore, there are several threats to the validity of the survey results. As an incentive for participation, we promised a $100 Amazon gift card to one randomly chosen participant. Even though this helped us recruit developers from different backgrounds, it might also have created a sampling bias by attracting people who completed the survey only for the prize.

The survey had 52 questions, including multiple-choice and open-ended questions, and was estimated to take 15-20 minutes to complete. The length of the survey might also have caused biased answers, because the attention span of a participant decreases after spending 10 minutes on a survey (Sen 2023). This increases the likelihood of a participant answering without thinking carefully or picking a random response.

Another potential threat to the survey is that respondents may misunderstand the smell definitions. To mitigate this issue, we provided a description and an accompanying example for each smell type. We also conducted a pilot test with five software engineers from the authors’ network before sending the survey to the practitioners, and we adjusted the questions according to their feedback to make sure that the questions and smells presented in the survey were comprehensible.

External Validity: An external validity threat related to the analysis of only eight projects is the limited generalizability of the study’s findings to other software projects or contexts. Because the sample size is small and may not be representative of the broader population of software projects, the results obtained from these eight projects may not be applicable to other projects with different characteristics, development processes, team structures, or technologies.

Lastly, we submitted 53 pull requests and/or issues, which is a relatively small number from which to make general statements about developers’ perception of removing comment smells from open-source projects. Moreover, the pull requests were subject to the personal opinions of the developers reviewing them, as we encountered in several cases. All developers have different priorities regarding the code; thus, it is difficult to generalize the results to all developers of open-source projects.

7 Conclusion and Future Work

Code comments are valuable for documenting and explaining source code, and low-quality comments negatively affect source code comprehension and deteriorate software quality. In this study, we present a taxonomy of inline code comment smells consisting of 11 categories. Our manual analysis of eight open-source Java and Python projects revealed that inline comment smells exist in these projects and that the majority of them are obvious comments. We also published the manually labeled smell dataset online. Moreover, we conducted a survey with software practitioners to get their opinions on inline comment smells and their effects on program comprehension and maintainability. The majority of respondents agreed with the smells defined in our taxonomy, except for task comments, and they found misleading comments to be the smell type that most negatively affects code comprehension and maintainability. We also submitted 53 pull requests and/or issues to remove comment smells from open-source projects in order to better understand their significance for developers, achieving an acceptance rate of 27%. Despite the indication that removing comment smells is beneficial, we received a low acceptance rate, as developers of open-source projects tend to reject pull requests that do not fix issues and oppose changing working source code.

There are several implications of this study for software practitioners, researchers, and educators. First, researchers can build automated models that recognize comment smells in a project using our published dataset. Second, our taxonomy can be used as a baseline to be extended by researchers. Third, practitioners can include our smell taxonomy in their projects’ contribution guidelines on code comments. Lastly, educators can raise awareness of code comprehension by giving lectures on our smell taxonomy.

In future work, we plan to develop a model for automatically detecting smell types by utilizing natural language processing and machine learning techniques. Another challenge we would like to undertake is to investigate the “no comment on non-obvious code” smell type and to develop algorithms that automatically find occurrences of that smell.