Toward Effective Secure Code Reviews: An Empirical Study of Security-Related Coding Weaknesses

Identifying security issues early is encouraged to reduce the latent negative impacts on software systems. Code review is a widely-used method that allows developers to manually inspect modified code, catching security issues during a software development cycle. However, existing code review studies often focus on known vulnerabilities, neglecting coding weaknesses, which can introduce real-world security issues that are more visible through code review. The practices of code reviews in identifying such coding weaknesses are not yet fully investigated. To better understand this, we conducted an empirical case study in two large open-source projects, OpenSSL and PHP. Based on 135,560 code review comments, we found that reviewers raised security concerns in 35 out of 40 coding weakness categories. Surprisingly, some coding weaknesses related to past vulnerabilities, such as memory errors and resource management, were discussed less often than the vulnerabilities. Developers attempted to address raised security concerns in many cases (39%-41%), but a substantial portion was merely acknowledged (30%-36%), and some went unfixed due to disagreements about solutions (18%-20%). This highlights that coding weaknesses can slip through code review even when identified. Our findings suggest that reviewers can identify various coding weaknesses leading to security issues during code reviews. However, these results also reveal shortcomings in current code review practices, indicating the need for more effective mechanisms or support for increasing awareness of security issue management in code reviews.


Introduction
Software security is an important focus in software development processes because it encompasses how a software system withstands external threats (McGraw, 2004). Managing security issues in software products is crucial because latent security issues, especially exploitable vulnerabilities, can exponentially impact end-users and require more resources to resolve if discovered at a later stage. To mitigate security issues, developers are encouraged by the ongoing shift-left concept (Migues, 2021; Weir et al., 2022) to test new software as early as possible.
In the spirit of shifting left, numerous organizations have adopted modern code review, a software quality assurance activity for identifying and removing software defects early in the development lifecycle (Bosu, 2013; Rigby et al., 2012). Prior studies reported that code review is a potential approach for identifying and eliminating security issues at the early stage (Hein and Saiedian, 2009; Bosu et al., 2014; Assal and Chiasson, 2018). In particular, a study by Di Biase et al. (2016) observed that code review could identify well-known security issues such as Cross-Site Scripting (XSS).
Several studies have investigated the benefits of code reviews in identifying security issues (Alfadel et al., 2023; Bosu et al., 2014; Di Biase et al., 2016; Edmundson et al., 2013; Paul et al., 2021b). Still, the security issues studied by previous works were typically bounded by the types of well-known vulnerabilities such as SQL Injection and XSS. In particular, the majority of studied security issues are limited to the vulnerabilities that attackers can exploit. Since code review focuses on identifying and mitigating coding issues, we hypothesize that coding weaknesses, or faults in code, that can potentially lead to security issues may also be found and mitigated during the code review process. Moreover, coding weaknesses should fit the capability of typical reviewers, who may have limited security awareness and knowledge (Braz et al., 2022), because reviewers need a substantial understanding of security in order to identify security issues during code review (Braz and Bacchelli, 2022). Our preliminary analyses (Section 5) indicate that simple coding weaknesses such as numeric errors, insufficient input validation, or business logic errors are more frequently discussed by reviewers than the security issues that prior work regularly studied in code reviews. However, the practices of code reviews in identifying such coding weaknesses are not yet fully investigated. This includes the types of coding weaknesses that lead to security issues and the handling process of these coding weaknesses. In addition, little is known about whether the security concerns raised during code reviews are aligned with the vulnerabilities that a system may have had in the past.
Exploring these aspects would help us better understand the unrealized benefits of considering coding weaknesses during code reviews for the early prevention of software security issues. Such insight can also reveal the gaps between the current code review practice and the vulnerabilities that were known in the respective systems. On one hand, software teams could develop secure code review policies that enable them to more effectively identify and address security concerns during the code review process (Mäntylä and Lassenius, 2009). On the other hand, researchers can broaden the perspective of code review studies and understand the shortcomings in code review practices and tools.
In this work, we aim to investigate the coding weaknesses that were raised during code review and how the code review comments that mentioned coding weaknesses were handled. We conducted our case study on OpenSSL and PHP, which are large open-source systems that are prone to security issues. We chose to examine these phenomena in open-source projects due to the availability of publicly accessible datasets, the mandatory code review policy, and the past vulnerabilities of the selected projects. This decision also stems from the observation that code review outcomes in open-source communities, such as the ratio between functional defects and maintainability defects identified by reviewers, do not significantly differ from those observed in industry settings (Mäntylä and Lassenius, 2009; Beller et al., 2014). To confirm our presumption that the discussion related to coding weaknesses is more prevalent than that of explicit vulnerabilities, we conducted an initial analysis by manually annotating 400 randomly sampled code review comments from each studied project. We found that coding weaknesses could be raised in the code reviews 21-33.5 times more often than explicit vulnerabilities.
Therefore, we conducted an empirical study to address three research questions: (RQ1) What kinds of security concerns related to coding weaknesses are often raised in code review?, (RQ2) How aligned are the raised security concerns and known vulnerabilities?, and (RQ3) How are security concerns handled in code review? To do so, we applied a semi-automated approach to 135,560 code review comments to identify code review comments that are related to coding weaknesses. Then, we manually annotated the types of coding weaknesses for 6,146 code review comments that are related to coding weaknesses. We used the taxonomy of the Common Weakness Enumeration view CWE-699, which covers 40 categories of coding weaknesses that are related to security issues. In addition, we analyzed 378 Common Vulnerabilities and Exposures (CVE) reports (101 from OpenSSL; 277 from PHP) to investigate whether the coding weaknesses raised during the code reviews aligned with the known vulnerabilities in the systems. To understand how coding weakness comments were handled during code reviews, we performed qualitative analysis to identify the handling scenarios based on the code review activities (e.g., review discussion, revisions) of the corresponding code changes.
The case study results show that coding weaknesses related to 35 out of the 40 categories in CWE-699 were raised during the code review process of OpenSSL and PHP (RQ1). For example, comments about coding weaknesses in authentication, privilege, and API were frequently raised in both studied projects. Each studied project also has unique coding weaknesses raised during code reviews, e.g., the direct security threats in OpenSSL and input data validation in PHP. These results indicate that various coding weaknesses that link to security issues were raised during the code review process, and that different software projects focus on different coding weaknesses. Known vulnerabilities in the studied projects are related to 16 weakness categories (RQ2). However, coding weaknesses related to memory buffer errors and resource management errors are the least frequently discussed coding weaknesses in OpenSSL and PHP (4%-9%), despite the high percentages of known vulnerabilities (17%-29%). These results suggest that some important coding weaknesses in a project may not be sufficiently discussed in the current code review practice.
Coding weaknesses raised during the code reviews were handled in four ways (RQ3). In many cases (39%-41%), developers attempted to solve the issues. Nevertheless, approximately a third (30%-36%) of the raised coding weaknesses were only acknowledged without immediate fixes (i.e., no additional modifications to the code changes in the reviews). We observed that some of the acknowledged concerns were agreed to be fixed in new separate code changes (10%-18%) and some were left without fixing due to disagreement about the proper solution (18%-20%). A relatively small proportion of the concerns raised (14%-26%) were clarified and dismissed through discussion. Across all scenarios, we found alarming cases (6%-9%) where security issues can be introduced in the code because code changes with unresolved discussion were eventually merged. Additionally, the abandoned concerns (3%-9%) and the unsuccessfully fixed concerns (2%-4%) are also important because they can negatively affect the developer's contribution (Gerosa et al., 2021). These scenarios indicate that security concerns from coding weaknesses need a better handling process.
Based on the findings, we recommend that software projects consider coding weakness categories (i.e., CWE-699) as a guideline for identifying coding weaknesses that can introduce security issues in code reviews. Coding weaknesses can be prioritized based on the significance, proneness, and unique set of coding weaknesses raised in past code reviews. Our work also highlights a shortcoming of the code review process in handling security concerns (i.e., unsuccessfully fixed, unresolved discussion, and unresponded), which might require future work to address.
Novelty & Contributions: To the best of our knowledge, this paper is the first to empirically investigate code review in identifying and mitigating coding weaknesses that link to security issues in addition to the well-known vulnerabilities, highlighting the potential benefits of code reviews for early prevention of potential security issues. Second, we presented a novel semi-automated approach that leverages a domain-specific pre-trained word embedding model to find potential code review comments related to security issues. Third, we examined the alignment between the known vulnerabilities in the studied systems and the coding weaknesses that were often raised during the code review process, highlighting a shortcoming of the current code review practices that some important coding weaknesses may not be sufficiently discussed. Finally, we investigated the handling process of the coding weaknesses raised during a code review, which sheds light on the issue that a coding weakness can slip through the code review process and potentially become a security issue in the future.

Data Availability:
We have released supplementary material containing the scripts for data retrieval and data analysis used in this study, along with the annotated data, to facilitate further research.
Paper Organization: Section 2 describes the background and explores the related work. Section 3 demonstrates the examples of vulnerabilities that were caused by coding weaknesses. Section 4 explains the case study design. Section 5 explains the initial analysis method and result. Section 6 reports the results. Section 7 discusses the implications and suggestions. Section 8 clarifies the threats that may affect the validity of this study. Finally, Section 9 draws the conclusion.

Background and Related Work
In this section, we provide background on software security and the modern code review process. We then discuss the current practice and remaining challenges in code review for identifying security issues.

Software Security
Software security represents the concepts that help software products survive and operate normally under harmful attacks (McGraw, 2004). Software security issues represent a variety of problems related to software security. Software vulnerability, a type of security issue, is a ".. security flaw, glitch, or weakness found in software code that could be exploited by an attacker .." (Dempsey et al., 2020). In particular, Munaiah et al. (2017) explained that the flaws or faults in software can result in "the violation of the system's security policies". The same study also suggested that these kinds of faults should be prioritized and eliminated as early as possible, as they could have considerable impacts in the later stages.

Coding Weaknesses
Software security issues can be caused by a chain of common software development faults (Hoole et al., 2016), so-called coding weaknesses, such as improper data validation, miscalculation, or incorrect memory allocation. Bojanova et al. (2016) introduced the Bugs Framework that defines the relationship between the causes (coding weaknesses) and the consequences of various software security issues. Each security issue can originate from several causes, be influenced by different attributes, and result in various consequences that could be exploited by an attacker. Bojanova and Galhardo (2023) elaborated the Bugs Framework by showing several real-world examples of how a sequence of coding weaknesses can cause a security issue. To give an example, CVE-2021-21834 presents a vulnerability in an MPEG-4 library. An attacker can exploit this vulnerability by crafting a special input that triggers the buffer overflow and causes a denial-of-service error. Although the buffer overflow is part of the vulnerable consequence, the actual cause of the vulnerability is a coding weakness, i.e., improper input validation. Similarly, Tsipenyuk et al. (2005) also identified seven groups of faults in source code that can compromise software security. For example, the Path Manipulation error and the Illegal Pointer Value error belong to the group Input Validation and Representation, which can lead to security issues such as cross-site scripting (XSS) or SQL Injection.

Security Shift-Left
The Shift-Left concept was introduced by Smith (2001) to motivate practitioners to test the developed software products as early, and from as many aspects, as possible in order to diminish the potential defects that may introduce unexpected consequences in later stages. Carter (2017) suggested security shift-left, where security awareness should be incorporated into the software development lifecycle at the early stage, allowing the incubation of synergies between security experts and software developers. The early detection of security issues also means that less cost and effort are required to fix them (McConnell, 2004), and that the impact on end-users is reduced.
In practice, practitioners can use various approaches to perform security shift-left. From the organizational perspective, Weir et al. (2022) provided initial guidelines for companies that want to shift security to the left. At the management level, companies can establish security governance, create a centralized software security group that consists of security specialists, and allocate a security satellite team to support security activities across the organization. At the project level, the project team should lead the technical aspect of security matters and request security support from the satellite team when the need arises.

Modern Code Review
Lightweight code review has been increasingly adopted in the software industry as a replacement for traditional code inspection activity (Rigby et al., 2012). A developer who composes the code changes (the so-called Author) submits the changes to the code review system. Then other developers in the project (the so-called Reviewers) can freely review the changes and provide feedback. Although code review is used for multiple purposes such as knowledge transfer or sharing of code ownership, the main objective of the activity remains to manage software defects (Bacchelli and Bird, 2013; Rigby and Bird, 2013; Thongtanunam et al., 2015). Balachandran (2013) presented various factors that affect code reviews, such as the experience and available effort of reviewers. Similarly, Bacchelli and Bird (2013) reported that different levels of source code understanding among code developers affect the quality of code review feedback. In addition, the outcomes of code review were also discussed. Mäntylä and Lassenius (2009) identified two defect classes ((1) functionality and (2) evolvability) that can be discovered during code reviews of nine industrial software projects. The study of Beller et al. (2014) confirmed a similar notion based on the ConQAT and GROMACS open-source projects. In addition to these two main types of defects, Thongtanunam et al. (2015) found that code developers may also raise concerns related to traceability (i.e., the bookkeeping of code changes). Nevertheless, few studies have delved into non-functional yet important defects like security issues.

Code Review for Software Security
During the code review process, reviewers may follow the security code review practice (Howard, 2006) to detect various potential security issues in the code. Prior studies have investigated the well-known security issues such as overflow, cross-site scripting, SQL injection, and cross-site request forgery that can be identified during the code review process (Paul et al., 2021b; Bosu et al., 2014; Bosu and Carver, 2013; Edmundson et al., 2013; Alfadel et al., 2023; Di Biase et al., 2016). Edmundson et al. (2013) conducted a controlled study by asking developers to identify the injected vulnerable code in a web application called Anchor CMS. Paul et al. (2021b) and Di Biase et al. (2016) examined code review comments in Chromium projects. A recent study by Alfadel et al. (2023) examined code review comments in NodeJS packages. In brief, they reported similar results that developers can identify certain security issues such as cross-site scripting (XSS), but the effectiveness of security issue identification in code reviews is relatively low.
It can be seen that the types of security issues studied in prior works (Paul et al., 2021b; Bosu et al., 2014; Bosu and Carver, 2013; Edmundson et al., 2013; Alfadel et al., 2023; Di Biase et al., 2016) are still limited to the aspects of consequences from the external threats, i.e., how a threat agent (or an attacker) can exploit the system (e.g., cross-site scripting and SQL injection). To be able to identify such external threats during code reviews, reviewers are required to possess a deep understanding of security knowledge (Braz and Bacchelli, 2022) and endure high cognitive load (Gonçalves et al., 2022).
Rather than focusing on well-known security issues that are limited to the aspects of consequences from external threats or the visible vulnerabilities, little work has investigated coding weaknesses that potentially introduce security issues and vulnerabilities. Since code reviews focus on identifying coding issues in source code, identifying coding weaknesses that can lead to various exploitable vulnerabilities would be more aligned with code review practices. Therefore, investigating the types of coding weaknesses will help us better understand the benefits and capabilities of code reviews for software security, as well as highlight any deficiencies in current discussions on coding weaknesses within code reviews. Table 1 shows the key differences (i.e., studied taxonomy and methodology) between this work and the related secure code review studies.

Security Concern Handling Process in Code Review
To better understand the effectiveness of code reviews, prior studies have investigated how raised concerns were addressed during code reviews. Assal and Chiasson (2018) interviewed developers to understand their perception of security throughout the software development lifecycle. While the results show that security can be seen as a formal checkpoint in code reviews, the results did not elaborate on how the developers would respond to the security concerns raised by reviewers. A study by Kononenko et al. (2016) reported that security concerns are one of the aspects that developers consider before making changes. In particular, Han et al. (2021) investigated code-smell-related comments (e.g., violation of coding conventions) in code reviews and reported that 6% of comments were ignored by developers. Similarly, Beller et al. (2014) found that 7%-35% of the code review comments were neglected by the developers in general.
Although several studies have investigated the aspects of the code review process, little work has studied how developers respond to concerns about coding weaknesses, given that they are related to potential security issues, during code reviews. Referring to Section 2.2, it becomes apparent that coding weaknesses can overshadow both functionality and evolvability defects that reviewers can identify during code reviews (Mäntylä and Lassenius, 2009). An extended understanding of how developers respond to such coding weaknesses could shed light on current secure code review practices and their challenges. Such an understanding would help practitioners develop secure code review policies that address current challenges, allowing the team to execute better secure code reviews and prevent security issues in software products.

Motivating Examples
In this section, we provide motivating examples of security issues that are related to coding weaknesses which can potentially be identified by code reviews. We obtained three examples from the Common Vulnerabilities and Exposures (CVE) reports for our studied systems (i.e., OpenSSL and PHP). A CVE report provides a description of the known vulnerabilities and related information including corresponding code changes (or patches), severity score, or related weaknesses that are assigned by the National Vulnerability Database (NVD) security analysts.
Example 1: Heartbleed (CVE-2014-0160) in OpenSSL is a data leak vulnerability that occurred due to improper input validation (CWE-20). Heartbleed is one of the famous security incidents in OpenSSL that affected a large number of servers between 2012 and 2014. An OpenSSL client can send a Heartbeat message to monitor the availability of a server. The server should return the received message to the client. However, in the vulnerable version (i.e., the version with the improper input validation), OpenSSL responds with data of any length that the client specifies. Hence, sensitive information on the server can be obtained by the client. The coding weakness that caused such unexpected behavior is the improper input validation of the length parameter write length. As shown in the fixing patch of the Heartbleed issue (see Figure 1), the vulnerable code was fixed by ensuring that the entered length is valid. A study by Durumeric et al. (2014) suggested that if the reviewers had identified the improper input validation in the vulnerable version of OpenSSL, this vulnerability could have been prevented early.
Example 2: The Integer Overflow or Wraparound weakness can trigger heap memory corruption in a vulnerable version of OpenSSL, leading to a denial-of-service error (CVE-2016-2106). The cause of this vulnerability is the integer overflow (CWE-190). The Integer Overflow or Wraparound weakness is a numeric error where an integer value becomes larger than the maximum size of the associated data type, forcing the system to mistakenly wrap the value around and causing the denial-of-service error. It can be seen in Figure 2 that the original condition (i + inl < b) is prone to the integer overflow because the output of the left-hand-side operation can exceed the range of the implicit variable type. This particular weakness was fixed by a patch that adjusted the expression by subtracting two integers, instead of adding them. Hence, in this case, the denial-of-service vulnerability from heap memory corruption was caused by the integer overflow or wraparound weakness.
Example 3: The third example is a denial-of-service vulnerability in PHP that was caused by a weakness of type Allocation of Resources Without Limits or Throttling (CWE-770), which can also be considered a Business Logic Errors (CWE-840) weakness. The vulnerable version of PHP can make the system unresponsive to requests when the attacker enters a lengthy input into a function that executes multiple for-loops. As seen in Figure 3, one of the loop variables (i) is incremented without being checked in the control condition. It was fixed by adding the missing exit condition to the loop when the counter reached the maximum size.
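The first two examples can be sketched in a few lines of code. The following Python sketch is illustrative only and is not the OpenSSL code: the function names (heartbeat_response, fits_unsafe, fits_safe), the simplified 3-byte message header, and the 32-bit unsigned width are all assumptions made for this illustration. Because Python integers do not overflow, the wraparound of a fixed-width integer is simulated with a bit mask.

```python
from typing import Optional

UINT32_MASK = 0xFFFFFFFF  # assumed integer width, chosen only for illustration


def heartbeat_response(record: bytes) -> Optional[bytes]:
    """Echo a heartbeat payload only if the claimed length fits the record.

    Hypothetical, simplified layout: 1 type byte, 2 big-endian length bytes,
    followed by the payload.
    """
    if len(record) < 3:
        return None  # too short to contain the header
    claimed = int.from_bytes(record[1:3], "big")
    # Input validation (CWE-20): never trust the length claimed by the sender.
    if 3 + claimed > len(record):
        return None
    return record[3:3 + claimed]


def fits_unsafe(i: int, inl: int, b: int) -> bool:
    """Mimic `i + inl < b` when the sum wraps around at 32 bits (CWE-190)."""
    return ((i + inl) & UINT32_MASK) < b


def fits_safe(i: int, inl: int, b: int) -> bool:
    """Rearranged check `inl < b - i`; assumes 0 <= i <= b, so b - i cannot wrap."""
    return inl < b - i


if __name__ == "__main__":
    # A message that claims 65535 payload bytes but carries only 3 is rejected.
    print(heartbeat_response(b"\x01\xff\xffabc"))
    # A huge inl wraps the sum to a small value, so the unsafe check passes.
    print(fits_unsafe(8, 0xFFFFFFFC, 16), fits_safe(8, 0xFFFFFFFC, 16))
```

In both cases the weakness is visible in a single condition, which is the kind of local pattern a reviewer can plausibly spot without reasoning about a full attack scenario.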
These examples demonstrate that coding weaknesses can contribute to security issues. Since code reviews focus on identifying coding issues in source code, identifying coding weaknesses that can lead to various exploitable vulnerabilities and security issues would be beneficial to code review practices. Refer to Figure 4 for a real-world example of a coding weakness i.e., exposing the internal values in er-
However, little has been investigated on how often the reviewers can identify coding weaknesses that link to security issues during the code reviews, what kinds of security concerns are raised, and how they are being handled or responded to. This insight would help practitioners improve their code review practice and equip developers with a secure code review mindset that is more compatible with their technical expertise.

Case Study Design
In this section, we outline our study by describing the research questions, the selection processes of studied subjects and coding weakness taxonomy, the data collection method, and our analysis approach to answer our research questions.
As comments related to security issues are sparse in code reviews (Di Biase et al., 2016), it can be challenging to synthesize an insightful result from diverse data sources. To overcome this problem, we followed the case study method (Perry et al., 2004) to explore the real-world security concerns in selected software projects that are prone to security issues. We chose OpenSSL and PHP for our case study based on the known vulnerabilities in the past and the code review activities.

Research Questions
To understand the potential benefits of considering coding weaknesses during code reviews for early prevention of software security issues, we formulate the following research questions.
RQ1: What kinds of security concerns related to coding weaknesses are often raised in code review?
Motivation: Code review is an approach that many projects use to identify and eliminate defects early before integrating the new code into the codebase. However, Braz and Bacchelli (2022) raised a concern that identifying security issues during the code review process can be challenging for developers because of the lack of security knowledge and awareness. As Bojanova and Galhardo (2023) showed that a chain of coding weaknesses can be the root cause of a security issue, we hypothesize that code reviews could identify such coding weaknesses, which have simpler coding patterns and require less security knowledge. Prior studies have yet to investigate the security concerns that reviewers could raise from the coding weakness perspective. Since coding weaknesses are more visible to developers than security issues, the understanding of these security concerns may provide guidance to improve security issue identification in code reviews.
RQ2: How aligned are the raised security concerns and known vulnerabilities?
Motivation: While RQ1 helps us to better understand what kinds of coding weaknesses can be raised during the code review process, it is still unclear whether current code review practices have focused on coding weaknesses related to the real vulnerabilities that were known in the past. Thus, we set out RQ2 to examine the alignment of vulnerabilities that the systems had and the raised coding weakness comments. The findings will highlight the types of coding weaknesses that may not be sufficiently discussed in the code reviews. This understanding could increase the reviewer's awareness of the less frequently identified coding weaknesses.
RQ3: How are security concerns handled in code review?
Motivation: Developers can respond to the raised security concerns (i.e., coding weaknesses raised during code reviews) in various ways in order to address the reviewer's comments and get the code accepted. Kononenko et al. (2016) found that developers consider security concerns in reviewers' comments before modifying the code. In contrast, Lenarduzzi et al. (2021) reported that security defects do not influence the acceptance decision of the proposed code changes. However, little is known when it comes to the security concerns from coding weakness comments. An extended understanding of how developers respond to these security concerns could shed more light on the remaining challenges of current secure code review practice. Hence, we set out to explore how the developers handle security concerns from coding weaknesses.

Studied Projects
We aimed to conduct a case study of the code review process of software systems that are prone to security issues. Therefore, to select suitable projects, we considered the following criteria:
1. Use C or C++ as the major programming languages: unlike other programming languages, users of the C and C++ programming languages can access and manipulate lower-level environments (Turner, 2014) that are susceptible to security issues.
2. Actively performing code reviews: quality assurance practices such as peer code reviews can improve the quality of code and reduce defects in the code base (Bacchelli and Bird, 2013; Beller et al., 2014).
3. An accessible code review history: to enable the data extraction, the complete code review data must be publicly accessible.
To find subjects prone to security issues, we obtained a list of C and C++ software systems that have records of publicly reported vulnerabilities from the works of Hazimeh et al. (2020) and Lipp et al. (2022). We obtained nine software projects: Libpng, Libtiff, Libxml2, OpenSSL, PHP, Poppler, SQLite3, Binutils, and FFmpeg.
We first checked if the code review history of the project is publicly available as it is essential for our analysis. We also carefully checked whether the project regularly performs code reviews for every new code change by examining the project's code repository (e.g., GitHub and GitLab), documents on the project's websites, and code review history in other sources (e.g., mailing list). We found that OpenSSL and PHP established a public contribution policy that requires the developers to create GitHub pull requests and address any reviewer's comments for all public code submissions; Libpng, Libtiff, and Libxml2 have relatively small code review histories (less than 350 proposed code changes that received code review comments); SQLite3 employs a private code review process; Binutils and FFmpeg perform code review on mailing lists, which complicates the process of extracting code review information; and Poppler has a large amount of code integration without reviewers' comments.
Therefore, OpenSSL and PHP were the remaining projects that met our criteria. OpenSSL is a popular encryption library for secure communication over the Internet, and PHP is a widely used web scripting language. In terms of open-source project characteristics, OpenSSL receives more than 13k pull requests from nearly 900 active developers and gains over 23.8k stars on GitHub, while PHP receives over 10k pull requests from over 900 active developers and gains approximately 37k stars on the same platform. In addition to a remarkable number of active developers, both projects are important to the software development communities because numerous software systems rely on or implement them. As of July 2022, the number of CVE vulnerabilities reported was 215 and 662 for OpenSSL and PHP, respectively. Both projects regularly perform code reviews on GitHub, where new proposed code changes are submitted as pull requests.

Data Collection
As we aimed to identify the types of coding weaknesses raised by reviewers and the handling process of these concerns, we needed to analyze the code review history, especially the review discussions. Therefore, we first downloaded the code review histories of the studied projects using the GitHub REST API. We downloaded the pull requests and all comments on the pull requests. We accessed and retrieved all the code review historical data of both projects at the end of June 2022. The earliest pull requests that we downloaded from OpenSSL and PHP were created in September 2013 and July 2011, respectively.
Then, we selected pull requests that (1) are closed, (2) are proposed to upstream, i.e., the main branch, (3) comprise at least one C or C++ file, and (4) have received at least one comment from a human reviewer. It should be noted that we considered the code review comments from (1) the main discussion of a pull request and (2) the code level because reviewers may discuss coding weaknesses at both levels. Table 2 shows the number of downloaded and selected pull requests, as well as the number of comments on the selected pull requests.
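The four selection criteria can be read as a simple filter over the downloaded pull requests. The following is a minimal sketch, assuming each pull request has already been flattened into a plain dictionary; the field names (`state`, `base_branch`, `files`, `comments`, `is_bot`), the default branch name, and the list of C/C++ file extensions are hypothetical and do not mirror the GitHub REST API schema or the scripts used in this study.

```python
# Hypothetical, simplified pull request records; all field names are assumptions.
C_CPP_EXTENSIONS = (".c", ".h", ".cc", ".cpp", ".cxx", ".hpp")


def is_selected(pr, main_branch="master"):
    """Return True if a pull request meets the four selection criteria."""
    closed = pr["state"] == "closed"                      # (1) closed
    upstream = pr["base_branch"] == main_branch           # (2) proposed to the main branch
    touches_c = any(                                      # (3) at least one C or C++ file
        path.lower().endswith(C_CPP_EXTENSIONS) for path in pr["files"]
    )
    human_comment = any(                                  # (4) at least one human comment
        not comment["is_bot"] for comment in pr["comments"]
    )
    return closed and upstream and touches_c and human_comment
```

A pull request that only touches documentation, or whose only comments come from automation, would be rejected by this filter.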

Coding Weakness Taxonomy
Our objective was to classify security concerns in code review comments from the perspective of development flaws rather than from the perspective of an attacker, which has been done in previous work (Paul et al., 2021b; Bosu et al., 2014; Bosu and Carver, 2013; Edmundson et al., 2013; Di Biase et al., 2016). We aimed to adopt a taxonomy that is generally applicable to code reviews in any software system. The selected taxonomy should cover the realistic security concerns of typical reviewers, who may have fundamental software development expertise but limited security knowledge and awareness (Braz and Bacchelli, 2022). Therefore, we selected an existing taxonomy based on the following criteria:
- Covers diverse coding weaknesses that are not restricted to the types of well-known security issues
- Provides detailed descriptions and examples for the ease of the annotation process and the future applicability for practitioners
- Focuses on the business and application logic that can be addressed in code rather than low-level aspects such as network or hardware
- Is not tied to a specific technology, programming language, or platform
We considered four coding weakness taxonomies from industrial guidelines and standards: (1) Common Weakness Enumeration 699 (CWE-699) (CWE, b), (2) Common Weakness Enumeration 1000 (CWE-1000) (CWE, a), (3) OWASP Code Review Guide (Owa), and (4) OWASP Application Security Verification Standard (OWASP ASVS) (OWA). Assessing these four taxonomies against our criteria, we opted to use CWE-699 in our study for the following reasons. CWE-699 covers a large number of coding weaknesses that can be linked to significant security issues. In particular, CWE-699 has 40 categories containing more than 400 weaknesses that may not expose their security implications (hence, do not require deep security knowledge to identify) but can still lead to vulnerabilities and can be introduced during software implementation. In contrast, the other taxonomies have a relatively smaller number of weaknesses (e.g., 6 categories (Croft et al., 2022)). Although CWE-1000 has a higher number of weaknesses, it includes weaknesses in other layers, e.g., hardware or platform, that can be outside of the code review context. CWE-699 provides an extended description of each of its weaknesses, while the OWASP Code Review Guide, which provides a checklist of nine critical vulnerabilities, does not provide a clear description of the suggested vulnerabilities that reviewers can utilize. As CWE-699 focuses on software implementation, its weaknesses are to some extent generalizable to any technology. On the other hand, OWASP ASVS, which provides 14 categories of security requirements, is more specific to web applications. The simplified definitions of security weaknesses are shown in Table 3.

Study Overview
Figure 5 shows an overview of our study. To explore the coding weaknesses raised by the reviewers (RQ1), we use a semi-automated approach to identify comments that raised coding weaknesses in a pull request. In particular, we apply text analysis techniques to rank comments by their semantic similarity with the descriptions of coding weaknesses in the CWE-699 taxonomy. Then, we manually validate and annotate the types of security weaknesses based on the CWE-699 taxonomy. The security concerns found in RQ1 are further qualitatively analyzed for RQ2 and RQ3, along with their related pull request information and known vulnerabilities. For the alignment assessment with known vulnerabilities (RQ2), we compare the vulnerabilities of the studied systems that were reported in the past against the security concerns raised during the code reviews. For RQ3, we qualitatively analyze the related code review comments and corresponding code changes within the pull requests to investigate the handling scenarios of the raised security concerns.
The semi-automated approach (RQ1) provided dual benefits. Firstly, it addressed the challenges in identifying relevant code review comments, overcoming limitations faced by previous studies (Bosu et al., 2014; Paul et al., 2021b; Alfadel et al., 2023) that could only identify a limited set of issues in code review comments by using the keyword-based approach. Thus, it enabled the discovery of more comments related to coding weaknesses. Secondly, it mitigated the manual effort required for validating comments associated with each type of coding weakness. However, manual validation and annotation remain essential to eliminate false positives beyond the capacity of the automated process.
Fig. 5: Overall case study workflow
We explain our study approaches in the following sections: Section 4.6.1 describes the security concern identification approach for RQ1, Section 4.6.2 describes the handling scenarios identification approach for RQ3, and Section 4.7 describes the known vulnerabilities alignment analysis approach for RQ2.

Security Concern Identification Approach (RQ1)
Due to the large number of code review comments (i.e., 135K comments in our dataset; see Table 2), it is not feasible for us to manually identify security comments related to coding weaknesses (so-called security concerns). We opted to use an automated approach to support our security concern identification process. While prior works (Bosu et al., 2014; Paul et al., 2021b; Alfadel et al., 2023) used the keyword-based approach to identify the comments that mentioned vulnerabilities, their pre-defined keyword lists are limited to specific types of security issues and do not cover all categories in CWE-699. Indeed, the keyword-based approach does not find code review comments that contain coding weaknesses absent from the keyword lists. Therefore, we employed a semi-automated approach that combines a textual analysis technique, which uses a pre-trained word embedding model to rank comments that are semantically related to coding weaknesses, with human evaluation.
Our approach comprises two steps: 1) calculating semantic similarity with the descriptions of coding weaknesses and 2) manually validating and annotating the types of security weaknesses based on the CWE-699 taxonomy. We now provide the details of each step below.

Semantic Similarity Calculation
Our underlying intuition is that code review comments that are semantically similar to the descriptions of the coding weaknesses could be considered as coding weakness security comments. Therefore, we use the cosine similarity score to measure how similar a code review comment is to a coding weakness. Cosine similarity is a widely used technique for document similarity calculation because it is insensitive to the length of documents. The cosine similarity score can be calculated using the following formula.

$$\text{Comment-Weakness Cosine Similarity} = \frac{C \cdot W}{\lVert C \rVert \, \lVert W \rVert}$$
where C and W are the vectors of the code review comment and the description of a coding weakness, respectively. Prior to creating the vectors, we prepare the comments and the description of each of the 40 categories in CWE-699 by removing hyperlinks, stopwords, numbers, and non-alphanumeric characters. We also apply SnowballStemmer (Sno) to obtain the common form of each word. Then, we generate the vector of each code review comment and 40 vectors of the descriptions of coding weaknesses by using the Gensim library (Rehurek and Sojka, 2011) and calculate the similarity scores. Therefore, each code review comment receives 40 similarity scores. A high similarity score for a particular category indicates that the comment is more likely related to that category.
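To make the scoring step concrete, the sketch below preprocesses two texts and computes the cosine similarity of their term-count vectors. It is an illustration under simplifying assumptions rather than the pipeline used in this study: it relies only on the Python standard library instead of Gensim, it omits stemming, and the stopword list is a tiny hypothetical one.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the study).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "that", "this", "it"}


def preprocess(text):
    """Lowercase, then drop hyperlinks, numbers, non-alphanumerics, and stopwords."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)  # keeps alphabetic runs only
    return [t for t in tokens if t not in STOPWORDS]


def cosine_similarity(c_tokens, w_tokens):
    """Cosine of the angle between the term-count vectors C and W."""
    c, w = Counter(c_tokens), Counter(w_tokens)
    dot = sum(c[t] * w[t] for t in c.keys() & w.keys())
    norm_c = math.sqrt(sum(v * v for v in c.values()))
    norm_w = math.sqrt(sum(v * v for v in w.values()))
    return dot / (norm_c * norm_w) if norm_c and norm_w else 0.0
```

Because the score is normalized by both vector lengths, repeating a comment's text does not change its similarity to a weakness description, which is the length-insensitivity property mentioned above.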
The results of the cosine similarity calculation strongly depend on the vector representation of the documents. In order to select the vector representation that yields the best results, we explore two vector representation techniques.
- Term Frequency - Inverse Document Frequency (TF-IDF): We measure the cosine similarity between each code review comment and the full description of each of the CWE-699 weakness categories. We use TF-IDF vectors (Tata and Patel, 2007) to represent the code review comment and the description of the weakness. A TF-IDF vector represents the significance of every word in a document (i.e., a code review comment or a weakness description). The Term Frequency (TF) is the number of times that a word appears in a document over the total number of words in the document, and the Inverse Document Frequency (IDF) is the logarithm of the ratio between the number of all documents and the number of documents that contain the word. The TF-IDF score for each word is the product of the TF value and the IDF value. The TF-IDF vector contains the TF-IDF scores of all words in a document.
- Word Embedding: Developers may use interchangeable terms during code review which may differ from the wording of the weakness description. For example, login may refer to authenticate in some contexts. Conventional vector representations such as TF-IDF cannot capture such related meanings of a word. A soft cosine similarity technique that incorporates word embedding models with cosine similarity (Ye et al., 2016; Sidorov et al., 2014) can mitigate this limitation. Instead of using TF-IDF vectors, we use word embedding models to generate vectors that represent the code review comment and the full description of weaknesses. We explore a pre-trained word embedding model in the software engineering domain, namely SO Word2Vec (Efstathiou et al., 2018). It should also be noted that we check whether each word is included in the word embedding model before the stemming process. This addresses the out-of-vocabulary (OOV) problem: words that are not in the word embedding model are stemmed and retained for the similarity score calculation.
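The two representations can be contrasted with a small sketch. The `tf_idf` function follows the TF and IDF definitions given above, and `soft_cosine` follows the general soft cosine idea of weighting term pairs by a term-to-term similarity. The similarity value between login and authenticate is a made-up placeholder; in practice such values would be derived from a word embedding model, which is not reproduced here.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """TF-IDF weights of one document; `corpus` is a list of token lists."""
    counts = Counter(doc_tokens)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc)  # documents containing the term
        if df:  # terms unseen in the corpus get no weight
            weights[term] = (count / len(doc_tokens)) * math.log(len(corpus) / df)
    return weights


def soft_cosine(a, b, sim):
    """Soft cosine between sparse vectors `a` and `b` (dicts of term -> weight).

    `sim` maps unordered term pairs to a similarity in [0, 1]; identical terms
    always have similarity 1, and unlisted pairs have similarity 0.
    """
    def s(t, u):
        return 1.0 if t == u else sim.get((t, u), sim.get((u, t), 0.0))

    def inner(x, y):
        return sum(x[t] * y[u] * s(t, u) for t in x for u in y)

    denom = math.sqrt(inner(a, a)) * math.sqrt(inner(b, b))
    return inner(a, b) / denom if denom else 0.0


# Hypothetical term similarity, standing in for one derived from embeddings.
SIMILARITY = {("login", "authenticate"): 0.8}
comment, weakness = {"login": 1.0}, {"authenticate": 1.0}
plain = soft_cosine(comment, weakness, {})          # no shared terms
soft = soft_cosine(comment, weakness, SIMILARITY)   # related terms contribute
```

With an empty similarity table the soft cosine reduces to the ordinary cosine, so the two documents above score zero; once the related pair is listed, they receive a non-zero score even though they share no term.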

Manual Validation & Annotation of Security Concerns
Once we calculated semantic similarity scores for each code review comment against each of the coding weakness descriptions in the CWE-699 taxonomy, we manually validated and annotated the code review comments into the CWE-699 taxonomy. In particular, we manually analyzed the comments from those with the highest similarity scores to those with lower scores. Specifically, for each category, the comments were sorted by similarity score in descending order before we manually annotated the entire body of each comment to determine whether a security concern relevant to the coding weaknesses was raised. Figure 6 illustrates an example of our manual validation and annotation approach.
Fig. 6: An illustration of our manual validation and annotation approach.
We performed manual validation and annotation in two rounds.
First Round-Screening The first round aimed to preliminarily remove comments that are generic or irrelevant to coding weaknesses (e.g., related to bookkeeping and code styling). For each CWE-699 category, we carefully read the comments and determined whether they were related to that category based on its description (see Table 3). Due to the large number of comments, it is not feasible to validate all the comments across the 40 categories (i.e., 40 × 135K in total). Thus, for each category, we validated the comments until reaching the saturation point, i.e., 50 consecutive comments were identified as generic or irrelevant. In total, we validated 9,704 scores of 6,146 code review comments from both projects. Note that one comment can have high similarity scores in multiple categories. For example, Comment#2 has high similarity scores in categories 1 and 2 (see Figure 6). Hence, such a comment is read and validated more than once. The screening process in the first round was done by the first author (30 categories) and a third-year PhD student who is experienced in manual analysis (10 categories).
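The saturation rule can be illustrated as a stopping condition over the ranked comments of one category. In the sketch below, `is_relevant` is a hypothetical callback that stands in for the human judgment made during screening; the code only shows when reading stops, not how relevance is decided.

```python
def screen_until_saturation(ranked_comments, is_relevant, patience=50):
    """Read comments in descending score order until saturation.

    Reading stops after `patience` consecutive comments are judged generic or
    irrelevant. Returns the relevant comments and the number of comments read.
    """
    relevant, consecutive_misses, read = [], 0, 0
    for comment in ranked_comments:
        read += 1
        if is_relevant(comment):
            relevant.append(comment)
            consecutive_misses = 0  # a relevant comment resets the counter
        else:
            consecutive_misses += 1
            if consecutive_misses >= patience:
                break
    return relevant, read
```

The counter resets whenever a relevant comment is found, so an isolated run of irrelevant comments shorter than the threshold does not end the screening for that category.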
Second Round-Validation The second round aimed to carefully identify security concerns from the comments that passed the first round. In particular, we determined whether the comments raised legitimate security concerns that are relevant to the coding weaknesses. Specifically, a comment was determined to contain a legitimate security concern if it was relevant to one of the CWE-699 categories and met at least one of the following two conditions:
- The reviewer purposefully remarked on security consequence(s). For example, the reviewer asked "Are these implementations safe against charset based attacks?", which can be considered to indicate a problem with the neutralization of the input data.
- The reviewer expressed a concern that can potentially point to a security issue, but did not explicitly mention security consequence(s). For example, the reviewer commented that "[...] login which always asks for name/password but returns 'no such user' or 'wrong password' Nowadays they always return 'Bad login'", which may indicate an improper authentication process.
When identifying concerns, it is possible that multiple comments in the same pull request indicated the same security concern. For example, two reviewers suggested that the developer verify a concurrency issue in the code. In such a case, we consider these comments as a single security concern for that pull request. Similarly, a comment can also contain multiple legitimate concerns. For example, a reviewer raised concerns about the predictability of the random number generator algorithm, which can be interpreted as both Random Number Issues and Cryptographic Issues concerns based on the CWE-699 categories. In this case, we considered that this pull request has two concerns.
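The counting rule can be summarized as taking the set of distinct (pull request, category) pairs. A minimal sketch follows, assuming each annotated comment is a dictionary with hypothetical `pr` and `categories` fields; the pull request numbers used in the example are placeholders.

```python
def aggregate_concerns(annotated_comments):
    """Collapse annotated comments into unique (pull request, category) concerns."""
    concerns = set()
    for comment in annotated_comments:
        for category in comment["categories"]:
            concerns.add((comment["pr"], category))
    return concerns


# Placeholder data: two comments on the same issue, one comment with two concerns.
comments = [
    {"pr": 1, "categories": ["Concurrency Issues"]},
    {"pr": 1, "categories": ["Concurrency Issues"]},
    {"pr": 2, "categories": ["Random Number Issues", "Cryptographic Issues"]},
]
concerns = aggregate_concerns(comments)
```

Here the two concurrency comments in the first pull request count once, and the single comment in the second pull request counts twice, giving three concerns in total.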
The second round of annotation was done independently by the first author and the third author, who has ten years of experience in security testing. To ensure a consistent classification in the second round, the first author and the third author first co-classified a small set of comments to establish a common understanding of the annotation task. Both authors then independently classified the remaining comments, and we measured the inter-rater agreement using Cohen's Kappa (Cohen, 1960). It should be noted that the authors also assessed the code changes and the code review conversation to better understand the context when necessary. For comments with disagreement between the two authors, all authors met and discussed how to resolve the conflicts. The code review comments annotated in this process are called security concerns in the later steps. To minimize the chance of missing relevant comments, we also assessed the false negative rate on 400 randomly sampled unseen comments from each project, i.e., comments that were left out after the saturation point in each category was reached. We found very few unseen comments (two in OpenSSL and three in PHP) that can be considered security and weakness-related. In comparison to the ratio of positively identified comments from the annotated data, we considered the false negative rate insignificant.
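For reference, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below implements the standard two-rater formula in plain Python; the label values in the example are invented for illustration and are not taken from the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels: the annotators agree on three of four comments.
annotator_1 = ["concern", "concern", "none", "none"]
annotator_2 = ["concern", "concern", "none", "concern"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the observed agreement is 0.75 and the chance agreement is 0.5, so the resulting Kappa is 0.5.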

Alignment Analysis of Known Vulnerabilities (RQ2)
To answer our RQ2, we further analyzed the security concerns raised during the code reviews obtained from RQ1 against the known vulnerabilities of the studied systems that were reported in the past. To study the past vulnerabilities in our studied projects, we used Common Vulnerabilities and Exposures (CVE), which is the collection of publicly reported vulnerabilities in software systems. We downloaded the CVE entries of OpenSSL and PHP from the CVE mirror database. We collected the CVE entries that were reported within the same timeframe as the analyzed code reviews (2013-2022 in OpenSSL; 2011-2022 in PHP). We downloaded a total of 101 CVEs for OpenSSL and 277 CVEs for PHP. We then excluded the CVEs that were not assigned any CWEs since we would use the assigned CWE numbers to map into the CWE-699 taxonomy. In addition, we excluded the CVEs that have deprecated CWEs. Finally, we studied 81 and 236 CVEs for OpenSSL and PHP, respectively. Table 4 shows the number of CVEs used in this study.
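The CVE selection steps amount to three filters: the reporting timeframe, the presence of at least one assigned CWE, and the absence of deprecated CWEs. The following sketch assumes a hypothetical record layout with `year` and `cwes` fields and uses placeholder identifiers; it does not reflect the format of the CVE mirror database.

```python
def select_cves(cves, start_year, end_year, deprecated_cwes):
    """Keep CVEs in the timeframe that have CWEs, none of them deprecated."""
    selected = []
    for cve in cves:
        in_window = start_year <= cve["year"] <= end_year
        has_cwe = bool(cve["cwes"])
        has_deprecated = any(cwe in deprecated_cwes for cwe in cve["cwes"])
        if in_window and has_cwe and not has_deprecated:
            selected.append(cve)
    return selected


# Placeholder records; the identifiers are illustrative, not real entries.
records = [
    {"id": "A", "year": 2014, "cwes": ["CWE-310"]},
    {"id": "B", "year": 2012, "cwes": ["CWE-119"]},   # outside the timeframe
    {"id": "C", "year": 2015, "cwes": []},            # no assigned CWE
    {"id": "D", "year": 2016, "cwes": ["CWE-OLD"]},   # deprecated CWE
]
kept = select_cves(records, 2013, 2022, deprecated_cwes={"CWE-OLD"})
```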
To examine the alignment between the security concerns raised during the code reviews and the vulnerabilities of the studied systems that were reported in the past, we quantitatively analyzed the frequency of the concerns and the CVE entries across the 40 categories of the CWE-699 taxonomy. While each CVE entry has an assigned CWE number, it may not be directly associated with the CWE-699 categories. A clear mapping between CVE entries and CWE-699 categories is not available because a CVE entry can be assigned any CWE number, which may not be under the CWE-699 categories. Therefore, we used the CWE hierarchical tree to systematically map the CVE entry and its assigned CWE number into the relevant CWE-699 categories. In particular, we considered the CVE entry to be relevant to category A in the CWE-699 categories when
- The assigned CWE has the same CWE as category A; or
- The assigned CWE has a relationship (e.g., PeerOf, CanFollow, and CanAlsoBe) with a CWE of category A; or
- The assigned CWE is the child of a CWE of category A.
To illustrate the CWE mapping process, we provide an example for each condition as follows. For the first condition, CVE-2014-8275 has been assigned CWE-310, which is the category Cryptographic Issues. For the second condition, CVE-2014-3670 is relevant to the category Numeric Errors (CWE-189) because it has been assigned CWE-119, which has the CanFollow relationship with CWE-128, which belongs to category CWE-189. For the third condition, CVE-2017-12932 is considered relevant to the category Pointer Issues (CWE-465) because it has been assigned CWE-416, which is the child of CWE-825, which belongs to category CWE-465.
Note that an assigned CWE can be relevant to multiple CWE-699 categories. In that case, we classify that CVE entry as relevant to every related CWE-699 category. If the assigned CWEs of a CVE cannot be mapped based on the heuristic above, we manually map them into the relevant CWE-699 categories based on the descriptions. In particular, 29 (out of 81 for OpenSSL) and 81 (out of 236 for PHP) CVEs required a manual mapping of CWEs based on the descriptions. Table 5 shows a list of CWEs that we manually mapped into the CWE-699 categories.
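The three mapping conditions can be sketched as a lookup over a small graph. The data below is a toy excerpt built only from the three examples given above (CWE-310, CWE-119 with CWE-128, and CWE-416 with CWE-825); it is not the full CWE hierarchy, and the dictionary layout is an assumption made for illustration.

```python
# Toy excerpt derived from the three examples in the text; not the full CWE data.
CATEGORY_MEMBERS = {
    "CWE-465": {"CWE-825"},   # Pointer Issues
    "CWE-189": {"CWE-128"},   # Numeric Errors
    "CWE-310": set(),         # Cryptographic Issues
}
PARENT_OF = {"CWE-416": "CWE-825"}   # child -> parent
RELATED = {"CWE-119": {"CWE-128"}}   # e.g., CanFollow, PeerOf, CanAlsoBe


def map_to_categories(cwe):
    """Return the CWE-699 categories that an assigned CWE maps to."""
    candidates = {cwe} | RELATED.get(cwe, set())   # conditions 1 and 2
    node = cwe
    while node in PARENT_OF:                       # condition 3: walk up the tree
        node = PARENT_OF[node]
        candidates.add(node)
    return {
        category
        for category, members in CATEGORY_MEMBERS.items()
        if category in candidates or members & candidates
    }
```

An empty result corresponds to the case where the heuristic finds no category, which is where the manual mapping based on descriptions takes over.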

Handling Process Identification (RQ3)
To understand how developers handle security concerns (i.e., code review comments that mention coding weaknesses that could lead to security issues) (RQ3), we employed a lightweight coding method, similar to the approach used by Gousios et al. (2014), to analyze what happens after security concerns are raised. We analyzed the security concerns that were manually identified in RQ1.
Our aim was to identify the handling scenario based on the code review activities that occurred after security concerns were raised. To do so, for each security concern, we read the discussion in the pull request, including the developer responses and the subsequent comments by the same reviewer or other reviewers. In addition, similar to prior work (Rahman et al., 2017), if the security concern pointed to particular lines of code, we checked whether the developer subsequently modified the associated lines of code to address the concern. Then, we summarized the observation for each security concern. Finally, the summarized observations were sorted into groups on the basis of their thematic similarities, and a handling scenario theme was defined for each group.
The manual analysis in RQ3 was done by the first author and the second author. We analyzed the handling scenarios in three steps. In the first step, the first author summarized the discussion in a pull request into a brief note describing the handling of each security concern. In the second step, the first author reviewed the notes and categorized them into distinct groups based on thematic similarities. In the last step, the second author reviewed the groups and discussed any disagreements with the first author. Then, the first author refined the groups according to the mutual agreement. Following refinement, the first author revisited the notes and ensured that they fit the refined groups. We repeated the second and last steps in multiple iterations until no further changes were needed (i.e., no new groups emerged, and all notes remained in their groups). Finally, the second author manually validated 10% of the security concerns to confirm the correctness of the annotated scenarios.

Preliminary Analysis
In this section, we present two preliminary analyses to provide the logical ground for our main case study. The goal of the first preliminary analysis (PA1) is to examine whether reviewers tend to raise coding weaknesses related to security issues more frequently than explicitly discussing the vulnerabilities. The second analysis (PA2) aims to preliminarily evaluate the effectiveness of our semi-automated approach (see Section 4.6.1) to calculate semantic similarity scores for the code review comments that contain coding weaknesses.
Dataset: We conducted the two preliminary analyses based on a sample dataset. We randomly sampled 400 code review comments from each of the studied projects (i.e., OpenSSL and PHP). This sample size should allow us to generalize conclusions with a confidence level of 95% and a confidence interval of 5% (Triola, 2009).
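As a sanity check on the sample size, one common way to derive a minimum sample for estimating a proportion is the large-population formula n = z² p (1 - p) / e². The sketch below applies it with z = 1.96 for a 95% confidence level, the most conservative proportion p = 0.5, and a margin e = 0.05. This is an assumption about how such a number can be obtained; the text cites Triola (2009) without spelling out the calculation.

```python
import math


def minimum_sample_size(z=1.96, p=0.5, margin=0.05):
    """Minimum sample size for estimating a proportion in a large population."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


required = minimum_sample_size()
```

Under these assumptions the formula yields 385 comments, so a sample of 400 per project satisfies the stated confidence level and interval.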

PA1: Prevalence of Coding Weakness Comments
The motivating examples in Section 3 show that coding weaknesses can lead to security issues. Since code review focuses on identifying and mitigating issues in source code (Mäntylä and Lassenius, 2009; Bacchelli and Bird, 2013), code review may be able to identify such coding weaknesses. To confirm this, we assess the degree to which coding weaknesses are discussed in code reviews. In particular, we analyze whether reviewers discussed coding weaknesses more frequently than vulnerabilities.

Approach
From the sampled dataset, we manually classified code review comments into three groups: 1) comments that mentioned a coding weakness, 2) comments that explicitly mentioned a vulnerability, and 3) other comments that are not related to coding weaknesses or vulnerabilities. We consider that a code review comment mentioned a coding weakness when it is related to the coding weaknesses listed in CWE-699. A code review comment is considered as mentioning a vulnerability when it is related to the types of exploitable vulnerabilities obtained from prior studies (Di Biase et al., 2016; Paul et al., 2021b), i.e., Race Condition, Buffer and Integer Overflow, Improper Access, Cross-Site Scripting (XSS) and Cross-Site Request Forgery (CSRF), Denial of Service (DoS) and Crash, Information Leakage, Command and SQL Injection, Format String, Encryption, and common vulnerability keywords such as attack, bypass, back-door, breach, trojan, spyware, virus, ransom, malware, worm, and sniffer. Note that one code review comment can be classified into multiple groups. For example, the comment '[..] we ensure that when the 'while' loop ends, there are always at least 2 more slots available in the output buffer without overrunning it [..]' is related to a vulnerability (i.e., buffer overflow) as well as a coding weakness (Incorrect Calculation of Buffer Size (CWE-131)). Hence, this comment is classified as mentioning both a vulnerability and a coding weakness.

Results
Our preliminary result shows that coding weaknesses were raised more often than vulnerabilities during the code review. Table 6 shows the number of code review comments that mentioned a coding weakness, a vulnerability, and others. From 400 sampled code review comments for each studied project, we identified 67 comments related to coding weaknesses and 2 comments related to vulnerabilities in PHP, and 84 comments related to coding weaknesses and 4 comments related to vulnerabilities in OpenSSL. The proportion of code review comments that mentioned vulnerabilities aligns with the findings of Di Biase et al. (2016), who found that 1% of the code review comments identified vulnerabilities.
Table 6 shows that the number of comments that mentioned a coding weakness is 21 to 33.5 times higher than the number of comments that mentioned a vulnerability (84 versus 4 in OpenSSL and 67 versus 2 in PHP). In addition, we observed that reviewers sometimes point out a potential impact on the system when raising a coding weakness. For example, a reviewer commented that '[..] bad things will happen if the object gets freed before the ctx has finished using it [..]', which is related to the 'Use After Free' weakness (CWE-416) and expresses a concern about its potential impact (i.e., "bad things will happen").
These findings indicate that code review comments related to coding weaknesses are more prevalent than code review comments related to vulnerabilities, highlighting that coding weaknesses, which are faults introduced during software development, can be identified and discussed during the code review process.

PA2: Preliminary Evaluation of our Security Concern Identification Approach
Since we cannot manually identify code review comments that contain coding weaknesses in the entire code review comment dataset (i.e., 135K comments; see Table 2), we opt to use a semi-automated approach to identify comments, as explained in Section 4.6. In particular, we measure the cosine similarity score between each code review comment and the descriptions of coding weakness categories, and we manually validate the comments with high cosine similarity scores until reaching the saturation point, i.e., 50 consecutive comments are identified as generic or irrelevant. In this work, we explore two well-known vector representation techniques (i.e., TF-IDF and word embedding) when measuring cosine similarity. We did not use the keyword search like prior works (Bosu et al., 2014; Paul et al., 2021a,b) because their pre-defined keyword lists are limited and may not cover all coding weaknesses. Hence, we set out this preliminary analysis to evaluate the effectiveness of our approach compared to the keyword search and to examine which vector representation produces similarity scores that better distinguish the code review comments that contain coding weaknesses from the irrelevant code review comments.

Approach
We conducted our preliminary evaluation based on the sampled dataset and our manual classification in PA1. We considered the comments that mentioned a coding weakness as the coding weakness comments group, and the other comments as the non-coding weakness comments group. We pre-processed the code review comments in the sampled dataset and the combined descriptions of coding weaknesses in all CWE-699 categories with the method described in Section 4.6.1. Then, we generated TF-IDF and word embedding vectors of the code review comments and the combined descriptions. Finally, we calculated the similarity score between the vectors.
To measure the effectiveness of our approach, we adopted the effort-aware evaluation concept (Kamei et al., 2013; Verma et al., 2016). We measured top-k precision, recall, and F1-score, where k is the number of comments with the highest similarity scores. While the value of k approximates the effort required for our manual validation, the top-k precision shows the proportion of the top-k comments that are coding weakness comments; the top-k recall shows the percentage of all coding weakness comments that can be identified within the top-k; and the top-k F1-score is a single score that represents both top-k precision and top-k recall. For the keyword search, we measured the precision, recall, and F1-score of the code review comments that were identified by a set of vulnerability keywords from previous secure code review studies (Bosu et al., 2014; Paul et al., 2021a,b).
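The top-k measures can be made concrete with a short sketch. It assumes each sampled comment is represented as a (similarity score, label) pair, where the label records whether the manual classification marked it as a coding weakness comment; the scores and labels in the example are invented.

```python
def top_k_metrics(scored_comments, k):
    """Top-k precision, recall, and F1 for (score, is_weakness) pairs."""
    ranked = sorted(scored_comments, key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    hits = sum(1 for _, is_weakness in top if is_weakness)
    positives = sum(1 for _, is_weakness in scored_comments if is_weakness)
    precision = hits / len(top) if top else 0.0
    recall = hits / positives if positives else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: three weakness comments among four, inspected at k = 2.
sample = [(0.9, True), (0.8, False), (0.7, True), (0.1, True)]
precision, recall, f1 = top_k_metrics(sample, k=2)
```

In this toy ranking, one of the two inspected comments is a weakness comment (precision 0.5), and one of the three weakness comments is found (recall of one third), which gives an F1-score of 0.4.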
To evaluate the two vector representation techniques, we examine which technique produces similarity scores for coding weakness comments that are higher than the scores for non-coding weakness comments. Thus, we used the one-sided Mann-Whitney-Wilcoxon test to examine the statistical difference in the similarity scores between the two groups of code review comments. We also used Cliff's |δ| effect size to estimate the magnitude of the difference in scores between the two groups.
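Cliff's δ can be computed directly from pairwise comparisons between the two groups of scores: it is the share of pairs in which the first group's score is higher minus the share in which it is lower. The sketch below is a plain-Python illustration; the magnitude thresholds (0.147, 0.33, 0.474) are the ones commonly attributed to Romano et al. (2006), and the example scores are invented.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Interpret |delta| with the commonly cited thresholds."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Invented similarity scores for the two groups of comments.
weakness_scores = [0.62, 0.55, 0.48]
other_scores = [0.50, 0.31, 0.22]
delta = cliffs_delta(weakness_scores, other_scores)
```

A value of 1 means every score in the first group exceeds every score in the second, and a value of 0 means the two groups overlap completely.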

Results
As shown in Table 7, we found that our approach with word embedding vectors achieved the highest top-k F1-score in OpenSSL and PHP for all k ∈ {20, 40, 60, 80, 100}, with a top-k F1-score of 0.16 to 0.58, while our approach with TF-IDF achieved a top-k F1-score of 0.14 to 0.47. Table 7 also shows that our approach achieves a higher F1-score than the keyword search. The keyword search retrieved 16 and 13 comments that contain one of the vulnerability keywords, achieving F1-scores of 0.28 and 0.25 for OpenSSL and PHP, respectively. Moreover, we observe that the keyword search did not identify some types of coding weaknesses that can introduce vulnerabilities, such as Pointer Issues (CWE-465). For example, the keyword approach could not identify the comment "The object can't be referenced after free obj, only dtor obj", which is related to the 'NULL Pointer Dereference' weakness. This result shows that our approach using cosine similarity can identify more coding weakness comments than the keyword search.
For the performance of the similarity score calculation, Table 8 shows the results of the one-sided Mann-Whitney-Wilcoxon test and Cliff's |δ| effect size between the similarity scores of the coding weakness comments and the non-coding weakness comments. We found that the similarity scores of coding weakness comments are significantly higher than those of non-coding weakness comments (p-value < 0.05) when using either TF-IDF or word embedding vectors. In addition, we found that the difference in the similarity scores from the word embedding vectors has a large effect size (|δ| ≥ 0.474 (Romano et al., 2006)) for both OpenSSL and PHP, while the difference in the similarity scores from the TF-IDF vectors has a large effect size for OpenSSL and a medium effect size for PHP. This suggests that the similarity scores based on the word embedding vectors can better differentiate coding weakness comments from their counterparts than the similarity scores based on the TF-IDF vectors. This finding is consistent with the top-k precision, recall, and F1-scores shown in Table 7, i.e., at the same k value, using word embedding vectors achieves a higher score than using TF-IDF vectors. Our preliminary evaluation shows that our approach with the word embedding technique 1) achieves a higher recall than the TF-IDF technique and the keyword search and 2) can better distinguish the coding weakness comments. Therefore, in this study, we used the word embedding technique to calculate the similarity scores to help us manually identify the coding weakness comments in the remaining dataset.

Case Study Results
We report the empirical results based on the code review comments identified by the semi-automated approach and answer the three research questions in this section, followed by a summary of our findings.
RQ1: What kinds of security concerns related to coding weaknesses are often raised in code review?
Table 9 shows the number of identified code review comments and aggregated security concerns. From the 135K code review comments in the dataset, we manually read 3,570 OpenSSL and 2,576 PHP comments with the highest cosine similarity scores until reaching the saturation point (i.e., 50 consecutive irrelevant comments). As described in Section 4.6.2, in the first iteration we removed irrelevant comments (e.g., related to bookkeeping and code styling), resulting in 232 and 148 comments. Subsequently, the first and the third author independently determined whether the comments raised legitimate security concerns and could be classified into one of the coding weakness categories, resulting in 202 and 128 comments. To simplify the results, we aggregated comments within the same pull request that were classified into the identical coding weakness category into a single security concern. In total, we identified 188 security concerns from 202 comments in 164 pull requests in OpenSSL and 123 security concerns from 128 comments in 100 pull requests in PHP. Note that one pull request can have multiple concerns with different coding weakness categories. The manual annotation process by the first and the third author achieved inter-rater agreement (Cohen, 1960) of κ = 0.70 and κ = 0.84 for OpenSSL and PHP, which can be interpreted (McHugh, 2012) as substantial (0.61 ≤ |κ| ≤ 0.81) and almost perfect (|κ| > 0.81), respectively.
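For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Cohen's κ for two annotators who each assign one label per comment. The label lists are hypothetical and serve only to show the calculation.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters agree.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: derived from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: 1 = raises a security concern, 0 = does not.
first_author = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
third_author = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(first_author, third_author), 2))
```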
Table 10 shows the number of identified security concerns across the 40 coding weakness categories of CWE-699. The numbers in parentheses indicate the CWE category number of the coding weakness. We found that in OpenSSL and PHP, identified security concerns were related to 35 out of 40 coding weakness categories of CWE-699, suggesting that diverse types of coding weaknesses can be discovered during the code review process. The bold text in Table 10 highlights the top ten coding weaknesses that were frequently raised in each project and the ‡ symbol indicates the concerns that were frequently raised in both OpenSSL and PHP. We found that six coding weaknesses, i.e., Authentication Errors (CWE-1211), API/Function Errors (CWE-1228), Privilege Issues (CWE-265), Behavioral Problems (CWE-438), Cryptographic Issues (CWE-310) and Random Number Issues (CWE-1213), were among the top ten concerns in both OpenSSL and PHP. Additionally, we observe that several coding weaknesses were frequently raised in a particular project. This may suggest that while reviewers in OpenSSL and PHP share a set of common concerns, they can have a specific focus on particular security aspects as well.
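The per-category counts rest on the aggregation rule that comments in the same pull request with the same coding weakness category form one security concern. A minimal sketch of that rule follows, assuming each annotated comment is represented as a (pull request, category) pair; the pull request numbers below are hypothetical.

```python
from collections import Counter


def concerns_per_category(annotated_comments):
    """Collapse comments into concerns and count concerns per category.

    annotated_comments: iterable of (pull_request_id, cwe_category) pairs,
    one pair per annotated comment.
    """
    # A concern is a unique (pull request, category) combination.
    concerns = set(annotated_comments)
    return Counter(category for _, category in concerns)


# Hypothetical annotations: two comments in PR 1 share a category.
comments = [
    (1, "CWE-310"),
    (1, "CWE-310"),
    (1, "CWE-1211"),
    (2, "CWE-310"),
]
counts = concerns_per_category(comments)
print(sum(counts.values()), dict(counts))
```

Under this rule a pull request can still contribute several concerns, as long as they fall into different categories.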
Below, we present common security concerns across both projects and project-specific security concerns.
Common security concerns in OpenSSL and PHP: The first two common security concerns are related to users and rights, i.e., Authentication Errors (CWE-1211) and Privilege Issues (CWE-265) coding weaknesses. Authentication Errors (CWE-1211) are related to the failure to properly verify the identification of the rightful actors who can gain access to the system. For example, as shown in Figure 7, we observed that a reviewer noticed that the program does not verify whether the certificate is trusted or not: "[...] The certificate in question is now detached from its provenance, we don't know whether it came from the trust store, or from the peer-supplied untrusted chain! [...]". 44 Privilege Issues (CWE-265) are related to the improper management of critical privileges assigned to users or objects. For example, a reviewer mentioned that the developer did not use the correct approach to verify that the user has sufficient privileges to execute a script. 45
Another two common security concerns are related to coding weaknesses about the functionality of the system, i.e., API/Function Errors (CWE-1228) and Behavioral Problems (CWE-438). API/Function Errors (CWE-1228) cover the use of dangerous functions or the exposure of functions that allow unwanted actors to execute restricted actions. For example, as shown in Figure 8, we observed that a reviewer commented that writing the result of the format string function to the same input variable can be potentially harmful: "[...] Using the same variable as both input and output for spprintf looks dangerous. Are you sure it is safe?". 46 Behavioral Problems (CWE-438) refer to code that may cause unexpected behavior in the software system. For example, a reviewer noticed that the code can look for the required files in incorrect directories if the program is compiled in different environments. 47
Concerns related to the cryptographic process, i.e., Cryptographic Issues (CWE-310) and Random Number Issues (CWE-1213), were also common in both OpenSSL and PHP. Cryptographic Issues (CWE-310) cover the proper use of encryption algorithms and cryptographic keys to ensure system and data security. For instance, as shown in Figure 9, a developer responded to a reviewer's suggestion by saying that the lengths of the cryptographic keys can be dynamic and cannot be restricted to a fixed value: "[...] HMAC keys can be variable length so SHA256_DIGEST_LENGTH doesn't seem like the right answer here". 48 Random Number Issues (CWE-1213) account for the process of obtaining sufficient randomness, which is essential for robust data encryption. For example, a reviewer suggested that the library should have an inlet for external entropy to increase randomness in the random number generator. 49
44 https://github.com/openssl/openssl/pull/13770#discussion_r555847704
Fig. 8: A security concern in category API/Function Errors (CWE-1228).
Fig. 9: A security concern in category Cryptographic Issues (CWE-310). The developer had clarified the concern that was raised by a reviewer.
Including the six common coding weaknesses, there are 21 types of coding weaknesses that were raised in both projects. In particular, security concerns related to coding weaknesses in the categories Audit/Logging Errors (CWE-1210), Information Management Errors (CWE-199), Concurrency Issues (CWE-557), Memory Buffer Errors (CWE-1218), Business Logic Errors (CWE-840), and Resource Locking Problems (CWE-411) are among the top 20 categories in both projects. Security concerns in these categories may also be considered common concerns to some extent.
Previous code review studies (Alfadel et al., 2023; Paul et al., 2021b; Di Biase et al., 2016; Bosu et al., 2014; Edmundson et al., 2013) reported that reviewers can identify security issues to various degrees depending on the application domain and the programming language. However, the studied security issues are frequently bounded by well-known vulnerabilities that are associated with security consequences such as SQL Injection, XSS, or Denial of Service. Our results further reveal that reviewers can commonly discuss more extensive coding weaknesses that can introduce those vulnerabilities from the development perspective. For example, discussions regarding API/Function Errors (CWE-1228), Behavioral Problems (CWE-438), Cryptographic Issues (CWE-310) and Random Number Issues (CWE-1213) have not been previously reported.
Project-specific security concerns: In addition to common security concerns, understanding project-specific concerns would allow us to gain better insight into the secure code review practices in each project. We observed that in OpenSSL, a library that provides encryption functionalities to its dependent systems, reviewers seem to focus on preventing direct security threats that are related to encryption, e.g., Key Management Errors (CWE-255) and Communication Channel Errors (CWE-417). For example, a reviewer discussed the causes of a timing attack, which can reveal the type of cryptographic key used in secure communication to the attacker. 50 On the other hand, in PHP, a programming language for web applications, reviewers instead focus on security related to data controlling, e.g., Data Validation Issues (CWE-1215), and to the versatility of the language, e.g., Pointer Issues (CWE-465) and Type Errors (CWE-136). Also, it seems that PHP reviewers are concerned with Documentation Issues (CWE-1225), which are rarely recognized in a security context (Alfadel et al., 2023). For example, a developer explained to a reviewer that a function should not be declared to accept parameters of any type if it is intended to raise a TypeError when the user passes parameters of incorrect types, e.g., to avoid a Denial of Service vulnerability. 51 In another case, a reviewer noticed that a function does not implement the randomization algorithm that its documentation claims to use. 52 These types of security concerns highlight the importance of input management and documentation in PHP.
Lastly, the coding weakness types that were rarely raised may be irrelevant to the application domains of the systems. We did not observe any concerns related to Lockout Mechanism Errors (CWE-1216), which concern overly restrictive authentication policies and are not applicable to either project. Similarly, no concerns related to User Interface Security Issues (CWE-355) were found, as OpenSSL and PHP do not have an elaborate user interface. Therefore, it is less likely that reviewers would raise this type of concern.
50 https://github.com/openssl/openssl/pull/16944#issuecomment-957022300
51 https://github.com/php/php-src/pull/5847#discussion_r454239763
52 https://github.com/php/php-src/pull/1681#issuecomment-187120620
Summary: A software project can have particular security concerns from coding weaknesses. In addition to common security concerns across the two studied projects, OpenSSL has additional concerns related to direct security threats that are associated with encryption, and PHP has additional concerns related to data controlling and the versatility of the language, such as data types and pointers.
RQ2: How aligned are the raised security concerns and known vulnerabilities?
Based on the mapping of known vulnerabilities to related coding weaknesses, as explained in Section 4.7, we find that the known vulnerabilities of OpenSSL and PHP during the studied period are related to 16 coding weakness categories. We answer this question by comparing the percentages of the known vulnerabilities and the raised security concerns that we found in RQ1.
Fig. 10: The distribution of known vulnerabilities and the security concerns across the coding weakness categories (the CWE-699 taxonomy). Note that the other categories that did not occur in known vulnerabilities are omitted.
Figure 10 shows that nine coding weakness categories in OpenSSL and six coding weakness categories in PHP have a high proportion of known vulnerabilities in the past, but are less frequently discussed in code reviews. For instance, the top two coding weakness categories that have the highest proportion of known vulnerabilities are Memory Buffer Errors (CWE-1218; 21% in OpenSSL and 29% in PHP) and Resource Management Errors (CWE-399; 21% in OpenSSL and 17% in PHP). However, these two coding weakness categories have a relatively low proportion of security concerns raised in the code reviews (4%-9%). Similarly, 6%-12% of the known vulnerabilities are related to Business Logic Errors (CWE-840), File Handling Errors (CWE-1219), and Pointer Issues (CWE-465), which were rarely discussed in the code reviews (only 1%-7% of the security concerns).
Moreover, we observe that OpenSSL has three coding weaknesses that are less frequently discussed in code reviews, i.e., Information Management Errors (CWE-199) (17% of known vulnerabilities; 3% of security concerns), Cryptographic Issues (CWE-310) (7% of known vulnerabilities; 4% of security concerns), and Data Neutralization Issues (CWE-137) (2% of known vulnerabilities; 0% of security concerns). In particular, the lower number of security concerns about Data Neutralization Issues aligns with the observation of Braz et al. (2021) that developers may not be aware of the consequences of improper input validation, as well as with the case of Heartbleed shown in Figure 1.
On the other hand, coding weaknesses in six categories in both OpenSSL and PHP were more frequently discussed than the known vulnerabilities. Coding weaknesses related to Authentication Errors (CWE-1211), String Errors (CWE-133), Type Errors (CWE-136), Concurrency Issues (CWE-557), Data Processing Errors (CWE-19), and Behavioral Problems (CWE-438), which occurred in 4% of known vulnerabilities in OpenSSL and PHP, were discussed in 22%-23% of security concerns in both projects.
Despite the low frequency of the security concerns compared to the known vulnerabilities, all of the coding weakness categories of the known vulnerabilities, except for Numeric Errors (CWE-189), were discussed in the code reviews, as shown in Figure 10. This finding suggests that reviewers may be able to identify these kinds of coding weaknesses, but that these weaknesses require more attention.
Summary: Coding weaknesses related to memory, resource management, numeric errors, business logic, file handling, and pointers are less frequently discussed compared to the known vulnerabilities in both projects. Nevertheless, almost every type of coding weakness related to known vulnerabilities can be identified and discussed in code reviews.
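The comparison underlying RQ2 can be illustrated with a small sketch that converts per-category counts into percentage shares and lists the categories whose share of known vulnerabilities exceeds their share of security concerns. The counts below are hypothetical and chosen only to show the mechanics; they are not the values behind Figure 10.

```python
def shares(counts):
    """Convert absolute counts per category into percentage shares."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: 100.0 * n / total for category, n in counts.items()}


def under_discussed(vulnerability_counts, concern_counts):
    """Categories whose vulnerability share exceeds their concern share,
    sorted by the size of the gap (largest first)."""
    vuln = shares(vulnerability_counts)
    conc = shares(concern_counts)
    rows = [
        (category, vuln[category], conc.get(category, 0.0))
        for category in vuln
        if vuln[category] > conc.get(category, 0.0)
    ]
    return sorted(rows, key=lambda row: row[1] - row[2], reverse=True)


# Hypothetical counts per CWE-699 category.
known_vulnerabilities = {"CWE-1218": 6, "CWE-399": 3, "CWE-1211": 1}
security_concerns = {"CWE-1218": 2, "CWE-399": 2, "CWE-1211": 10, "CWE-438": 6}
for category, v, c in under_discussed(known_vulnerabilities, security_concerns):
    print(f"{category}: {v:.0f}% of vulnerabilities vs {c:.0f}% of concerns")
```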
RQ3: How are security concerns handled in code review?
Through our qualitative analysis of code review activities in pull requests with security concerns described in Section 4.8, we identified eight scenarios in which security concerns were handled, and we grouped them into four main themes. Table 11 shows the number of concerns in each handling scenario. 53 Below, we describe each handling scenario.
C1. Fix attempted (39% in OpenSSL; 41% in PHP): For the largest group of security concerns, we observed that developers attempted to address the concern by modifying code (i.e., committing new changes to the pull request). However, the attempt can result in the concern being either successfully fixed (C1.1; 37% for both projects) or unsuccessfully fixed (C1.2; 4% for OpenSSL and 2% for PHP). The successfully fixed scenario (C1.1) refers to cases where reviewers agreed to merge the pull request after the code was modified. For example, a reviewer requested that the developer adjust the memory handling process. 54 The developer further inquired and eventually modified the code and committed a new change to the same pull request. The unsuccessfully fixed scenario (C1.2) refers to cases where reviewers declined or did not respond to the new changes that the developers made, which eventually led to pull request rejection. For example, reviewers suggested an alternate approach for cleaning up the memory; the developer made a fix, but never received further feedback. 55
C2. Acknowledged (30% in OpenSSL; 36% in PHP): For a third of the security concerns raised, we observed that security concerns were acknowledged by the developer or other reviewers but were not fixed in the same pull request. We observed that the concerns were not fixed in the same pull request because they would be fixed elsewhere (C2.1; 10% for OpenSSL and 18% for PHP) or due to an unresolved discussion (C2.2; 20% for OpenSSL and 18% for PHP). In particular, for the fix-elsewhere scenario (C2.1), the reviewers and developers discussed the raised concern and agreed that the necessary fixes should be made in new pull requests. We find that around half (55%) of the security concerns in this scenario were eventually merged in both projects (11/20 for OpenSSL and 12/22 for PHP). For example, a reviewer noticed the use of a stale pointer and suggested a fix. The developer then replied, "[...] Ok. I'll prepare a pull request (but not right away) and request your review.". 56 However, it is not possible to confirm whether all security concerns in the C2.1 scenario were later fixed as promised.
For the unresolved discussion scenario (C2.2), the developers and reviewers could not find an agreeable direction to address the concern. The discussions in this scenario tend to be more rambling and involve several sub-concerns, hindering reviewers from reaching an agreeable resolution. This could be due to different understandings and perspectives between reviewers. For example, reviewers and developers discussed the resolution while aiming to maintain compliance with security standards. However, due to the equivocal interpretation of the standards, the discussion could not reach an agreeable resolution. 57 Another example is that a reviewer raised a concern about the certificate authentication process and requested a modification. 58 The other reviewers, including the developer, agreed that the concern was valid but held multiple opinions on the solutions. The pull request with the concern was eventually merged without any changes. Indeed, we found that 16 pull requests in OpenSSL and 5 pull requests in PHP, which contain 16 (16/36 = 44% for OpenSSL) and 7 (7/22 = 31% for PHP) concerns in C2.2, were eventually merged without any evidence that the concerns were addressed.
54 https://github.com/openssl/openssl/pull/7611#discussion_r238410834
55 https://github.com/openssl/openssl/pull/5495#issuecomment-370178440
56 https://github.com/openssl/openssl/pull/7455#discussion_r227727671
It should be noted that the reviewer's workload may affect the code review outcomes. We found that a significant portion of reviewers (54% in OpenSSL and 17% in PHP) engaged in unresolved discussions (C2.2) are classified as high-workload reviewers, i.e., they reviewed over 100 pull requests in the respective project. We hypothesize that workload, characterized by the volume of code reviews, as discussed in prior research (Ruangwan et al., 2019), could influence the quality of the code review process. However, future work is needed to further investigate this phenomenon.
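The workload classification above can be sketched as follows, assuming a reviewer counts as high-workload when they reviewed more than 100 pull requests in the project. The reviewer names and review counts are hypothetical.

```python
def high_workload_share(review_counts, involved_reviewers, threshold=100):
    """Percentage of the involved reviewers who reviewed more than
    `threshold` pull requests in the project."""
    involved = list(involved_reviewers)
    if not involved:
        return 0.0
    high = [r for r in involved if review_counts.get(r, 0) > threshold]
    return 100.0 * len(high) / len(involved)


# Hypothetical number of reviewed pull requests per reviewer.
reviewed = {"alice": 150, "bob": 100, "carol": 20, "dave": 300}
# Hypothetical reviewers who took part in unresolved discussions (C2.2).
in_unresolved = ["alice", "bob", "carol", "dave"]
print(high_workload_share(reviewed, in_unresolved))
```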
C3. Dismissed (26% in OpenSSL; 15% in PHP): In this scenario, the developer and reviewers discussed the security concerns raised, and the security concerns were dismissed. We observed that the discussions eventually concluded that the concern was a false concern (C3.1; 13% for OpenSSL and 7% for PHP) or acceptable by design choice (C3.2; 13% for OpenSSL and 7% for PHP). Specifically, the false concern scenario (C3.1) is related to cases in which developers or other reviewers offered an explanation to invalidate the security concerns. For example, a reviewer raised a concern about leaking sensitive data. 59 Then, the developer replied to the comment to explain that the implementation is not leaking sensitive data: "[...] %s given part shouldn't be added for values (but only for types) since they might contain sensitive data", which was accepted by the reviewer. The design choice scenario (C3.2) refers to cases where security concerns were dismissed due to other factors such as performance trade-offs, maintainability, or system design (Zanaty et al., 2018). For example, a developer responded that a change in the data-neutralizing process was a valid concern as raised by the reviewer; however, it did not affect the application logic. 60 The reviewer finally agreed and approved the pull request. We also observed that 20 pull requests in OpenSSL and 4 pull requests in PHP, which contain 21 (21/24 = 88% for OpenSSL) and 4 (4/8 = 50% for PHP) concerns in scenario C3.2, were eventually merged.
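The merged proportions quoted for scenarios C2.1, C2.2, and C3.2 are plain ratios of concern counts. The following minimal sketch recomputes them from the numerators and denominators given in the text; the dictionary keys are labels chosen here for illustration.

```python
def percent(numerator, denominator):
    """Share of `numerator` in `denominator`, as a percentage."""
    return 100.0 * numerator / denominator


# (merged concerns, all concerns in the scenario), as quoted in the text.
merged_fractions = {
    "C2.1 OpenSSL": (11, 20),
    "C2.1 PHP": (12, 22),
    "C2.2 OpenSSL": (16, 36),
    "C2.2 PHP": (7, 22),
    "C3.2 OpenSSL": (21, 24),
    "C3.2 PHP": (4, 8),
}

for scenario, (merged, total) in merged_fractions.items():
    print(f"{scenario}: {merged}/{total} = {percent(merged, total):.1f}%")
```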
C4. Unresponded (3% in OpenSSL; 9% in PHP): There were a few cases where security concerns did not receive any response, and no related activity was logged in the pull request. This was in part due to out-of-context (C4.1; 2% for PHP) or unknown & inactivity (C4.2; 3% for OpenSSL and 7% for PHP). The out-of-context scenario (C4.1) refers to cases where security concerns drift away from the current discussion or from the code changes. For example, a reviewer raised a security concern about an insufficient check of input and instantly volunteered to create a new change request that fixes the problem; however, the developer and other reviewers did not respond to the concern. 61 The unknown & inactivity scenario (C4.2) refers to cases where security concerns were simply disregarded without a clear reason. For example, a reviewer remarked on a suspicious use of a pointer, but the developer did not respond and the pull request was eventually rejected. 62
57 https://github.com/openssl/openssl/pull/5402#issuecomment-369513189
58 https://github.com/openssl/openssl/pull/4848#issuecomment-363287772
59 https://github.com/php/php-src/pull/5340#discussion_r402406437
60 https://github.com/openssl/openssl/pull/12251#discussion_r445535130
Summary: For many of the security concerns raised, the developers attempted to fix them within the same pull request. However, 30%-36% of the security concerns raised were only acknowledged and merged without immediate fixes and 15%-26% of the concerns were dismissed or not responded to.

Discussion
In this section, we discuss the implications of our results and provide practical recommendations for practitioners and potential future work.
1) Various coding weaknesses that may lead to security issues can be raised during code reviews. Our first preliminary analysis (PA1) in Section 5 shows that coding weaknesses were raised in the code review process 21-33.5 times more often than explicit vulnerabilities. This finding supports our intuition that the reviewers tend to focus on issues in source code. Therefore, it is more natural for the reviewers to identify coding weaknesses than security issues. This implication aligns with the previous work (Gonçalves et al., 2022), which found that the cognitive load required for code reviews is lower if the reviewers already have the relevant knowledge.
Indeed, our RQ1 shows that the raised security concerns in code reviews of OpenSSL and PHP cover nearly 90% of the CWE-699 weakness types (i.e., 35 out of 40 categories, see Table 10). This confirms our presumption that a variety of coding weaknesses can be raised by reviewers during the code review process. As shown in the motivating examples in Section 3, such coding weaknesses can lead to security issues. It can be implied that the coding weaknesses that may introduce security issues can potentially be identified during the code review process even though the weaknesses have not yet explicitly exposed the vulnerable outcomes (Braz et al., 2021). Our manual observations from RQ1 also show that the code changes may potentially be vulnerable if the author did not address the raised security concerns. For instance, Figure 7 shows that vulnerabilities such as CVE-2008-4989
Recommendation: As we found that coding weaknesses can be identified in code reviews, our findings suggest that practitioners and/or other software projects could adopt the coding weaknesses taxonomy (i.e., CWE-699) to assist code reviews. A list of coding weaknesses should help the team increase the awareness of the potential problems that can lead to security issues without requiring deep security knowledge. A recent controlled experiment by Braz et al. (2022) has shown that a code review checklist could help reviewers better find security issues. Hence, one of the possible ways to adopt the coding weaknesses taxonomy for code reviews is to incorporate it into a code review checklist. Future work should investigate the effectiveness and practicality of using coding weaknesses as a code review checklist for identifying and mitigating security issues during the code review process. Moreover, as coding weaknesses are more frequently discussed than security issues, coding weaknesses can also be an effective proxy for understanding secure code review practices.
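As one possible illustration of this recommendation, the sketch below turns a handful of CWE-699 categories into checklist items for a review template. The category names and identifiers are the six common ones reported in RQ1; the wording of the checklist question and the function name are assumptions made here, not a validated checklist.

```python
# Six CWE-699 categories that were common concerns in both studied projects.
COMMON_CATEGORIES = {
    "CWE-1211": "Authentication Errors",
    "CWE-1228": "API/Function Errors",
    "CWE-265": "Privilege Issues",
    "CWE-438": "Behavioral Problems",
    "CWE-310": "Cryptographic Issues",
    "CWE-1213": "Random Number Issues",
}


def build_checklist(categories):
    """Render one unchecked checklist item per coding weakness category."""
    return [
        f"[ ] {name} ({cwe_id}): could this change introduce a weakness of this kind?"
        for cwe_id, name in categories.items()
    ]


for item in build_checklist(COMMON_CATEGORIES):
    print(item)
```

A project would likely tailor the list to the categories that matter in its own domain rather than include all 40 categories.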
2) Coding weaknesses related to the known vulnerabilities of the systems are not frequently discussed in code reviews. Our RQ2 shows that some types of coding weaknesses were less frequently discussed compared to the known vulnerabilities (see Figure 10). In particular, we found that Memory Buffer Errors (CWE-1218) and Resource Management Errors (CWE-399) are the least frequently discussed coding weaknesses in OpenSSL and PHP (4%-9%), despite the high percentages of known vulnerabilities (17%-29%). Furthermore, our motivating examples in Section 3 highlighted that such coding weaknesses can lead to a serious vulnerability. For example, OpenSSL's Heartbleed is a known vulnerability related to the weakness Out-of-bounds Read (CWE-125), which is a type of memory buffer error. These coding weaknesses were rarely discussed, possibly because they are generic and easy to overlook. Hence, the reviewers may have failed to notice them. To mitigate this problem, the reviewers should be aware of these latent coding weaknesses in order to properly prioritize them in the code reviews.
In addition to the known vulnerabilities, our RQ1 indicates that the security concerns in code reviews can vary from project to project. Particularly, OpenSSL reviewers were concerned about direct security threats (e.g., Authentication Errors (CWE-1211) and Random Number Issues (CWE-1213)), while PHP reviewers were more concerned about data controlling (e.g., Type Errors (CWE-136)). As OpenSSL is an encryption library for secure communication and PHP is a programming language, it can be implied that the application domain may correlate with the coding weaknesses that reviewers can raise. This finding also supports our results that coding weaknesses such as User Interface Security Issues (CWE-355) and Encapsulation Issues (CWE-1227) were neither found in our results nor appear in the known vulnerabilities because they are less related to the application domains of the studied projects.
Recommendation: Our findings suggest that it is essential to identify the specific coding weaknesses that are significant, highly prone to introduce security issues, and relevant to the application domain of the projects. Thus, rather than reviewing all types of coding weaknesses, a selected set of coding weaknesses can be prioritized for effective code reviews. Prioritization of coding weaknesses during code reviews can be based on vulnerabilities and the unique concerns of the projects that were raised in the past. Future work can investigate a systematic approach for identifying and prioritizing the types of important coding weaknesses for individual projects in this context.
3) Not all the raised security concerns were addressed within the same code review process. The security concern handling scenarios identified in our RQ3 reveal a shortcoming in the code review process. Our results show that approximately a third of the security concerns from coding weaknesses (30%-36%, see C2 in Table 11) were acknowledged without fixes in the process. We observed that developers promised to fix some of the acknowledged concerns in new independent code changes (10%-18%), but some concerns were left unfixed due to disagreement about the proper solution (18%-20%). Nevertheless, approximately half of the unresolved concerns (6%-9%) were eventually merged. This result implies a possible risk that security issues can slip through the code review process into the software product. Incomplete code reviews or unclean code changes that contain security concerns related to coding weaknesses should be held from merging until all security concerns are resolved. Otherwise, the remaining coding weaknesses in code changes can become security issues in the future. This implication is consistent with the findings of prior work, which reported that relentless and inconclusive discussion could impact the code review quality (Kononenko et al., 2015), and that incomplete code reviews and unsuccessful fixes can negatively affect the developer's contribution (Gerosa et al., 2021).
Recommendation: Code reviews with security concerns should be escalated if the final resolutions cannot be agreed upon before merging. Security experts or experienced developers should be included in such code reviews to investigate complex security concerns. In addition, mechanisms to notify the reviewers of incomplete code reviews or insufficiently addressed security concerns could reduce the risk that security issues will slip through the code review process into the software product. Our suggestion aligns with Wessel et al. (2020), who reported that the adoption of an automated mechanism such as code review bots can increase the number of merged pull requests and, hence, reduce the number of abandoned code reviews. Kudrjavets et al. (2022) also observed that automated bots can remind the developers of the pending tasks in the code review process without inciting negative feelings. Hence, future work should investigate an approach to identify incomplete code reviews or insufficiently addressed security concerns to help developers increase awareness.
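A notification mechanism of this kind could, for instance, block or flag a merge while security concerns remain open. The sketch below is a hypothetical illustration only: the status values loosely mirror the handling scenarios of RQ3, and the class and function names are assumptions rather than an existing bot or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    FIXED = "fixed"                  # addressed in this pull request
    DISMISSED = "dismissed"          # false concern or accepted design choice
    FIX_ELSEWHERE = "fix_elsewhere"  # promised for a separate pull request
    UNRESOLVED = "unresolved"        # discussion ended without agreement
    UNRESPONDED = "unresponded"      # nobody reacted to the concern


@dataclass
class Concern:
    comment_url: str
    category: str
    status: Status
    follow_up_url: Optional[str] = None  # link to the promised follow-up, if any


def blocking_concerns(concerns: List[Concern]) -> List[Concern]:
    """Concerns that should be escalated before the pull request is merged."""
    blocking = []
    for concern in concerns:
        if concern.status in (Status.UNRESOLVED, Status.UNRESPONDED):
            blocking.append(concern)
        elif concern.status is Status.FIX_ELSEWHERE and not concern.follow_up_url:
            # A promised fix without a traceable follow-up cannot be verified.
            blocking.append(concern)
    return blocking


# Hypothetical concerns attached to one pull request.
example = [
    Concern("https://example.org/c/1", "CWE-1218", Status.FIXED),
    Concern("https://example.org/c/2", "CWE-465", Status.FIX_ELSEWHERE),
    Concern("https://example.org/c/3", "CWE-1211", Status.UNRESOLVED),
]
for concern in blocking_concerns(example):
    print("escalate:", concern.category, concern.comment_url)
```

Requiring a follow-up link for the fix-elsewhere case is a design choice made here to keep promised fixes traceable.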

Threats to Validity
We discuss potential threats to the validity of our study.

Internal Validity
During the manual annotation to identify security concerns, code review comments can be ambiguous or require more information to understand. In such cases, we decided to preserve the precision of the manual annotation by considering the ambiguous or unclear-context comments as irrelevant to coding weaknesses. However, as the annotation process was conducted categorically, it may be susceptible to the biases of the annotator. To mitigate this, the comments were independently validated by the third author (Section 4.6.2). Additionally, if the comments were relevant to multiple categories (i.e., receiving high similarity scores in multiple categories), they were also annotated and validated multiple times. During the validation of handling scenarios in RQ3 (see Section 4.8), we encountered a few instances of disagreement. We attribute this discrepancy to the limitations inherent in code review data and a potential lack of expertise in the project. We were aware that some weaknesses in the CWE-699 taxonomy are not considered harmful from a security perspective. Thus, we regularly consulted the extended description in CWE-699 to ensure that the security concerns in question can lead to vulnerabilities. We were also aware that several categories in CWE-699 may share similar weaknesses. For example, weaknesses in the Random Number Issues (CWE-1213) category are also listed in the Cryptographic Issues (CWE-310) category. Nevertheless, we only identified three security concerns that shared both coding weaknesses.

Construct Validity
We used an automated text-based approach to facilitate our manual annotation process. The performance of the automated approach can be suboptimal due to the limited vocabulary in the documents. We tried to mitigate this concern by including CWE's alternate terms that developers might use. It should also be noted that the selection of word-embedding techniques can impact the possibility of finding relevant code review comments. We carefully selected a word-embedding model pre-trained in the software engineering domain to reduce the potential issues. In the manual annotation process, we read only comments that have high similarity scores (i.e., reading and doing manual analysis until reaching the saturation point). It is possible that some of the unread comments may also contain coding weaknesses.
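The stopping rule, and the threat it carries, can be illustrated with a minimal sketch. It assumes that comments are already sorted by descending similarity score and that reading stops after a fixed number of consecutive irrelevant comments (50 in this study); the function name, the `is_relevant` callback, and the toy data are hypothetical.

```python
def read_until_saturation(ranked_comments, is_relevant, patience=50):
    """Read comments in ranked order until `patience` consecutive
    irrelevant comments have been seen.

    Returns the relevant comments found and the number of comments read.
    """
    relevant = []
    streak = 0  # consecutive irrelevant comments seen so far
    read = 0
    for comment in ranked_comments:
        read += 1
        if is_relevant(comment):
            relevant.append(comment)
            streak = 0
        else:
            streak += 1
            if streak >= patience:
                break
    return relevant, read


# Toy ranking: "r" marks a relevant comment, "x" an irrelevant one.
ranking = ["r", "x", "r", "x", "x", "x", "r"]
found, read = read_until_saturation(ranking, lambda c: c == "r", patience=3)
print(found, read)
```

In the toy ranking, the last relevant comment sits beyond the saturation point and is never read, which is the kind of omission this threat describes.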
For RQ2, we analyzed the alignment of known vulnerabilities and security concerns by observing the distribution of related weaknesses. It is worth noting that CWE assignments for CVE are based on the security expert's judgment. Therefore, they can be subjective. Additionally, CVE records can be updated. Hence, our analysis is limited by the abstract observations at the time of data collection.
For RQ3, we found two PHP pull requests with a long thread of discussions (100-300 comments). Although we were able to locate the identified comments, it is difficult to observe the handling scenarios, i.e., whether the issue was eventually addressed by developers or not. To avoid misinterpretation of the handling process based on these code review activities, we decided to drop these two pull requests from the results of our RQ3. We tried to minimize this problem by manually checking the final code change and the developer's reactions. However, there is no effective solution to completely mitigate this issue. For transparency, we released the dataset used in this study in our supplementary materials.
Finally, the quality of the studied datasets can affect the validity of the results.Although the projects primarily conduct code reviews on GitHub, we cannot guarantee that our datasets include every code review in each project because some code reviews may not be documented.

External Validity
While increasing the number of studied projects may strengthen the generalizability of the findings, expanding the set of studied subjects is not a trivial task. This is because only a limited number of projects fit our selection criteria, e.g., the size of the project (small projects may not have sufficient security discussion (Di Biase et al., 2016)), the existence of past vulnerabilities (for comparing the alignment with past vulnerabilities), the availability of code review data, and a mandatory code review policy. Furthermore, Nagappan et al. (2013) suggested that indiscriminately increasing the sample size in software engineering studies may not necessarily improve generalizability.
During the annotation process, we observed that the two studied projects have several special traits due to their different application domains. The findings based on these two projects may include aspects that do not apply to other software projects. Thus, the analysis of the studied dataset does not allow us to draw conclusions for all open-source projects. Nevertheless, we carefully selected two distinct projects for this study that differ in nature and potential security issues. PHP is a general-purpose scripting language that may face a wide range of security threats of varying levels depending on its usage. OpenSSL is a library with a primary focus on security. Hence, we believe that security issues present in both of these projects are also relevant to other software projects within similar application domains. Further studies are required to confirm this hypothesis.
As our findings are based on a snapshot of code review datasets until June 2022, the recency of the data can be a concern. To mitigate this issue, we analyzed the coding weaknesses in newly collected code review datasets between June 2023 and February 2024 from both projects, which comprise 6,365 code review comments and 1,427 pull requests in total. We found no major difference in the prevalence of coding weakness discussion between the two datasets. In particular, nine categories remain in the top 10 categories of OpenSSL, and six categories remain in the top 10 categories of PHP. However, we cannot guarantee that the results will be sustained in future code reviews.
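The top-10 comparison between the two snapshots can be sketched as follows. This is an illustration only: the function names and the category counts are hypothetical and are not taken from our datasets, and a small k is used to keep the example short.

```python
from collections import Counter


def top_k(counts, k=10):
    """Return the k most frequently discussed categories, most frequent first."""
    return [name for name, _ in Counter(counts).most_common(k)]


def retained_in_top_k(old_counts, new_counts, k=10):
    """List the categories from the old top-k that also appear in the new top-k."""
    new_top = set(top_k(new_counts, k))
    return [name for name in top_k(old_counts, k) if name in new_top]


if __name__ == "__main__":
    # Hypothetical numbers of security concerns per category in each snapshot.
    old_snapshot = {"Memory Buffer Errors": 5, "Data Validation": 3, "Cryptographic Issues": 1}
    new_snapshot = {"Memory Buffer Errors": 4, "Cryptographic Issues": 3, "Data Validation": 1}
    retained = retained_in_top_k(old_snapshot, new_snapshot, k=2)
    print(f"{len(retained)} of 2 categories retained: {retained}")
```

A ranking-based overlap of this kind ignores the absolute number of comments, which makes it suitable for comparing snapshots of very different sizes.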

Conclusion
To understand the potential benefits of code reviews in identifying security issues from the coding weakness aspect, we conducted an empirical case study to investigate the security concerns raised during code reviews, their alignment with past vulnerabilities, and their handling process. We manually validated and annotated the raised security concerns according to the software development perspective of the Common Weakness Enumeration (CWE-699). Then, we performed a qualitative analysis to investigate the alignment with known vulnerabilities and the handling process of the raised concerns.
Based on the data from two large open-source projects, namely OpenSSL and PHP, our initial analysis indicates that coding weaknesses are discussed 21 to 33.5 times more frequently than vulnerabilities. From our case study results, we found that code reviews can raise security concerns from diverse coding weaknesses, accounting for 35 out of 40 categories in the CWE taxonomy. Some security concerns are consistent across the two studied projects, such as authentication and privilege, API and behavior, and random number and cryptographic issues, while others are unique to each project, such as direct security threats for OpenSSL and data validation for PHP. The coding weaknesses in six weakness categories, e.g., memory errors, file handling, and numeric errors, are raised less frequently than the corresponding known vulnerabilities. Furthermore, there is a chance that coding weaknesses may slip through the code review process, as only 37% of security concerns were fixed within the same code review process. The remaining cases resulted in no responses, dismissal, fixes in other areas, or unsuccessful resolutions. Our findings also highlight an important shortcoming: 6%-9% of the code changes were merged although the security concerns were not addressed.
Our study confirms that code reviews can identify coding weaknesses that may introduce security issues. Practitioners may focus on finding certain coding weaknesses when performing code reviews. However, checking all types of coding weaknesses is not necessary. Each project can prioritize the important coding weaknesses by analyzing its past vulnerabilities and code review comments. Incomplete code reviews that contain coding weaknesses should be carefully monitored because these weaknesses can slip through the code review process and may become security issues in the future. For future work, we encourage researchers to investigate the practical effectiveness of coding weakness-guided code review and to develop a systematic approach for identifying project-specific coding weaknesses to assist practitioners on a larger scale.

Fig. 4: A reviewer identified a coding weakness regarding the exposure of internal values in error messages, which is relevant to Information Management Errors (CWE-199).

Table 1 :
Key differences between prior secure code review studies and this work.

Table 2 :
Number of pull requests and code review comments

Table 3 :
A list of coding weaknesses of the CWE-699 taxonomy

Table 4 :
Number of CVEs in this study

Table 6 :
Number of classified code review comments in the sampled dataset from each project (n=400). Note that one code review comment can be classified into multiple categories.

Table 7 :
Precision, Recall, and F1-score of our approach, using TF-IDF and Word Embedding, and Precision, Recall, and F1-score of the keyword search approach.

Table 8 :
Results of one-sided Mann-Whitney-Wilcoxon tests and Cliff's delta between the similarity scores of the coding weakness and non-coding weakness comments in the studied projects (n=400).

Table 9 :
Number of code review comments that mentioned coding weaknesses and security concerns from our manual analysis.

Table 10 :
Number of identified security concerns in categories.

Table 11 :
Scenarios of the identified security concerns and their distribution. The complete dataset, including the scenario distribution by category, can be found in the data release package.