Do explicit review strategies improve code review performance? Towards understanding the role of cognitive load

Code review is an important process in software engineering – yet, a very expensive one. Therefore, understanding code review and how to improve reviewers’ performance is paramount. In the study presented in this work, we test whether providing developers with explicit reviewing strategies improves their review effectiveness and efficiency. Moreover, we verify if review guidance lowers developers’ cognitive load. We employ an experimental design where professional developers have to perform three code review tasks. Participants are assigned to one of three treatments: ad hoc reviewing, checklist, and guided checklist. The guided checklist was developed to provide an explicit reviewing strategy to developers. While the checklist is a simple form of signaling (a method to reduce cognitive load), the guided checklist incorporates further methods to lower cognitive demands of the task such as segmenting and weeding. The majority of the participants are novice reviewers with low or no code review experience. Our results indicate that the guided checklist is a more effective aid for a simple review, while the checklist supports reviewers’ efficiency and effectiveness in a complex task. However, we did not identify a strong relationship between the guidance provided and code review performance. The checklist has the potential to lower developers’ cognitive load, but higher cognitive load led to better performance, possibly due to the generally low effectiveness and efficiency of the study participants. Data and materials: https://doi.org/10.5281/zenodo.5653341. Registered report: https://doi.org/10.17605/OSF.IO/5FPTJ.


Introduction
Code review is a widely used (Rigby and Bird 2013;Bacchelli and Bird 2013;Gousios et al. 2014;Sadowski et al. 2018) software engineering practice in which one or more reviewers inspect a code change written by a peer (Bacchelli and Bird 2013;MacLeod et al. 2017) to improve software quality (Baum et al. 2017a), find defects (Baum and Schneider 2016), and transfer knowledge (Bacchelli and Bird 2013).
Performing efficient and useful code reviews is an expensive and time-consuming task (Cohen 2010); therefore, improving developers' performance during code review is of great interest. Performance in the context of code review is often defined as how many defects are found in the code change under review (effectiveness) and in how much time (efficiency) (Biffl 2000).
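The two measures can be illustrated with a minimal sketch. It assumes that effectiveness is the count of known defects a reviewer reported and that efficiency is that count per hour of review time; the function names and the per-hour unit are illustrative assumptions, not definitions taken from the study.

```python
def review_effectiveness(found_defects, known_defects):
    """Number of known defects that the reviewer reported (assumed definition)."""
    return len(set(found_defects) & set(known_defects))


def review_efficiency(found_defects, known_defects, review_minutes):
    """Known defects found per hour of review time (assumed unit)."""
    if review_minutes <= 0:
        raise ValueError("review_minutes must be positive")
    hours = review_minutes / 60
    return review_effectiveness(found_defects, known_defects) / hours
```

Remarks that do not match a known defect (false positives) are ignored here; a stricter measure could penalize them.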
The mentally challenging nature of reviewing code is one of the reasons why code review is expensive (Pascarella et al. 2018;Bacchelli and Bird 2013;Baum 2019). To find defects, developers need to process a vast amount of information related to the code change, to its rationale, to its context in the whole codebase, and to its implications for software quality (Pascarella et al. 2018). Understanding a change-set to review (e.g., a pull request (Gousios et al. 2014)) and its context is one of the main challenges of code review (Tao et al. 2012;Bacchelli and Bird 2013).
The cognitive resources (e.g., working memory capacity) available to developers during code review can impact review performance. For example, working memory capacity is helpful to find delocalized (Dunsmore et al. 2003) defects (i.e., defects that can only be identified by inspecting non-contiguous parts in a program). However, working memory is a limited resource (Paas et al. 2003) and the cognitive load that a task poses on the cognitive system can deplete the available capacity, thus leading to cognitive overload and poor performance (Paas et al. 2003;Matthews et al. 2019).
In recent years, researchers devised approaches to support the code review process, such as visualizations (Tymchuk et al. 2015;Oosterwaal et al. 2016), optimizations of the order in which review files are displayed (Baum et al. 2017b;Baum 2019), and untangling of unrelated changes in a changeset under review (Barnett et al. 2015;Dias et al. 2015;Tao and Kim 2015). These aim at increasing developers' review performance. Although most of these approaches do not directly aim to reduce developers' cognitive load, they do improve reviewers' ability to understand the change-set under review and navigate it, both of which are activities that require high cognitive resources.
Existing tools to support developers and improve their code review performance do not guide developers on how to perform the review, even though this kind of guidance could help to lower required cognitive resources (Mayer and Moreno 2003). Rather, to give this kind of support to reviewers, researchers investigated reading techniques for formal code inspection (Fagan 2002;Basili et al. 1996). These techniques guide developers in how to inspect the code searching for defects (Baum 2019). A reading technique for code review used in industry (Baum 2019;Gutha 2015;Gridnev 2017;Carver 2003) is checklist-based reading (Fagan 2002). A checklist guides developers in what and how to review by providing explicit instructions: For instance, a checklist might ask developers to "check an issue for each method" (Kamsties and Lott 1995). Checklists explicitly aim to aid developers in performing complex tasks by systematizing their activity, thus lowering the cognitive load of the task (Kamsties and Lott 1995;LaToza et al. 2020). However, the relationship between checklists and lowered cognitive load has yet to be empirically tested. Moreover, checklists provide a basis to develop an executable reviewing strategy: By automating the flow of its items, a checklist can be turned into a step-by-step strategy.
In this paper, we present a study we designed to explore how to assist developers in decreasing the complexity and cognitive challenges of code review, focusing specifically on checklist-based code review. Our aim is to test whether implementing a reviewing strategy using additional methods of cognitive load reduction in a code review tool leads to improved review effectiveness and efficiency. We use checklists as a code reading technique and we aim to improve the code review performance by (1) providing the steps on how to execute the review, (2) strengthening the tool-support to systematically execute the checklist, and (3) making the review more focused and flexible to fit better the change-based characteristics of modern code review. To this aim, we developed a guided checklist, whose step-by-step execution is supported by a tool and reflects the content of the specific review. We measure how this method compares to a normal checklist and a control group. Particularly, we investigate whether these approaches improve performance through lowering cognitive load and assess the usability of the implemented guidance approaches.
The research design of this study was accepted as a Registered Report at MSR'20 (Gonçalves et al. 2020). Accordingly, we conducted an experiment with 70 developers who performed three review tasks. The experiment has three treatments: (1) ad hoc reviewing, (2) checklist-based reviewing, and (3) guided checklist-based reviewing (which uses further means of reducing cognitive load). After each review, we measured developers' cognitive load.
The majority (71.6%) of the developers who eventually took part in our experiment do not commonly practice code review, so they can be considered novice reviewers.
The participants achieved low review effectiveness and efficiency regardless of the treatment to which they were assigned, therefore limiting the strength of our results. Nevertheless, we provide an initial indication on the relationships between guidance, code review performance, and cognitive load. Our results show that the guided checklist performs better in a simpler task: Using a regression model, we identified a statistically significant relationship between the use of the guided checklist and review effectiveness in the small review task (Small Change). The checklist, instead, seems to increase our participants' review effectiveness and efficiency in the more complex tasks: We identified the existence of a relationship between the use of the checklist and higher review effectiveness and efficiency in one of the large review tasks (Large Change B).
Moreover, we observed that a higher cognitive load is linked to better performance. This contradicts our expectations. This result might have been caused by the generally low review performance of the participants and could indicate that investing cognitive resources is actually needed to perform well for novice reviewers.

Background and Related Work
Over the years, substantial research has been dedicated to improving developers' performance during peer code review. Some approaches focus on giving developers information on the context of a review change-set: e.g., employing visualizations to show the structure of the code (Tymchuk et al. 2015) and finding potential issues with the change, based on similar changes in the codebase (Zhang et al. 2015). Other approaches focus on simplifying complex review change-sets by decomposing them into groups of related changes (Tao and Kim 2015;Barnett et al. 2015;Dias et al. 2015). Code review is a cognitively demanding task (Baum 2019;Pascarella et al. 2018;Bacchelli and Bird 2013). For this reason, researchers devised approaches to lower reviewers' cognitive load during code review. For instance, Baum et al. (2017b) proposed to order review changes based on their relations (instead of using the alphabetical order of the file names as done by popular code review tools, such as Gerrit and GitHub), as a way to lower the effort developers need to put in understanding of the construction, connections, and logic of the changes to review.
In the following section, we expand on the role that cognitive load plays during code review and present how current tools help to reduce reviewers' cognitive load.

Cognitive Load and Reviews
Working memory is the part of human memory in charge of storing short-term information in processing tasks. It remains stable throughout a person's life and cannot be significantly trained or improved (Dobbs and Rule 1989). Research found evidence that working memory capacity is linked to the capacity of finding delocalized defects during code review. In fact, finding delocalized defects requires simultaneous cognitive processing of different parts of the code.
Cognitive load refers to the amount of working memory used while performing a task (Paas et al. 2003). Once the cognitive load exceeds one's working memory capacity, their performance in the task lowers considerably (Matthews et al. 2019). More difficult tasks (e.g., more challenging code reviews) pose a higher cognitive load and deplete working memory capacity faster. Supporting people in using less working memory capacity while performing their tasks can prevent them from reaching working memory overload. Moreover, this kind of support might also help those with lower working memory capacity to perform well in complex tasks (Bannert 2002).
When it comes to processing information, there are three types of cognitive load at play that contribute to the total cognitive load and potential overload. Since it is important that the cumulative cognitive load does not exceed the working memory capacity (Paas et al. 2003), the goal should be to minimize the cognitive load caused by processing the information related to efficiently solving a task (intrinsic and extraneous load) and free capacity for the load used for dedicated and focused performance (germane load) (Bannert 2002).

Intrinsic load:
The intrinsic load relates to the complexity of a task. It refers to the amount of interacting elements that must be simultaneously handled by the working memory. The intrinsic load can be lowered by simplifying the task or reducing the amount of interacting elements. The human mind can deal with intrinsic cognitive load by storing information in the long-term memory and retrieving it only when needed or by automating repeated cognitive processes and behaviors. For this reason, experience is fundamental to reach efficiency in a task (van Bruggen et al. 2002). Tools that contribute to lowering the intrinsic load in software development help with these functions, by storing information and providing it in the right moment (LaToza et al. 2020) or by automating repetitive tasks (Rafi et al. 2012). Some tools to support code review simplify unnecessary processing of interacting elements by partitioning changes into smaller related portions (Tao and Kim 2015;Barnett et al. 2015) or by providing a summary through visualizations (Tymchuk et al. 2015).

Extraneous load:
The extraneous (ineffective) load is caused by the need to process unnecessary or unrelated information, thus harming the performance. For instance, the need to switch contexts/documents, to understand unclear documentation, and to search for information without available pointers are situations impacting the extraneous cognitive load.
Checklists can also be employed to reduce reviewers' extraneous cognitive load. They help developers to focus their attention on the specific areas of code that need inspection (Gutha 2015;Gridnev 2017;Rong et al. 2012).

Germane load:
The germane (effective) load comes from the effort put in solving a task. This type of load is helpful for developers' effectiveness and efficiency. It is related to motivation (higher determination also poses higher germane load) but also to previous knowledge about the issue (less effort is needed to solve the task if the developer already has the needed knowledge). In practice, this type of cognitive load can be created, for instance, by introducing gamification in code reviews to improve the interest and motivation of developers (Khandelwal et al. 2017;Unkelos-Shpigel and Hadar 2015).
Intrinsic cognitive load is the most difficult to manipulate, as often there is no choice in how complex the materials necessary for completing a task are. This limits the working memory available to deal with sub-optimal information inputs or that can be put in motivated and dedicated performance. Therefore, the effect of lowering the ineffective load and raising the effective one is particularly important when dealing with challenging reviews (Bannert 2002).
As shown in Fig. 1, the cognitive load for an individual is determined by their characteristics (e.g., available knowledge), the characteristics of the task (e.g., complexity, type of problem), and their interaction (experience with a specific type of problem). Therefore, to lower the cognitive load in a task, interventions can be done to improve individuals' abilities, adjust tasks to pose a lower cognitive load, or optimize the fit between the needs and possibilities of individuals and the tasks they are performing.
With respect to measuring and assessing, cognitive load is conceptualized through the affected factors represented on the right-hand side in Fig. 1 (Paas and Van Merriënboer 1994). Mental load represents the demands posed on the cognitive system by the task itself, while mental effort represents the cognitive demands consciously allocated to solving the task.

Fig. 1 Conceptual framework for understanding and assessing cognitive load, adapted from Paas and Van Merriënboer (1994)

Several methods have been proposed to lower mental load, i.e., the cognitive demands of tasks. Each of them addresses different types of cognitive demands. Apart from the demands originated from the complexity of the information (essential to the task) and the demands caused by processing the incidental (unessential) information, it is demanding to hold the information in working memory for a long time (Mayer and Moreno 2003). Table 1 presents a list of methods to reduce mental load from Mayer and Moreno (2003). Among these methods, some have already been integrated into tools to support code review. For instance, tool support such as visualizations or change ordering incorporate methods such as parallel processing of verbal and graphical information, reducing visual scanning, segmenting information, or reducing the processing of unessential information for the review.
In summary, mental load and mental effort both contribute to the overall cognitive load and potential cognitive overload. Making review less demanding on available cognitive resources can help to prevent cognitive overload. Therefore, in this study we aim to prevent cognitive overload by reducing the mental load code review is posing on the developers' mind. We measure cognitive load as the goal concept and refer only to this concept in the following sections.

Code Inspection Reading Techniques and Strategic Reviewing
This study aims to explore whether the use of an explicit strategy to review code improves review performance. The idea of defined processes and steps for reviewing is integral to formal code inspection (Ebad 2017).
Over the years, multiple reading techniques have been developed to guide developers in inspecting code and other types of documents (Ebad 2017). Some code reading techniques are simple checklists that focus reviewers on certain aspects to ensure that these are checked, while others offer an explicit step-by-step guide to follow to review the artifact at hand (Baum 2019). However, modern code review does not commonly apply formal inspection reading techniques (Baum et al. 2017a), focusing more on the advantages offered by the use of review-specific tools (e.g., Gerrit, Microsoft CodeFlow (Greiler 2021), Facebook's Phabricator, and Atlassian Crucible), as also described in Section 2.1.
Checklists are an example of a reading technique that has been used not only for code inspection (Thelin et al. 2003), but also for other types of code review (Gutha 2015;Gridnev 2017;Rong et al. 2012). Checklists utilize signaling (see Table 1) and are thought to improve performance through lowering cognitive load (Kamsties and Lott 1995). They have been found to be an efficient aid for finding defects (Rong et al. 2012), but are outperformed by reading techniques that follow a specific reviewing scenario (Abdelnabi et al. 2004;Denger et al. 2004). This suggests that guidance that shows reviewers how to proceed with the review by further signaling cues for what to look for, where, and when may be beneficial. Nevertheless, the positive effect of explicit strategies is not supported by all studies (McMeekin et al. 2009;Lanubile et al. 2004) and checklists seem to be better accepted by reviewers compared to reading scenarios.
The importance of defined cognitive processes and their systematic execution is recognized when aiding software development tasks like debugging (LaToza et al. 2020; Ko et al. 2019). These strategies for executing programming tasks take advantage of the functionalities that tools provide to lower developers' cognitive load by storing and managing the information needed to solve the issues. By building on methods to reduce mental load, we have developed a tool-supported reviewing strategy to assist developers in improving their code review performance.

Table 1 Methods to reduce mental load, from Mayer and Moreno (2003)

Type of cognitive demand: Essential
Off-Loading: Allow parallel processing of verbal and graphical information.
Segmenting: Present information in segments rather than as a continuous unit.
Pre-training: Provide knowledge of names and behaviors of system components.

Type of cognitive demand: Essential + Incidental
Weeding: Eliminate or reduce processing of unessential information for the task.
Signaling: Provide cues for how to process the material.
Aligning: Place related information together to reduce visual scanning.
Eliminating Redundancy: Avoid unnecessary repeating of information.

Type of cognitive demand: Essential + Representational Holding
Synchronizing: Present verbal and graphical information simultaneously to minimize the need to hold information in working memory.
Individualizing: Take into account individual resources and abilities, such as working memory or learning style.

Research Questions
Supporting developers with reading techniques (e.g., checklist) has been found to be an efficient way to help reviewers find defects during code inspection (Biffl 2000). However, checklists were found to be less effective compared to reading techniques that offer guidance on how to review, such as Systematic Order-based Reading (Abdelnabi et al. 2004). Assisting developers in defining and executing strategies for software development tasks (e.g., debugging) increases developers' productivity (LaToza et al. 2020).
The positive outcome of previous research suggests that incorporating methods to reduce developers' cognitive load (e.g., signaling (Mayer and Moreno 2003)) can positively affect review performance. In this study, we investigate whether code review efficiency and effectiveness can be improved using additional methods to reduce developers' cognitive load. We compare a tool-supported systematic guidance on how to perform a review (guided checklist) to guidance only on what to look for in the review (checklist) and to an ad hoc review where developers perform the review according to their own process. We formalize our first research question into the following hypotheses:
H 1.1 : There are differences in review effectiveness between ad hoc review, checklist, and guided checklist.
H0 1.1 : There are no differences in review effectiveness between ad hoc review, checklist, and guided checklist.
H 1.2 : There are differences in review efficiency between ad hoc review, checklist, and guided checklist.
H0 1.2 : There are no differences in review efficiency between ad hoc review, checklist, and guided checklist.
Checklists, as well as tool-supported strategies, are expected to systematize the activity of the developers, thus lowering their cognitive load by reducing the amount of information they have to keep in mind and helping them to focus on relevant issues (Paas and Van Merriënboer 1994;LaToza et al. 2020). In the software engineering literature, however, we found no direct measurement of the effect of tools on cognitive load and its effect on code review performance. Therefore, we formalize our second research question into the following hypotheses:
H 2.1 : Cognitive load mediates the relationship between the guidance approach and review effectiveness.
H0 2.1 : Cognitive load does not mediate the relationship between the guidance approach and review effectiveness.
H 2.2 : Cognitive load mediates the relationship between the guidance approach and review efficiency.
H0 2.2 : Cognitive load does not mediate the relationship between the guidance approach and review efficiency.
Figure 2 summarizes our research questions and their link with the key concepts of our investigation: e.g., cognitive load and review performance.

Methodology
Having reported the goal of our experiment and our hypotheses in the previous section, in this section we describe the experiment planning, following the reporting guidelines of Jedlitschka et al. (2008).
According to the methodology presented in our registered report (Gonçalves et al. 2020), we set up a controlled experiment where developers have to complete three code review tasks searching for defects. Each participant is randomly assigned to one of three possible treatments: (1) a control treatment with no guidance (henceforth: 'ad hoc review'), (2) checklist supported review ('checklist'), and (3) strategic checklist execution ('guided checklist').
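As a minimal sketch of random assignment to the three treatments, the snippet below shuffles the participants and deals them out in turn. The balanced (round-robin) allocation and the function name are assumptions made for illustration; the text only states that assignment is random.

```python
import random

TREATMENTS = ("ad hoc review", "checklist", "guided checklist")


def assign_treatments(participant_ids, seed=None):
    """Randomly assign participants to treatments with near-equal group sizes."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled participants out to the treatments in turn.
    return {
        pid: TREATMENTS[index % len(TREATMENTS)]
        for index, pid in enumerate(shuffled)
    }
```

With 70 participants, this scheme yields groups of 24, 23, and 23.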

Study Participants
As reported in our registered report (Section 3.4) (Gonçalves et al. 2020), we performed a power analysis to estimate the sample size needed to identify existing differences between the treatment groups. Based on previous studies, we do not expect a large effect size to appear (Dunsmore et al. 2001). The sample size is calculated using a convention for an ANOVA medium effect size (Cohen 1992). The estimated total sample size is 66 participants. Based on this analysis, we hired 70 developers from a software development outsourcing company located in India to take part in our experiment. The company has more than 2,000 employees and provides a wide range of services (e.g., DevOps, web development, and mobile development).
We contacted the developers through the company and they completed the experiment as part of their job. We requested all developers to have experience in Java, but we had no further control over the selection of the sample, which was up to the project manager at the company. We had the option to ask for additional developers in case of irregularities in data or drop-outs.

Descriptives
Characteristics of the study sample are described in Figs. 3, 4, and Table 6. After the data cleaning, our participants' sample consisted of 67 professional Java developers, counting 66 programmers and one tester. Among them, 54 identified themselves as male and 13 as female. The age of the participants ranged between 22 and 33 (M = 26.85). Furthermore, 28 participants had a B.Sc. in Computer Science and 18 had an M.Sc. in Computer Science, so that 68.7% of the study participants held a university degree.
Most participants had no experience with jEdit, the system used in the review tasks. However, five of them used it in the past and six have contributed to the jEdit code base. While analyzing collinearity in the data, we did not find a significant relationship between experience with jEdit and performance in the experiment reviews.
Our sample consisted of professional Java developers; however, many of them did not have code review experience. Several (39) of them had already worked more than 8 hours before doing the experiment and 30 of them reported being moderately or very stressed before the experiment. Participants reached low effectiveness and efficiency, as reported in Table 6. Furthermore, we verified developers' understanding of the change-set at the end of each review task. The results are shown in Table 2.

Experiment Treatments and Materials
In this section, we describe the materials used in our experiment. First, we describe the experiment UI; then we present the three treatments (i.e., ad hoc review, checklist, and guided checklist). Moreover, we describe how we measured the cognitive load of the participants as well as how we assessed the usability of the devised guidance approaches. An explanation of the materials is provided in our registered report (Gonçalves et al. 2020).

Experiment UI
To conduct our experiment, we use a web-based tool (Fig. 5 shows an example view) that allows participants to complete the experiment remotely. We log participants' answers, environment, and UI interactions. The tool was built and used in our previous work (Baum et al.

Treatment: Ad hoc Review
The ad hoc review (Uwano et al. 2006) condition is our control group, which we use as a baseline to evaluate participants' review performance. Developers assigned to this treatment use the same web-based experiment platform as the other treatments to perform the review tasks. All participants review the same tasks regardless of the treatment to which they are assigned. Developers in the ad hoc review group do not receive any specific aid during the review and can carry on the reviews as they prefer.

Treatment: Checklist
Checklists provide cues on where to focus attention to find common defects and improve the usage of cognitive resources (Bannert 2002). This can be seen as using Signaling to reduce cognitive load (see Table 1).

Fig. 4 Participants' demographics (2)
Developers assigned to the "Checklist" treatment of our experiment are required to identify defects using a checklist. We developed this checklist based on items from Microsoft checklists (McConnell 2004) and recommendations in the literature: A good checklist (1) requires a specific answer for each item, (2) separates items by topic, and (3) focuses on relevant issues (Degani and Wiener 1991; Kamsties and Lott 1995;Chernak 1996). Checklists should specify the scope in which items must be checked (e.g., "for each method/class") to prevent developers from memorizing big portions of code and jumping through it (Kamsties and Lott 1995).
Following these recommendations, we created the checklist for our experiment. For each defect, our checklist contains at least one item related to the issue but without explicitly pointing at it. Thus, the checklist contains items relevant to the review at hand but does not give obvious clues about the type or location of the defects. We created an initial version of the checklist and assessed its quality with three Java developers with experience in code review. Based on the collected feedback, we improved the items in our checklist. Then, we repeated this process with a new set of three developers.
The final version of the checklist contains 18 items, grouped by their scope (general, class, or method). For each item, developers can indicate whether they considered it without finding any defect or they found a defect while inspecting it. Reviewers are not forced to check every item, but a warning is shown if they attempt to complete the review without having marked all checklist items as checked. The checklist is displayed as a lateral bar on the left side of the screen (Fig. 5). Developers can open or close the checklist bar by clicking on the collapse checklist button on the top-left corner of the screen. Furthermore, we ask developers to note any defects that they encounter, even if they are unrelated to the content of the checklist.
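The item states and the completion warning described above can be sketched as follows. This is an illustrative model only: the class names, the state labels, and the warning text are assumptions, not the implementation of the experiment tool.

```python
from dataclasses import dataclass
from enum import Enum


class ItemState(Enum):
    UNCHECKED = "unchecked"
    NO_DEFECT = "considered, no defect found"
    DEFECT_FOUND = "defect found"


@dataclass
class ChecklistItem:
    scope: str  # "general", "class", or "method"
    text: str   # hypothetical item wording
    state: ItemState = ItemState.UNCHECKED


def completion_warning(items):
    """Return a warning if any item is still unchecked, otherwise None.

    The reviewer is warned rather than blocked, mirroring the checklist
    treatment, where checking every item is not enforced.
    """
    pending = [item for item in items if item.state is ItemState.UNCHECKED]
    if not pending:
        return None
    return f"{len(pending)} checklist item(s) have not been marked as checked."
```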
The checklist items are reported in Appendix A.1; the mapping to the defects they help identify is in our replication package.

Treatment: Guided Checklist
The human brain has a great potential to retrieve complex information and consequently make contextualized decisions. A tool-supported strategy could free the mental capacity to do these tasks by aiding systematic execution of steps and providing relevant information when needed (LaToza et al. 2020). Therefore, providing explicit strategies to perform code review might support developers by reducing their cognitive load and improving their performance.
The guided checklist is a version of the previous checklist (Section 4.2.3). Checklists should specify the scope for which an item must be checked: e.g., "for each class". Differently from a classic checklist, the guided checklist is not static but iterates over the classes and methods in the review change-set. This allows a detailed step-by-step review of each relevant checklist item, e.g., "For the class VFSBrowser, please check . . .".
In comparison to the checklist, the implementation of the guided checklist is improved by multiple methods for lowering cognitive load, as seen in Table 1. The guided checklist uses the same items and signals what to look for. Additionally, it (1) segments the task into smaller units, (2) reduces the need to hold information in the working memory by iterating through classes and methods, (3) reduces visual scanning by highlighting chunks and asking focused questions on a specified piece of code, and (4) minimizes the processing of unessential information by offering only items relevant to that chunk. Therefore, even though both checklist and guided checklist use signaling as a method to reduce cognitive load, the guided checklist is expected to reduce the amount of information a developer needs to process at a time and the scope to which they need to pay attention: the signal is more precise. Thanks to the identical content of checklist and guided checklist items, we can conceptually separate the effect of additional measures for reducing cognitive load.
The guided checklist is implemented as a top bar in the review task interface (Fig. 6). It displays the same items as the checklist. Differently from the checklist, items are not shown all at the same time, but participants are explicitly asked first to check the general items, then the class and method ones. The execution flow of the guided checklist is reported in Algorithm 1. We display only the items that are relevant for the selected code chunk. Furthermore, the strategy highlights to the user which code chunk(s) they are currently reviewing. The user must explicitly mark the items as checked before being able to proceed with the review.

Fig. 6 Example of a strategy item in the web-based experiment UI
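The ordering described above can be sketched as a generator. This is not a reproduction of Algorithm 1: the per-class nesting of method items, the data shapes, and the `is_relevant` predicate are assumptions made for illustration.

```python
def guided_steps(items, change_set, is_relevant=lambda item_text, chunk: True):
    """Yield (chunk, item_text) pairs in the order a reviewer would see them.

    items: list of (scope, text) tuples, scope in {"general", "class", "method"}.
    change_set: dict mapping each changed class name to its changed method names.
    is_relevant: hypothetical predicate deciding whether an item applies to a chunk.
    """
    by_scope = {"general": [], "class": [], "method": []}
    for scope, text in items:
        by_scope[scope].append(text)

    # General items come first and apply to the whole change-set.
    for text in by_scope["general"]:
        yield ("change-set", text)

    # Then iterate over each class and, within it, over each changed method.
    for class_name, methods in change_set.items():
        for text in by_scope["class"]:
            if is_relevant(text, class_name):
                yield (class_name, text)
        for method_name in methods:
            chunk = f"{class_name}.{method_name}"
            for text in by_scope["method"]:
                if is_relevant(text, chunk):
                    yield (chunk, text)
```

Because the steps are produced lazily, a UI could request the next step only after the reviewer marks the current item as checked.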

Cognitive Load
To measure cognitive load, we use a standardized questionnaire (StuMMBE-Q (Krell 2017)) that captures the two components of cognitive load (i.e., mental load and mental effort) in two 6-item sub-scales. The items are rated on a 7-point Likert scale. The individual responses are recorded as a score from 1 to 7. These scores are averaged to achieve a final score directly comparable to the response anchors. The scale contains no reverse-scored items. Effort and difficulty ratings are reliable measures for the cognitive processing that contributes to cognitive load (DeLeeuw and Mayer 2008). While there are other potential measures of cognitive load, such as response time to a secondary task (Paas et al. 2003), we use a questionnaire because it does not require the physical presence of the respondents or the usage of special equipment. Moreover, it does not directly interfere with the code review performance.
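The scoring described above can be sketched as follows. The function name `subscale_score` and the sample ratings are hypothetical, and averaging each 6-item sub-scale separately is an assumption, since the text does not state whether the two sub-scales are also combined.

```python
from statistics import mean


def subscale_score(ratings):
    """Average six ratings on a 1-7 scale; no items are reverse-scored."""
    if len(ratings) != 6:
        raise ValueError("each sub-scale has exactly 6 items")
    if any(r < 1 or r > 7 for r in ratings):
        raise ValueError("ratings must lie between 1 and 7")
    return mean(ratings)


# Hypothetical responses of one participant after one review task.
mental_load = subscale_score([5, 6, 5, 4, 6, 5])
mental_effort = subscale_score([1, 2, 3, 4, 5, 6])
print(mental_load, mental_effort)
```

Because the result is a plain mean, it stays on the same 1 to 7 range as the response anchors.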

Usability of the Treatment Implementation
We adapted the System Usability Scale (Brooke and et al 1996) to measure the usability of the devised guidance approaches by rephrasing its items to fit the evaluation of the checklist and the guided checklist. Using the scoring manual, the treatment is graded on a scale from A to F, i.e., from Excellent to Awful (UsabiliTEST 2020).
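For reference, the standard System Usability Scale score can be computed as in the sketch below. It assumes that the adapted questionnaire keeps the standard ten items, the 1 to 5 response range, and the usual alternation of positively and negatively worded items; the function name `sus_score` is hypothetical, and the mapping from score to letter grade is left out because it depends on the scoring manual used.

```python
def sus_score(ratings):
    """Standard SUS scoring for ten ratings on a 1-5 scale, giving 0-100."""
    if len(ratings) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for position, rating in enumerate(ratings, start=1):
        if position % 2 == 1:
            total += rating - 1   # odd items are positively worded
        else:
            total += 5 - rating   # even items are negatively worded
    return total * 2.5


print(sus_score([3] * 10))
```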

Tasks
Participants in our experiment were asked to complete three code review tasks. Moreover, before starting the review, developers were shown a tutorial to familiarize themselves with the review UI used in the experiment. In the following, we describe how the tutorial and the review tasks are implemented. The code of the tasks used in the experiment is available in our replication package.

Tutorial
The tutorial (Fig. 7) shows a brief code review consisting of one file. To familiarize themselves with the review UI before proceeding to the experiment, reviewers perform three tasks in the 'checklist' and 'guided checklist' conditions, or two tasks in the 'ad hoc' condition: (1) click on the view more context button to expand the context of a review change; (2) insert a remark; and (3), only in the checklist and guided checklist treatments, mark an item of the checklist (guided checklist) as complete. The code to be reviewed contains a bug for the reviewers to find.

Code Review Tasks
First, participants review a short, simpler change-set, then they review two distinct, longer change-sets. The first change-set contains three defects, while the others contain nine and ten, respectively. The two large review tasks are presented to the participants in a randomized order. Among the initial 70 participants, 36 reviewed Large Change A first, while 34 were assigned to Large Change B first. Table 3 describes the review tasks (also available in our replication package) and Table 4 provides the number of participants assigned to each combination of treatment (ad hoc review, checklist, or guided checklist) and change order (Large Change A first or Large Change B first).
Fig. 7 The Experiment Tutorial. In the first task of the tutorial, participants need to click on the show more context button to expand the context of a review change, and then click on the continue button to proceed to the next task of the tutorial
The code changes for the review are taken from a previous experiment on code review and contain both original and seeded defects. The review changes are extracted from an existing open-source project named jEdit that was successfully employed in previous studies (Rothlisberger et al. 2012;Baum et al. 2019). To control for potential bias caused by developers' familiarity with this system, we explicitly ask developers about their previous experience with it. We instruct the participants to focus only on functional defects.

Variables
Our study relies on a number of quantitative measures concerning both the performance and the perception of the participants. Table 5 reports the variables considered in our study and presented in Section 3.1 of our registered report (Gonçalves et al. 2020).

Remark Evaluation and Review Performance
Developers enter their review comments (remarks) in the code review UI by writing in a text area that appears once they click on a code line. As done in previous experiments by Baum et al., we count a comment as referring to a defect if and only if it is in the right position and can make a reader aware of the defect. If a comment is on the right line but highlights an unrelated issue, we do not count it. The first two authors independently classified the comments of the ten developers assigned to the first iteration of the experiment. The first iteration contained 131 remarks. These were marked as either pointing to a defect (specifying which defect) or as false positives. Then, we computed the agreement between the two authors involved in this process using Cohen's kappa (Kvålseth 1989). They reached an inter-rater agreement of 0.769; disagreements (N = 7) were discussed to reach a consensus.
Table 3 Code change sizes, complexity, and number of correctness defects. The code and defects were previously used and described in Baum et al.
Afterward, we proceeded in an iterative fashion: The two authors independently evaluated two other batches of 131 remarks each. Then, the authors computed the agreement and discussed cases of disagreement until a consensus was reached. A Cohen's kappa of 0.891 and 0.806 was reached classifying the second and third batches of remarks, respectively. Since the inter-rater agreement between the authors involved in the classification achieved good results, the rest of the remarks were split between the authors for the classification. At this stage, the authors discussed only cases deemed as unclear during their individual work (N=14).
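As an illustration of the agreement measure, the sketch below computes Cohen's kappa for two raters who assign one label per remark. The function name `cohens_kappa`, the label strings, and the four sample remarks are hypothetical and do not come from the study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of remarks")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement from the product of each rater's marginal proportions.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Each remark is labeled with a defect identifier or as a false positive.
first_author = ["defect-1", "defect-1", "false-positive", "false-positive"]
second_author = ["defect-1", "defect-1", "false-positive", "defect-1"]
kappa = cohens_kappa(first_author, second_author)
print(kappa)
```

In this toy example the raters agree on three of four remarks, but half of that agreement is expected by chance, which is why kappa is lower than the raw agreement rate.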
Once all the remarks are classified, we evaluate the review performance (review effectiveness and efficiency) of the experiment participants. We measure (1) developers' review effectiveness (the percentage of discovered defects in the task) and (2) review efficiency (the number of defects found per minute spent reviewing).
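The two performance measures reduce to simple ratios, as in the sketch below. The function names and the sample numbers are hypothetical; the time unit of the efficiency ratio is whatever unit the review time is expressed in.

```python
def review_effectiveness(defects_found, defects_total):
    """Share of the defects in the change that the reviewer found."""
    return defects_found / defects_total


def review_efficiency(defects_found, review_time):
    """Defects found per unit of review time (unit chosen by the caller)."""
    return defects_found / review_time


# Hypothetical reviewer: 1 of the 3 defects in the small change, in 20 minutes.
effectiveness = review_effectiveness(1, 3)
per_minute = review_efficiency(1, 20)
per_hour = review_efficiency(1, 20 / 60)
print(round(100 * effectiveness, 1), per_minute, per_hour)
```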

Experiment Design and Procedure
In our experiment, we use the measurement-of-mediation design (Spencer et al. 2005). The experimental design manipulates the independent variable (type of guidance), while the mediator (cognitive load) and the dependent variables (review effectiveness and efficiency) are measured.

Table 5 Variables considered in the study (excerpt)
[...] Pauses are toggled by the participant.
Review effectiveness. Ratio of defects found by the participant over the total number of defects in the code change (Biffl 2000). Scale: ratio. Computed at the end using the number of detected defects and the total number of defects.
Review efficiency. Number of defects found per hour spent reviewing (Biffl 2000). Scale: ratio. Computed at the end using the number of detected defects and the review time.
Cognitive load. Load imposed on a person's cognitive system while performing a particular task (Paas and Van Merriënboer 1994). Scale: ordinal. See Section 4.2.5.
Usability. Perceived efficiency, effectiveness, and satisfaction in the use of an object (Brooke and et al 1996).

Figure 8 shows the flow of our experiment. Differently from what we reported in our registered report (section 3.5) (Gonçalves et al. 2020), we included a short tutorial in the explanation step (step 2) to allow participants to familiarize themselves with the experiment UI. At the start (Step 1 in Fig. 8), developers are briefed on the experiment and the data handling policy. We ask for informed consent and explicitly request the developers to not share information about the experiment with each other. Then (Step 2), participants are randomly assigned to one of the three treatments and have to complete a short tutorial before proceeding to the review tasks. The tutorial aims to ensure that all participants reach a clear understanding of the tasks, possess a basic level of familiarity with the review platform, and experience their specific guidance approach (if any). A description of the tutorial is given in Section 4.3.1.
After each review, participants are administered a standardized questionnaire (Krell 2017) to measure cognitive load relating to the review they just did (see Section 4.2.5; Step 4 in Fig. 8).
When the participants assigned to the checklist or guided checklist treatment complete the review tasks, we ask them to answer an adapted version of the System Usability Scale (Brooke and et al 1996) (see Section 4.2.6; Step 5 in Fig. 8).
At the end of the experiment (Step 6), we collect demographic data to gather descriptive characteristics of our sample and intervening variables such as programming and Java experience, coding and reviewing frequency, education, and current stress level.
Developers access the experiment online via a provided URL. Throughout the experiment, we control developers' comprehension of the system by asking questions about the change under review because code comprehension is an important condition for good reviews (Bacchelli and Bird 2013;Pascarella et al. 2018). The comprehension questions are taken from a previous experiment on code review and are described in the related replication package (Baum et al. 2018).
The experiment was conducted in three iterations, with the aim of adjusting the experiment setup, if necessary. A group of ten developers took part in the first iteration of the experiment, while 30 developers were allocated to each of the second and third iterations. Apart from asking for participants with higher review experience at the end of the first iteration, we did not make any further adjustments to the experiment.

Data Cleaning
According to the outline planned in our registered report (Gonçalves et al. 2020), participants who spent less than 5 minutes on a review or did not enter any review remark were classified as NAs (for each review task). We also checked for participants who restarted the experiment or participated several times, in order to exclude them (we collect cookies and client IPs, the latter hashed to guarantee data anonymization). None of the participants was removed as a result of this process. However, three developers did not answer the experiment's demographics questions. Therefore, we excluded them from the final dataset. This left us with a final sample of 67 developers.
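The per-task exclusion rule can be written as a small predicate, sketched below. The names `review_is_valid` and `MIN_REVIEW_MINUTES` and the sample records are hypothetical; `None` stands in for the NA value assigned to an excluded review.

```python
MIN_REVIEW_MINUTES = 5  # threshold stated in the registered report


def review_is_valid(minutes_spent, remark_count):
    """A review counts only if it took at least 5 minutes and has a remark."""
    return minutes_spent >= MIN_REVIEW_MINUTES and remark_count > 0


# Hypothetical (minutes spent, number of remarks, defects found) per review.
reviews = [(12.0, 3, 1), (3.5, 2, 0), (20.0, 0, 0)]
defects_or_na = [
    found if review_is_valid(minutes, remarks) else None
    for minutes, remarks, found in reviews
]
print(defects_or_na)
```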

Analysis Plan
In this section, we present our analysis plan (originally stated in our registered report (Gonçalves et al. 2020)). Since the developers in the sample reached low review effectiveness and efficiency and the data provided only limited evidence of the relationships we aimed to investigate, we had to adjust the analysis we could perform significantly, as described in Section 4.7.
In RQ 1 , we perform a One-way ANOVA to identify whether there is a significant difference in code review effectiveness and efficiency among the three treatment groups. Specifics of these differences are explored using Tukey's Range Test for the post-hoc analysis. In response to RQ 1 , we also present the first regression model used in answering RQ 2 as it refers to the relationship between guidance and performance as well.
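As a reminder of what the one-way ANOVA compares, the sketch below computes the F statistic by hand as the ratio of between-group to within-group mean squares. The function name `one_way_anova_f` and the three groups of scores are hypothetical; the p-value and the Tukey post-hoc test are omitted and would normally come from a statistics package.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores for the ad hoc, checklist, and guided checklist groups.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f_stat, df_b, df_w)
```

A large F indicates that the treatment means differ by more than the spread within each treatment would suggest; it does not say which pairs of treatments differ, which is the role of the post-hoc test.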
We aimed to use mediation analysis (Spencer et al. 2005) for RQ 2 as described and formulated by Imai, Keele and Tingley (Imai et al. 2010). Mediation analysis combines regression models to assess the size of a direct and indirect (mediated) effect of an independent variable on the dependent one. Separate regression models are built for (1) the direct effect of the guidance on code review effectiveness and efficiency, (2) the effect of guidance on cognitive load as a mediator, and (3) the effect of cognitive load on code review performance while controlling for the effect of guidance and other control variables. As last step, (4) a mediation model is built using the regression models as arguments to calculate the significance of the indirect effect.
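The role of the separate regression models can be illustrated with the classic product-of-coefficients decomposition, sketched below for a binary treatment. This is a simplification: it is not the simulation-based estimator of Imai et al. nor the `mediation` R package, it includes no control variables or significance tests, and the helper `ols` and all data values are hypothetical.

```python
def ols(predictors, y):
    """Least-squares coefficients [intercept, b1, ...] via normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(rows[0])
    # Augmented system (X'X | X'y).
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * y[k] for k, r in enumerate(rows))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [u - factor * v for u, v in zip(a[r], a[c])]
    return [a[i][p] / a[i][i] for i in range(p)]


# Hypothetical data: guidance (0 = ad hoc, 1 = checklist), load, effectiveness.
guidance = [0, 0, 0, 0, 1, 1, 1, 1]
load = [5, 6, 5, 6, 3, 4, 4, 3]
effectiveness = [0.1, 0.3, 0.2, 0.3, 0.2, 0.4, 0.3, 0.2]

total = ols([guidance], effectiveness)[1]                # model (1)
a_path = ols([guidance], load)[1]                        # model (2)
_, direct, b_path = ols([guidance, load], effectiveness) # model (3)
indirect = a_path * b_path                               # mediated part
print(total, direct, indirect)
```

In linear models fitted by least squares, the total effect equals the direct effect plus the indirect effect, which is why a missing total effect leaves little for a mediator to explain.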
We planned to construct the models employed in our analysis using the mediation R package (Tingley et al. 2014). The type of guidance is considered as the independent variable, code review effectiveness and efficiency as the dependent variables, and the cognitive load as the mediator. To conclude a mediated effect of the treatment on the outcome variable, there must be a significant direct relationship between the treatment and the outcome variables in the model (1) and a significant relationship between the treatment and the mediator and between the mediator and the outcome in models (3) and (4). Model (4) also calculates the overall significance of the path from treatment to the outcome through the mediator (Tingley et al. 2014). If the direct relationship between guidance and code review performance remains significant in the models (3) and (4) [...]

Correlations and Collinearity
Figure 9 presents the statistically significant correlations among core and control variables for our analysis (Pearson correlation, p < 0.05). Apart from programming, Java, and code review experience being inter-correlated, we observed a relationship between developers' understanding of the change (measured as the number of correct answers to questions concerning the reviewed change) and the review time (r(48) = .46) as well as a negative [...]. In other words, as expected, developers who spent more time on the review had a higher understanding of the change, and experienced Java developers and reviewers found the reviews less cognitively demanding. In our sample, developers with more programming and Java experience performed the experiment after fewer hours of work (r(48) = −.28 and −.28).
We performed a linear regression to assess which of our control variables are predictors of the output variables, including age, gender, experience with jEdit, and other demographic variables. We found stress, hours worked before the experiment, and review time to be significant predictors of code review effectiveness (p < 0.1, aggregated for all three changes). More hours worked and lower stress were also predictors of higher code review efficiency (p < 0.05) in the Large Change A and the hours worked were significant as well in the model aggregating data from all reviews (p < 0.1). Furthermore, we calculated the Variance Inflation Factor (VIF), finding no multicollinearity.
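For intuition about the collinearity check, the sketch below computes a Pearson correlation and the Variance Inflation Factor for the special case of two predictors, where VIF = 1 / (1 - r^2). With more predictors, r^2 is replaced by the R^2 obtained from regressing each predictor on all the others. The function names and the sample values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical values of two control variables for four participants.
stress = [1, 2, 3, 4]
hours_worked = [1, 3, 2, 4]
r = pearson_r(stress, hours_worked)
vif = vif_two_predictors(stress, hours_worked)
print(r, vif)
```

A VIF close to 1 means a predictor shares little variance with the others; values that grow large signal that the coefficient estimates become unstable.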

Adjustments
The methodology of the study has been pre-registered as a Registered Report at MSR'20 (Gonçalves et al. 2020). We committed to following the pre-approved study design. However, we found that developers' performance throughout the experiment was poor despite the proposed treatments. This presented a challenge for the analysis as there was very little variance in the values of code review effectiveness and efficiency, which limited our ability to perform the envisioned mediation analysis. Mediation analysis can be performed only if a relationship between the dependent and independent variable has been established. If the relationship is not clearly established, it also cannot be mediated and therefore the mediation analysis becomes unsuitable. This proved to be the case, as reported in Sections 5.1 and 5.2. Mediation analysis is built in several steps, building regression models to investigate the relationships suggested in Fig. 2 and a model where the code review performance is predicted with guidance, cognitive load, and other control variables at the same time (equivalent to model (3) in Section 4.6.2). This sequence of regression models was used to answer RQ 2 . Furthermore, since one of the regression models investigates the direct relationship between guidance and code review performance (see model (1) in Section 4.6.2), we report its results in answering RQ 1 .
We attempted several strategies to mitigate the low performance of developers. The first attempt was to avoid excluding some participants by trying to pinpoint those who did indeed perform the task, but just performed poorly. To this aim, we used information about the amount participants scrolled during the experiment as well as their understanding. Participants who were marked as missing values due to lack of review comments or very fast reviewing were included in the analysis with their recorded values if they scrolled and correctly answered at least one understanding question after the review. This strategy resulted in several participants being included in the analysis. Several coefficients in our analysis changed, but the strategy did not improve the quality of the data for analyzing the relationship with code review effectiveness and efficiency, as it raised the portion of inefficient developers included in the analysis. Therefore, we discarded this approach.
The second strategy we employed was to exclude participants who potentially did not understand the reviewed code sufficiently. We excluded from the analysis developers who did not answer any understanding question in a change correctly. Surprisingly, this strategy resulted in losing several cases of developers who not only entered review comments but also successfully identified defects, and it exacerbated the problem of low variance in the variables measuring code review performance.
Seeing that these attempts did not resolve the issues with data quality, we kept the original selection criteria and worked with the data of developers who spent at least five minutes on the review and entered at least one review remark. In the following sections, we present results based on these sample selection criteria.
All data and materials used in the study are available in our replication package.

RQ 1 : Does Guidance in Review Lead to a Higher Review Performance?
Our analysis addressed the relationship between guidance and code review effectiveness and efficiency in two ways: (1) comparing the means of the three treatments through a One-Way ANOVA and (2) using a regression model as the first step in the mediation analysis. The experiment participants showed overall low review effectiveness and efficiency (Table 6, Figs. 10 and 11), which made addressing our research question challenging. To compute developers' review effectiveness and efficiency, we analyzed the aggregated performance in all three review tasks as well as the results of each review change task separately.
The Small Change had a mean effectiveness of 12.5%, while Large Change A (M = 7.53%) and Large Change B (M = 2.41%) were more cognitively demanding and developers found fewer defects, as reported in Table 6, Figs. 10, 11, and 12.
Using ANOVA (see Table 7), which compares the means of multiple groups, we found the only significant difference to be in Large Change B, where the checklist showed significantly better efficiency than the control group (p < 0.1), while the guided checklist was not significantly different from either. The Tukey's Range post-hoc test, also presented in Table 7, did not identify further differences between the three treatments. This was also confirmed in the regression model built to assess the direct relationship between guidance and efficiency, as reported in Table 8. In Large Change B, the use of the checklist is a significant predictor for review effectiveness (p < 0.1) and efficiency (p < 0.05). It seems that the checklist in the most complex change indeed helped developers to find more defects and to find them in a shorter time.
Our regression models also gave us an indication of the presence of a relationship between guided checklist usage and code review effectiveness (p < 0.05) in the Small Change. All the regression models assessing the direct effect of treatments on effectiveness and efficiency are presented in Table 8.
Overall, we cannot conclude that a strong relationship exists between guidance approaches using cognitive load reducing methods and code review effectiveness and efficiency. The developers who participated in our study achieved deficient code review performance, regardless of the treatment to which they were assigned. This situation significantly undermined the possibility of achieving statistically significant results. Nonetheless, our data provide initial indications that the guided checklist effectively supported developers in finding defects in the small task, while the checklist allowed them to be more effective and efficient in the more complex review (Large Change B), as shown by our regression model. Our ANOVA analysis confirmed that, in the more complex review change-set, the use of the checklist led to better review efficiency compared to the ad-hoc review (control group).

RQ 2 : Is the Effect of Guidance on Code Review Mediated by a Lower Cognitive Load?
In RQ 2 , we aimed to examine whether the relationship between guidance and code review performance works through lowering the cognitive load. However, the experiment's participants showed overall low review effectiveness and efficiency, and our data did not fulfill the starting condition of an existing relationship between the treatment variable and the outcome (Section 5.1). As we could not perform the mediation analysis, we used regression models to investigate the individual relationships: between guidance and cognitive load, between cognitive load and code review performance, and the joint effect of guidance and cognitive load on code review performance.
Examining the direct relationships, we found out that the checklist use significantly lowers cognitive load in the most complex change (Large Change B, p < 0.01) and in the total score for all changes (p < 0.1), see Table 8.
Furthermore, using a univariate regression model, we examined the direct relationship between cognitive load and code review performance, finding that higher cognitive load significantly predicts higher effectiveness in the Small Change (p < 0.05) and Large Change A (p < 0.1) and also better efficiency in the Small Change (p < 0.05), as shown in Table 9. These results are surprising and they are further discussed in Section 6.
In the case of checklist in the Large Change B, we can confirm that the direct effects show that it lowers the cognitive load and improves code review performance. However, the use of the guided checklist showed no statistically significant effect on the level of cognitive load.
After establishing the direct effect of the treatment, we built a regression model to estimate the effect of the treatment on the dependent variable while controlling for the mediator and control variables: i.e., the effect of the guidance, cognitive load, and control variables on  code review effectiveness and efficiency. As reported in Section 4.6.3, the control variables for effectiveness are stress, review time in minutes, and hours worked before the experiment. Stress and hours worked are intervening variables for efficiency as well. The resulting models are presented in Tables 10 and 11. When including the control variables, some of the direct effects disappeared while others emerged. In the Small Change, higher cognitive load remains related to higher effectiveness (p < 0.1) and efficiency (p < 0.01). However, the direct effect of the guided checklist on review effectiveness cannot be observed anymore. The use of the guided checklist was significantly related to a lower review effectiveness in the Large Change A (p < 0.1), suggesting that the guided checklist was not helpful for this change. Checklist remained a significant predictor of code review efficiency in Large Change B (p < 0.05). The checklist use and higher cognitive load were also predictors of efficiency considering the aggregated performance of all three review tasks.
The control variables predict code review performance too. A lower level of stress and a higher amount of hours worked before taking part in the experiment are related to higher code review effectiveness (Large Change A; all three reviews). More hours worked are also related to higher review efficiency (Large Change A and B). Furthermore, developers who spent more time on the review of Large Change A were more effective.
In our RQ2, despite not being able to perform a mediation analysis because of the low review performance of the participants, we still collected insights about the relationship between guidance, cognitive load, and code review performance. We (1) observed that higher cognitive load is a statistically significant predictor of review effectiveness (in the Small Change and Large Change A) and review efficiency (Small Change), and (2) provided initial evidence for the mediation effect for checklists as they improve review performance while reducing developers' cognitive load.

Checklists Usability
The checklist and guided checklist achieved similar usability scores (M = 57.35 and M = 58.5). Moreover, the scale assigns a letter grade to the assessed system. Both of our treatments were rated as D (Poor usability). Even though the resulting score is not optimal, the lower performance of the control group developers seems to indicate that a poor implementation of these treatments is not the main reason for the low performance. The ratings of the individual items are reported in Fig. 13.
Developers reported that they would need more time to familiarize themselves with the checklist, while the guided checklist was reported as easier to learn to use. Moreover, the guided checklist was more positively evaluated regarding its integration into the experiment UI. Both guidance approaches (checklist and guided checklist) were reported as easy to use and developers felt confident in their use. We also observed significant correlations between usability and other variables in the analysis. Among checklist users, developers who code more frequently reported a better usability of the checklist (r(22) = 0.5, p < 0.05). Also developers with higher cognitive load found the checklist more usable (r(22) = 0.35, p < 0.05). Developers assigned to the guided checklist took, on average, half an hour longer to complete the experiment compared to the developers in the control group or the checklist group. Users of the guided checklist reported lower usability with increasing time spent on the reviews (r(15) = −0.45, p < 0.05). Therefore, the time overhead posed by the guided checklist compared to the other two treatments decreased its usability. The longer time spent on the reviews potentially put an extra strain on the need of developers to hold the reviewed code in their working memory and increased the representational holding demands. We did not observe relationships between usability and code review performance.

Discussion and Lessons Learned
The developers who took part in our study achieved an overall low review performance (both in terms of review effectiveness and efficiency) regardless of the treatment to which they were assigned. Nevertheless, we observed significant relationships in the experiment data that allowed us to draw initial conclusions (1) on the benefits of providing guidance to developers during code review to lower cognitive load and (2) on the role of cognitive load in reaching code review effectiveness and efficiency. Furthermore, we collected valuable lessons learned to conduct future studies on this topic.

Review guidance reduces cognitive load:
Checklists and software development strategies aim to improve developers' performance by lowering the cognitive load (Kamsties and Lott 1995;LaToza et al. 2020). However, this relationship has not been explicitly tested yet. Our experiment provided initial evidence on how guidance can lower the cognitive load of developers. Based on the literature, we expected the reduction of the cognitive load to play a fundamental role in preventing cognitive overload in more complex changes (Bannert 2002). We indeed observed that the checklist significantly lowers cognitive load in the most complex review change (Large Change B) and on the whole review (considering all review tasks together). Our results gave an initial indication that guidance indeed lowers reviewers' cognitive load, which makes this a promising direction for further research. The review strategy did not prove to lower the cognitive load for novice reviewers. Nonetheless, further studies need to be conducted to collect insight on the relationship between guidance and cognitive load.
Higher cognitive load can improve performance: Even though checklist usage lowers cognitive load, higher cognitive load predicted both code review effectiveness and efficiency in the Small Change. This stands in opposition to the hypothesis, based on previous work in the field (Baum et al. 2017b;Pascarella et al. 2018;Kamsties and Lott 1995), that lower cognitive load leads to improved code review performance. The subject changes showed different patterns in the results: While in the Small Change, the guided checklist and higher cognitive load were more helpful, in the large changes, the checklist proved to be more efficient and to lower cognitive load, while the guided checklist led to lower effectiveness. This seems to indicate that the change complexity plays an important role in which type of guidance developers require.
If confirmed in further studies, this finding can have consequences for both researchers and practitioners: The complexity of the review change-set under analysis is a significant factor for choosing the right guidance approach to support the review.
According to the literature, lowering the cognitive load is important to prevent the upper-limit scenario when working memory is overloaded, therefore saving up limited cognitive resources for effective and efficient performance (Bannert 2002). However, we have observed a scenario where developers did not perform well regardless of the treatment they were assigned to. Furthermore, a higher cognitive load was linked to better performance: We found that a higher cognitive load led to higher review effectiveness and efficiency in the Small Change. We interpret this as the need to get engaged in the task and invest the cognitive resources into actually being able to identify the defects. If this finding is confirmed in further studies, when devising guidance approaches, researchers should take into account the positive effect that cognitive load might have on review performance. Future research should investigate how to best balance cognitive load to improve developers' effectiveness and efficiency.
Code understanding is fundamental: Previous studies reported how understanding the code is indeed one of the main challenges that developers face during code review (Tao et al. 2012;Bacchelli and Bird 2013). For this reason, researchers devised numerous approaches to increase reviewers' understanding: e.g., re-ordering review changes (Baum et al. 2017b) or untangling complex review change-sets (Barnett et al. 2015;Dias et al. 2015). The former approach aims at increasing the understandability of a review change-set by showing changes in a more meaningful order (as opposed to the alphabetical one currently offered by popular code review tools: e.g., Gerrit or Phabricator). The latter focuses instead on dividing large review change-sets into smaller ones, comprising only changes related to the same issue.
We noticed that developers in this experiment faced significant issues in answering correctly the understanding questions shown at the end of each review task. As reported in Table 2, in the Small Change participants achieved an average score of 1.21 out of 2 when answering the understanding questions at the end of the task. In the Large Change A and Large Change B, developers achieved an average of 1.45 (out of 4) and 1.37 (out of 3) correct answers. These results seem to indicate that despite the support provided by review guidance, a significant increase in review performance can not be achieved if reviewers struggle to understand the content of a review change-set. It seems reasonable to think that for review guidance approaches to be effective, it is necessary to ensure that developers possess a good understanding of the code.
This finding might have practical implications for how to support developers during code review: Researchers must not only focus on guiding developers during the review but also support reviewers in gaining a preliminary understanding of the content of a review change-set.
The experiment tasks must fit the abilities of participants: The user interface and code reviews have previously been used successfully in an experiment with a sample of developers collected online and in a company. However, the developers in our sample not only performed poorly, but also had problems answering the questions about understanding the code (as reported in Table 2). Furthermore, we found a significant correlation between lower cognitive load and programming and reviewing experience (as shown in Fig. 9). Therefore, there is a non-negligible possibility that the reviews were too difficult for these developers. This might explain why higher cognitive load was actually a significant predictor of code review performance: The developers needed to put considerable mental effort into comprehending and successfully reviewing the code. To be more successful and also provide more diverse data, a future experiment should test more types of changes and defects with the target developers before selecting the appropriate changes for the final experiment.
Autonomy adds value to guidance: Our results indicate that the guided checklist performed better in the simpler task and was unhelpful in the Large Change A, while the simple checklist was more effective in the complex task. As reported in Section 5.1, the devised regression models highlighted the existence of a relationship between the guided checklist and review effectiveness in the Small Change, while no effect was reported for the two complex changes. At the same time, the use of the checklist was shown to be correlated with higher review performance (effectiveness and efficiency) in Large Change B. Given that the guided checklist implemented additional methods to reduce cognitive load compared to the checklist, these results are surprising. We believe the reason for this difference lies in developers' autonomy.
While the checklist allows developers maximum flexibility in how to check the items, the devised guided checklist controlled the flow of the review, explicitly telling participants what to check and when. This makes the guided checklist useful for a shorter, detailed review, but it might become overwhelming for longer and more complex reviews. The importance of autonomy seems to be confirmed by LaToza et al. (2020), who designed a tool to support explicit strategies to perform software development tasks. In particular, their solution supported autonomous execution of these strategies and did not enforce the steps, but rather allowed the developers to define the steps and be flexible in navigating the code.
The importance of autonomy should be taken into account by both researchers and practitioners (e.g., project managers) when implementing approaches to improve the code review process of a project. Approaches that guide developers too strictly may deplete their cognitive resources by enforcing a too fine-grained level of review, not allowing reviewers to adapt the review style to their personal needs. This might undermine the support that these kinds of approaches aim to offer.

Familiarity with guidance takes time: The results of our System Usability Scale (SUS) questionnaire (reported in Fig. 13) show that participants believed our guidance approaches to be well-integrated in the online experiment platform (the corresponding item achieved a mean score of 3.53 for the checklist and 3.9 for the guided checklist). Nonetheless, developers reported difficulties and the need for more support in learning how to use them. Participants assigned a mean score of 3.7 (for the checklist) and 3.3 (for the guided checklist) to the "need to get more used to it" item in the SUS questionnaire. This indicates that, despite the linearity of the implemented guidance approaches and the presence of a tutorial on how to use them, participants would still have benefited from more time to familiarize themselves with these tools. Future studies could take this factor into account and either plan for longer controlled experiments or use different kinds of studies (e.g., field studies). For example, a longitudinal study would give developers time to familiarize themselves with the guidance approach under investigation.
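The text above reports per-item mean scores from the SUS questionnaire. For readers unfamiliar with the instrument, the sketch below shows both the per-item mean and the conventional overall SUS score (0 to 100), in which odd-numbered items contribute their response minus 1, even-numbered items contribute 5 minus their response, and the sum is multiplied by 2.5. This is the standard scoring scheme for the ten-item questionnaire, not necessarily the computation used in this study, and the example responses are invented purely for illustration.

```python
def sus_score(responses):
    """Compute the overall SUS score (0-100) from ten responses on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS expects exactly ten item responses")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for index, response in enumerate(responses, start=1):
        if index % 2 == 1:
            total += response - 1  # odd-numbered (positively worded) items
        else:
            total += 5 - response  # even-numbered (negatively worded) items
    return total * 2.5


def item_means(all_responses):
    """Mean response per item across participants (the kind of value quoted in the text)."""
    count = len(all_responses)
    return [sum(participant[i] for participant in all_responses) / count for i in range(10)]


# Hypothetical responses from two made-up participants; NOT data from the study.
example = [
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 3],
    [3, 3, 3, 3, 4, 2, 3, 3, 3, 4],
]

for responses in example:
    print(sus_score(responses))
print(item_means(example))
```

Per-item means, as reported in the text, keep the information about which aspect of usability was rated low, which the single aggregated score hides.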

RQ1.
Our first aim was to investigate the effect of guidance approaches (checklist and guided checklist) on developers' review performance. We expected these approaches to lead to higher review effectiveness and efficiency on all three review tasks. We hypothesized that we would observe a smaller increase in performance on the small review task compared to the two more complex review tasks: The importance of lowering the cognitive load is greater when developers perform more complex tasks and might reach cognitive overload.
However, our results showed that the guidance approaches do not lead to better participant performance on all three review changes. Our checklist increased developers' review effectiveness and efficiency only on Large Change B, while the guided checklist led to higher effectiveness only on the Small Change. Furthermore, our findings highlighted the importance of developers' autonomy (see Lesson learned 5) while performing code review. A more guided approach (guided checklist) proved effective on a small review change-set but unhelpful on a larger change-set, potentially because of the extra time overhead required by the detailed level of review and the lower usability related to it, as reported in Section 5.3.

RQ2.
In RQ2, we expected our mediation analysis to show that review guidance (checklist and guided checklist) leads to better performance by lowering developers' cognitive load. Even though both guidance approaches led to better performance (albeit in different contexts), only the checklist lowered developers' cognitive load (in Large Change B and in the total score). Only in Large Change B, the most complex change, could we observe the mediation pattern hypothesized in the literature: using the checklist lowered the cognitive load and also led to better performance.
Moreover, we expected that lower cognitive load would always lead to better review performance. However, our results showed the opposite: A higher cognitive load was related to better review effectiveness and efficiency in the Small Change, and to better effectiveness in Large Change A. These results seem to indicate that cognitive resources also need to be invested to perform good reviews.
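To make the mediation reasoning in this section concrete, the following is the generic textbook formulation of a single-mediator linear model, given here only as an illustration and not necessarily the exact specification fitted in this study. The symbols are assumptions introduced for this sketch: $X$ is the guidance treatment (e.g., a dummy variable for checklist versus ad hoc reviewing), $M$ is the measured cognitive load, and $Y$ is the review performance (effectiveness or efficiency).

```latex
\begin{align}
  Y &= i_1 + c\,X + \varepsilon_1
      && \text{(total effect of guidance on performance)} \\
  M &= i_2 + a\,X + \varepsilon_2
      && \text{(effect of guidance on cognitive load)} \\
  Y &= i_3 + c'\,X + b\,M + \varepsilon_3
      && \text{(direct effect $c'$, controlling for cognitive load)}
\end{align}
% Indirect (mediated) effect: a b. For linear models fitted on the same sample, c = c' + a b.
```

Under this formulation, the expectation that guidance improves performance by lowering cognitive load corresponds to $a < 0$ and $b < 0$, so that the indirect effect $ab$ is positive. A positive relation between cognitive load and performance, as described above for the Small Change and Large Change A, corresponds instead to $b > 0$.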

Influence of Participants' Lack of Experience
In this section, we reflect on the possible effects that the lack of experience in code review among our participants may have had on our results. In our experiment, we recorded both programming experience and review experience of the participants. While our participants had professional programming experience with Java, they can be considered as novice reviewers. Therefore, it seems that programming experience does not directly translate into the ability to find defects during code review. For this reason, in the following reflection, we focus on code review experience.
Experience reduces the cognitive load of the reviewers (Section 2): With higher experience, fewer cognitive resources are needed to complete a review. Developers with no review experience lack automatization of what and how to review, how to find appropriate information, and how to process it. Therefore, their cognitive resources are depleted faster. This might have an impact on the type of guidance inexperienced developers need as opposed to experienced reviewers. A more guided review approach, such as the one offered by our guided checklist, might be beneficial for novice reviewers as it guides them through the review step by step: The guided checklist supported developers in achieving higher review effectiveness in the Small Change. However, it did not assist the novice reviewers well in the complex change. This result is potentially due to the fact that the guided checklist requires a very detailed and thorough process that leads to a longer review time. Therefore, there is an additional strain on the mental load caused by representational holding, that is, the need to keep information ready in working memory for a prolonged period of time.
Developers' experience with the system under review might also have had an impact on our results. In our experiment, participants had to review change-sets extracted from a system they were not familiar with. This might have led them to spend significant cognitive resources on understanding the review changes, increasing their cognitive load and undermining the effect of the guidance. Therefore, future work could investigate whether the devised guidance approaches achieve better results when applied to review change-sets with which developers are already familiar.

Threats to Validity
Construct validity: The set of review tasks might influence developers' results. To mitigate this issue, we employed code review tasks already successfully applied in a previous experiment on code review. Moreover, participants had to use an online platform and guidance treatment with which they were not previously familiar. This might have negatively affected their review performance. To reduce this bias, the online platform showed review changes in a fashion similar to that of popular code review tools (e.g., Gerrit or Phabricator). Moreover, before starting the experiment, participants had to complete a short tutorial explaining the use of the experiment UI and of the guidance approaches (checklist and guided checklist). The way in which the checklist and guided checklist were implemented might have introduced bias in our results. We developed our guidance approaches following best practices from both researchers (Chernak 1996; Degani and Wiener 1991; Kamsties and Lott 1995) and industry (McConnell 2004). Nonetheless, we cannot rule out that different guidance approaches (e.g., with a more specific focus on developers' autonomy) could lead to different results.
Our guided checklist made developers spend significantly more time than the other groups. Furthermore, with a longer time, users of the guided checklist reported lower usability. Therefore, the review time was entered as an important confounding factor in the regression model.
Internal validity: We analyzed the experiment logs to identify participants who did not take the experiment seriously. To this aim, we disregarded participants who spent less than 5 minutes doing the review or did not enter any remark. Moreover, we also controlled for developers who might have taken part in the experiment several times or have restarted the experiment. A poor understanding of the use of the experiment UI and the code under review might have introduced bias in our results. To mitigate this issue, we supported participants in two ways. (1) We showed them an interactive tutorial on the UI and on the guidance approach used (if they were assigned to one). Participants were required to complete the tutorial by interacting with the UI before being able to proceed with the experiment. Furthermore, we asked them questions about the instructions of the experiment. If they answered wrongly, we displayed the correct answer. (2) Prior to each review task, we showed participants a description of the context of the change. Moreover, we asked the participants questions to verify their correct understanding of the context of the change. As before, if developers answered these questions wrongly, we made them aware of the correct answer. Furthermore, we controlled developers' understanding of the code through a set of questions at the end of each review task.
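The exclusion criteria described above (less than 5 minutes of review time, no remarks, repeated participation) can be sketched as a simple filter over the experiment logs. This is only an illustration under stated assumptions: the record fields `participant_id`, `review_seconds`, and `remarks` are hypothetical names, the 5-minute threshold is applied per log entry, and keeping the first valid entry per participant is an assumed policy, since the text does not specify how repeated or restarted sessions were handled.

```python
from typing import List, NamedTuple

MIN_REVIEW_SECONDS = 5 * 60  # threshold stated in the text: 5 minutes


class ReviewLog(NamedTuple):
    participant_id: str    # hypothetical identifier for a participant
    review_seconds: float  # hypothetical field: time spent on the review
    remarks: int           # hypothetical field: number of remarks entered


def keep_valid(logs: List[ReviewLog]) -> List[ReviewLog]:
    """Drop entries below the time threshold or without remarks, then keep
    only the first remaining entry per participant (assumed policy)."""
    seen = set()
    valid = []
    for log in logs:
        if log.review_seconds < MIN_REVIEW_SECONDS or log.remarks == 0:
            continue  # too short, or no remark entered
        if log.participant_id in seen:
            continue  # repeated participation or restart
        seen.add(log.participant_id)
        valid.append(log)
    return valid


# Made-up log entries for illustration only; NOT data from the study.
logs = [
    ReviewLog("a", 600, 2),
    ReviewLog("b", 120, 3),  # under 5 minutes
    ReviewLog("c", 900, 0),  # no remarks
    ReviewLog("a", 700, 1),  # second session of participant "a"
]
print([log.participant_id for log in keep_valid(logs)])
```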
Participants in our experiment achieved overall low review effectiveness and efficiency, regardless of the treatment to which they were assigned (control, checklist, and guided checklist). This prevented us from drawing strong conclusions to answer our research questions. Nonetheless, we were able to collect indications of the benefits of review guidance for developers' review performance and cognitive load.
External validity: All participants in the experiment were professional developers with experience in Java. However, they had rather little experience with code review and achieved low review effectiveness and efficiency. Therefore, our results are bound to novice reviewers with limited ability to identify defects. Furthermore, all participants work in the same company and, to the best of our knowledge, possess a very similar technical and cultural background. This might limit the generalizability of our findings.
The majority of the participants (N = 39) worked at least eight hours before taking part in our experiment. This might have negatively influenced their review performance and, therefore, introduced a bias in our results. To check the possible effect of the number of hours worked on the review performance, we included this variable as a control variable in our regression models. Our results showed that more hours worked do not have a negative effect on the reviewers' performance. On the contrary, they led to better performance.
We hypothesize three reasons as to why more hours worked before the experiment can be related to better code review performance: (1) The participants of the experiment are more productive at the end of their working day, (2) more productive developers worked on the experiment later in the day, and (3) in our sample, younger developers with more frequent coding practice had more hours worked before they started the experiment (Section 4.6.3); therefore, developers with more coding practice were working on the experiment later in the day.

Conclusion
We have examined how two types of guidance incorporating different methods of lowering cognitive load relate to code review performance and cognitive load. While a checklist performed better in a complex task, a tool-supported strategic checklist execution proved to be more effective in the simple task. Moreover, we obtained an initial indication that the use of a checklist lowers developers' cognitive load. However, a higher cognitive load was related to better code review performance. The study participants achieved low code review effectiveness and efficiency as well as a limited understanding of the code. Therefore, the higher cognitive load was probably needed to achieve better performance. Further studies are still needed to investigate the relationship between guidance, cognitive load, and code review performance.

Funding
Open access funding provided by University of Zurich. P. Wurzel Gonçalves, E. Fregnan, and A. Bacchelli gratefully acknowledge the support of the Swiss National Science Foundation through the SNF Projects No. PP00P2 170529.

Availability of data and material
The methodology of the present study was accepted as a registered report at MSR 2020. The published article is available here: https://dl.acm.org/doi/abs/10.1145/3379597.3387509 All data and materials are available in our replication package at the following link: https://doi.org/10.5281/zenodo.5653341

Code Availability
The code developed in the context of this study is available in our replication package at the following link: https://figshare.com/s/b26d1936417fe2c2c257

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.