1 Introduction

Code review is the systematic examination of code changes. It can be conducted before or after the change is integrated into the main code repository (Rigby et al. 2008). Code changes submitted by a developer are reviewed by one or more of their peers, which is why code reviews are also known as peer reviews or peer code reviews. For the sake of simplicity, we use the term code review in this study.

Code review is an important practice for software quality assurance (Tao and Kim 2015; Bavota and Russo 2015; Boehm and Basili 2001; Mäntylä and Lassenius 2009; Barnett et al. 2015). Several open source projects, e.g., Android, Qt, and Eclipse, as well as companies, e.g., Microsoft, Oracle, and Samsung, adopt code review as part of their development process. Likewise, studies have shown that code review can provide multiple benefits in the development process (Bacchelli and Bird 2013; Pangsakulyanont et al. 2014; Morales et al. 2015; Cohen et al. 2006; McIntosh et al. 2015).

The main goals of code review are to find bugs in the code change and to verify whether the project guidelines and coding style are respected (Fagan 1976; Wiegers 2002; Wang et al. 2015; Bacchelli and Bird 2013; Bosu et al. 2017). Furthermore, code reviews help to improve the quality of the code in production, find better ways to implement the change, spread knowledge about the project, and create awareness of the changes in the code base (Bacchelli and Bird 2013; Pangsakulyanont et al. 2014; Morales et al. 2015; Cohen et al. 2006; McIntosh et al. 2015).

Despite such benefits, code reviews can incur costs on software development projects, as they can delay the merge of a code change in the repository and, consequently, slow down the overall development process (Pascarella et al. 2018; Greiler 2016). The time invested by a developer in reviewing code is non-negligible (Tao and Kim 2015) and may take 10%–15% of the overall time invested in software development activities (Bosu et al. 2017; Cohen et al. 2006). Furthermore, performing a code review is not a trivial task per se. In fact, understanding the code change and its context is one of the major issues reviewers face during code reviews (Bacchelli and Bird 2013; Cohen et al. 2006; Tao et al. 2012; Sutherland and Venolia 2009; LaToza et al. 2006). The merge of a code change in the repository can be further delayed when reviewers experience difficulties in understanding the change, i.e., when they are not certain of its correctness, run-time behaviour and impact on the system (Cohen et al. 2006; Bacchelli and Bird 2013; Tao et al. 2012; Sutherland and Venolia 2009; LaToza et al. 2006).

We believe that confusion, i.e., any situation where a person is uncertain about something or unable to understand something (Ebert et al. 2017), can affect the artifacts that developers produce and the way they work, and hence, impact the development process (Cohen et al. 2006; Bacchelli and Bird 2013; Tao et al. 2012; Sutherland and Venolia 2009; LaToza et al. 2006). For instance, on the one hand, the code review might take longer than it should, the quality of the review might decrease, more discussions might take place, or even the code change might be blindly accepted or summarily rejected (Ebert et al. 2019). On the other hand, confusion might lead reviewers and authors to reach an improved solution (Ebert et al. 2019). As such, we believe that a proper understanding of the phenomenon of confusion in code reviews is a necessary starting point towards reducing the cost of code reviews and enhancing the effectiveness of this practice, thereby improving the overall development process.

In this paper, we extend our previous study of the reasons for and impact of confusion in code reviews, as well as the strategies developers adopt to deal with it (Ebert et al. 2019). In that study, we built a framework for confusion in code reviews comprising reasons, impacts, and the coping strategies adopted by developers. To do so, we employed a concurrent triangulation strategy combining a developer survey and the analysis of code review comments. Our findings show that there are 30 different reasons for confusion, and that the three most prevalent ones relate to the missing rationale for the change, the discussion of non-functional aspects of the solution, and the lack of familiarity with the existing project code. Furthermore, we observed that confusion can impact code reviews in 14 different ways. The most popular impacts are delays in the merge decision, a decrease in review quality, and the need for additional discussions. Finally, our framework includes 13 coping strategies developers reported adopting when dealing with confusion in code reviews. The most prevalent strategies include requesting more information, improving one's familiarity with the existing code, and engaging in off-line discussions.

The evidence provided by our previous study has several implications for both tool builders and researchers (Ebert et al. 2019). However, two factors motivated us to follow up on that study. The first factor is the relatively low number of coping strategies for confusion (13) compared to the number of reasons for confusion (30). This stems in part from the adopted methodology, since most of the discussion in the code reviews we examined revolves around the reasons for confusion (Ebert et al. 2019). The second factor is the contextualization of confusion in the literature, i.e., we want to discover to what extent different aspects of confusion are addressed in scientific studies. Code review has been extensively addressed by recent literature; hence, we intend to identify the solutions suggested for confusion in code reviews and, most importantly, to summarize existing gaps, i.e., where future research should focus. To address these issues, we conducted a systematic mapping study to identify mitigation strategies designed to address the most frequent reasons for confusion, as well as negative impacts of these reasons beyond confusion itself. Such strategies might be beneficial for developers facing confusion and complement the coping mechanisms they currently employ.

This paper extends our previous study by reporting on a systematic mapping study of the most frequently experienced reasons for confusion and the solutions proposed for them. To identify the most frequently experienced reasons for confusion, we conducted a survey with 62 developers. Based on their answers, we selected the five most frequent reasons for confusion and performed a systematic mapping study of the Software Engineering literature to assess to what extent the scientific literature discusses these reasons and to identify the solutions proposed for each of them. Based on the identified solutions, or the lack thereof, we propose an actionable guideline for developers on how to deal with confusion in code reviews. Furthermore, we propose a research agenda for researchers interested in studying how to support developers experiencing confusion.

The remainder of this paper is organized as follows. Section 2 presents the background related to this study. In Section 3, we present our first study aimed at understanding the reasons for confusion, its impacts, and the strategies developers used to deal with it. In Section 4, we present the preliminary study we conducted in order to identify the most frequent reasons for confusion according to developers. Next, in Section 5, we present the second study we conducted in order to investigate the solutions and impacts of the most frequent reasons for confusion proposed by literature. The discussion is presented in Section 6. The related work is discussed in Section 7. Finally, the conclusions and future work are presented in Section 8.

2 Background

In this section, we provide background on code reviews in Section 2.1. Then, we present our definition of confusion in Section 2.2.

2.1 Code Reviews

Formal code review was first defined by Fagan in 1976 in the form of software inspections (Fagan 1976). Software inspection, the most formal type of code review (Rigby and Bird 2013), is a structured process for reviewing source code that relies on rigid roles and steps, with the single goal of finding defects (Fagan 1976). Notwithstanding the initial success of Fagan's inspections in both industry and research, their formality brings several drawbacks. Indeed, inspections are very time-consuming because meetings need to be organised and participants need to prepare. Another disadvantage is the chance of turning the inspection meeting into a political or social disaster (Wiegers 2002). Moreover, the formality of inspections does not fit well with agile development methods (Martin 2003).

As a result, a more lightweight code review process with a better fit for test-driven and iterative development processes started to become more popular. Formalising this practice, Bacchelli and Bird (2013) defined the lightweight code review process as a “modern code review”, which is a review that is informal (as opposed to Fagan’s inspections), supported by code review tools, and occurs regularly in practice. We also use the term code reviews as a synonym for modern code reviews in this study.

Code review is an iterative process and can be instantiated in different ways. As input, a code review receives the original code change, and its outcome is the reviewed change, which might be either accepted or rejected. The developer who wrote the code change is the author, and might also be responsible for submitting the change for review. The reviewer is responsible for ensuring that the code change is functionally correct, meets the performance requirements, and follows the quality standards of the project.

In general, there are two types of workflow for code reviews, depending on when the review is conducted in the development process:

  • Review-then-commit (pre-commit): the code is reviewed before it is integrated into the main repository of the Version Control System (VCS) (Tichy 1985);

  • Commit-then-review (post-commit): the code is reviewed after it is integrated into the main repository of the VCS (Tichy 1985).

Since the most common type of code review is review-then-commit (Rigby 2011), it is the focus of this study. We present an example of the code review process within this approach in Fig. 1.

Fig. 1 The code review process

The process starts with the author submitting the code change (1). The reviewers are notified and start reviewing the code change (2). They check and verify it against several quality criteria, such as correctness and adherence to the project guidelines and conventions. If the reviewers believe that the code change does not fulfil those requirements, they ask the author to fix it or to submit a new one (3). The author then needs to work on the code change and submit it again (1) for review (2). When the reviewers are satisfied that the code change is suitable, it is integrated into the code repository (4). However, if the code change does not meet the reviewers' quality criteria, it is rejected and the code review is abandoned (5). There might be several iterations (1 to 3) before the reviewers decide to end the process, at which point the code change is either accepted (i.e., merged into the main repository) or rejected (i.e., discarded).
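To make the workflow concrete, the following sketch models the states and transitions of Fig. 1 as a minimal state machine. This is purely illustrative: the state names and the advance helper are our own choices and are not part of any review tool's API.

```python
# A minimal, hypothetical model of the review-then-commit workflow of Fig. 1.
# The numbered steps correspond to state transitions of a code change.
from enum import Enum, auto

class ReviewState(Enum):
    SUBMITTED = auto()      # (1) the author submits the change
    UNDER_REVIEW = auto()   # (2) reviewers check the change
    NEEDS_REWORK = auto()   # (3) reviewers ask for fixes or a new version
    MERGED = auto()         # (4) the change is accepted and integrated
    ABANDONED = auto()      # (5) the change is rejected and discarded

# Allowed transitions; the (1)-(3) loop may repeat over several iterations.
TRANSITIONS = {
    ReviewState.SUBMITTED: {ReviewState.UNDER_REVIEW},
    ReviewState.UNDER_REVIEW: {ReviewState.NEEDS_REWORK,
                               ReviewState.MERGED,
                               ReviewState.ABANDONED},
    ReviewState.NEEDS_REWORK: {ReviewState.SUBMITTED},
}

def advance(state: ReviewState, next_state: ReviewState) -> ReviewState:
    """Move the change to next_state if the workflow allows the transition."""
    if next_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state.name} -> {next_state.name}")
    return next_state

# One possible run: submit, review, rework, resubmit, review, merge.
state = ReviewState.SUBMITTED
for step in [ReviewState.UNDER_REVIEW, ReviewState.NEEDS_REWORK,
             ReviewState.SUBMITTED, ReviewState.UNDER_REVIEW, ReviewState.MERGED]:
    state = advance(state, step)
print(state.name)  # MERGED
```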

2.2 Confusion Definition

Several studies, especially in the field of Psychology, have tried to model the affective disequilibrium related to confusion, uncertainty, and lack of knowledge. In this section, we discuss some of the most relevant studies on these topics.

The Merriam-Webster dictionary provides the following definitions of the word confusion: (1) “a situation in which people are uncertain about what to do or are unable to understand something clearly” and (2) “the feeling that you have when you do not understand what is happening, what is expected, etc.”, i.e., confusion is both a situation and a feeling.

Armour (2000) suggested categorising ignorance into layers based on what we know and what we do not know. He defined the Five Orders of Ignorance:

  • 0th Order Ignorance - Lack of Ignorance: when we know something, i.e., it is knowledge;

  • 1st Order Ignorance - Lack of Knowledge: when we do not know something, but we can easily identify that fact;

  • 2nd Order Ignorance - Lack of Awareness: when we do not know that we do not know something, i.e., when we are unaware of that fact;

  • 3rd Order Ignorance - Lack of Process: when we do not know a suitably efficient way to find out we do not know that we do not know something;

  • 4th Order Ignorance - Meta Ignorance: when we do not know about the Five Orders of Ignorance.

D’Mello and Graesser (2014) focused on confusion and how it impacts learning and problem solving. Similarly to the second definition of Merriam-Webster, D’Mello and Graesser consider confusion to be an affective state. According to the authors, confusion happens when an individual detects new or discrepant information, e.g., there is a conflict with prior knowledge. Jordan et al. (2012) investigated the frequency of uncertainty expressions in discussions of students using a computer-mediated environment. The authors introduced their own definition of uncertainty and provided a coding scheme to describe and model it. Acknowledging that defining uncertainty was not simple, Jordan et al. (2012) define uncertainty as: “situations when individuals have a sense of wondering, doubt, or unease about how the future will unfold, what the present means, or how to interpret the past”.

We believe that lack of knowledge and confusion, which can also encompass doubt and uncertainty, are closely linked (e.g., confusion can stem from a lack of knowledge) and are both actionable (D’Mello and Graesser 2014). Thus, we define confusion broadly as:

“a situation where a person is uncertain about or unable to understand something.”

3 Understanding Confusion in Code Reviews (Ebert et al. 2019)

In this section, we summarize our previous study aimed at building a framework for confusion in code reviews. Specifically, we investigated the reasons for confusion (RQ1), its impacts (RQ2), and the strategies developers use to deal with it (RQ3) (Ebert et al. 2019). To the best of our knowledge, our study is the first to conduct a deep investigation of the phenomenon of confusion in code reviews. In Section 4, we build upon this study to gain further insights into the most frequently experienced reasons for confusion.

We describe the methodology in Section 3.1. The results are presented in Section 3.2. Finally, we discuss the threats to validity in Section 3.3.

3.1 Methodology

To strengthen the validity of the study we follow the recommendation of Easterbrook et al. (2008) and opt for a concurrent triangulation strategy, which is a combination of different research methods. Firstly, we conduct a survey to understand “what developers say” (Section 3.1.1). Then, we analyze the code review comments to understand “what developers do” (Section 3.1.2). Finally, we compare and contrast the findings of the two analyses (Section 3.1.3): indeed, Easterbrook et al. (2008) observe that “what people say” could be different from “what people do”.

3.1.1 Surveys

The SE literature lacks a theory describing the reasons for confusion in code reviews, the impact of confusion on the development process, and the coping strategies developers employ to deal with it. As such, to answer our RQs we opt for grounded theory building (Glaser and Strauss 1967; Stol et al. 2016). We implement an iterative approach: during each iteration, we administer a survey to developers involved in code reviews, and we ask developers who already answered the survey during one of the previous iterations to refrain from answering it again.

Survey Design

The survey was designed according to established best practices (Groves et al. 2009; Kitchenham and Pfleeger 2008; Singer and Vinson 2002; Steele and Aronson 1995): prior to asking questions, we explain the purpose of the survey and our research goals, disclose the sponsors of our research, and ensure that the information provided will be treated confidentially. In addition, we inform the participants about the estimated time required to complete the survey and obtain their informed consent. The invitation message includes a personalized salutation, a description of the criteria we used for participant selection, as well as an explanation that there would not be any follow-up if the respondent did not reply. This last decision also implies that we did not send reminders.

The survey starts with the definition of confusion provided in Section 2, followed by a question requiring the participants to confirm that they understood the definition. Next, we ask two series of questions: the questions were essentially the same but were first asked from the perspective of the author of the code change, and then from the perspective of the reviewer of the change (cf. Table 1). Each series starts with a Likert-scale question about the frequency of experienced confusion: never, rarely, sometimes, often, and always. To ensure that the respondents interpret these terms consistently, we provide quantitative estimates: 0%, 25%, 50%, 75% and 100% of the time. For respondents who answered anything different from never, we pose four open-ended questions (to obtain data that is as rich as possible (Foddy 1993)): i) what are the reasons for confusion, ii) whether they can provide an example of a practical situation where confusion occurred during a code review (RQ1), iii) what are the impacts of confusion (RQ2), and iv) how do they cope with confusion (RQ3). Finally, we ask the participants to provide information about their experience as developers and the frequency with which they review and author code changes. We ask these questions at the end of the survey rather than at the beginning to reduce the stereotype threat (Steele and Aronson 1995). Prior to deploying the survey, we discussed it with other software engineering researchers and clarified it where necessary.

Table 1 Survey questions. The questions marked “*” were only used in the first survey; those marked “+” were only used in the second and third surveys

Participants

The target population consists of developers who participated in code reviews either as a change author or as a reviewer. During the first iteration we target Android developers who participated in code reviews on Gerrit: the 4,645 email addresses provided by Ebert et al. (2017) allow us to contact these developers by email and to evaluate the response rate. In the subsequent iterations, the survey was announced on Facebook and Twitter. As the exact number of developers participating in code reviews who were reached this way cannot be known, we do not report the response rate for the follow-up surveys.

Data Analysis

To analyze the survey data, we use a card sorting approach (Zimmermann 2016). We analyze the survey responses from the first iteration using open card sorting (Zimmermann 2016), i.e., topics were not predefined but emerged and evolved during the sorting process. After each subsequent survey iteration, we use the results of the previous iteration to perform closed card sorting (Zimmermann 2016), i.e., we sort the answers of each survey iteration according to the topics emerging from the previous one. If the closed card sorting succeeds, this means that the saturation has been reached and sampling more data is not likely to lead to the emergence of new topics (Finfgeld-Connett 2014; Lenberg et al. 2017). In such a case the iterations stop. If, however, during the closed card sorting additional topics emerge, another iteration is required.

To facilitate analysis of the data we use axial coding (Kitchenham and Pfleeger 2008) to find the connections among the topics and group them into dimensions. These dimensions emerge and evolve during the final phase of the sorting process, and they represent a higher level of abstraction of the topics.

As we have multiple iterations and multiple surveys answered by different groups of respondents, it is a priori not clear whether the respondents can be seen as representing the same population. Indeed, it could have been the case that, e.g., respondents of the second survey happened to be less inclined to experience confusion than the respondents of the third survey, and that the reasons for their confusion were very different. This is why we first check the similarity of the groups of respondents in terms of their experience as developers and code reviewers, frequency of submitting changes to be reviewed and of reviewing changes, as well as frequency of experiencing confusion. If the groups of respondents are found to be similar, we can consider them as representing the same population and merge the responses. If the groups of respondents are found to be different, we treat the groups separately. To perform the similarity check we use two statistical methods: i) analysis of similarities (ANOSIM) (Clarke 1993), which provides a way to test statistically whether there is a significant difference between two or more groups of sampling units, and ii) permutational multivariate analysis of variance using distance matrices (PERMANOVA) (Anderson 2001; McArdle and Anderson 2001).
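To illustrate how such a similarity check can be carried out, the sketch below applies ANOSIM and PERMANOVA as implemented in the scikit-bio library to a hypothetical matrix of survey answers; the response data, group labels, and distance metric are placeholders, not our original analysis script.

```python
# Hedged sketch: compare two groups of survey respondents with ANOSIM and
# PERMANOVA (scikit-bio). All data below is hypothetical.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from skbio import DistanceMatrix
from skbio.stats.distance import anosim, permanova

rng = np.random.default_rng(42)
# Hypothetical Likert-scale answers (rows = respondents, columns = questions).
responses = rng.integers(1, 6, size=(30, 10)).astype(float)
# Hypothetical group labels: which survey iteration each respondent answered.
groups = ["survey_1"] * 15 + ["survey_2_3"] * 15
ids = [f"r{i}" for i in range(len(groups))]

# Pairwise distances between the respondents' answer vectors.
dm = DistanceMatrix(squareform(pdist(responses, metric="euclidean")), ids=ids)

# ANOSIM: an R close to 0 and a high p-value suggest the groups are similar.
print(anosim(dm, groups, permutations=999))
# PERMANOVA as a complementary test on the same distance matrix.
print(permanova(dm, groups, permutations=999))
```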

3.1.2 Analysis of Code Review Comments

To triangulate the survey findings for the RQs we perform an analysis of code review comments. As a dataset we use the one provided by Ebert et al. (2017). Similarly to the developers contacted during the first survey iteration, this dataset pertains to Android. The code reviews of Android are supported by Gerrit, which enables communication between developers during the process by using general and inline comments. The former are posted in the code review page itself, which presents the list of all general comments, while the inline comments are included directly in the source code file. The dataset of Ebert et al. comprises 307 code review comments manually labeled by the researchers as confusing: 156 are general and 151 are inline comments.

Similarly to the analysis of the survey data, we use card sorting to extract topics from the code review comments. We conduct an open card sorting of the general comments to account for the possibility of divergent results, i.e., we did not want to use the results from the surveys because what developers do often differs from what they think they do and the emergent codes might a priori be different from those obtained when analyzing the survey data. To confirm the topics emergent from the general comments we then perform a closed card sorting on the inline comments.

3.1.3 Triangulating the Findings

Recall that the goal of concurrent triangulation is to corroborate the findings of the study, increasing its validity. However, following Easterbrook et al. (2008) we expect to see some differences between ‘what people say’ (survey) and ‘what people do’ (code review comments). Hence, if the topics extracted from the surveys and code review comments disagree, we conduct a new card sorting round only on the cards associated with topics discovered on the basis of the survey but not on the basis of the code review comments, or vice versa. In order not to be influenced by the results of the previous card sorting, we perform open card sorting and exclude the researchers who participated in the previous card sorting rounds. Finally, in order to finalize the framework for confusion in code reviews, we perform the consistency check within the topics and deduction of more generic topics, as recommended by Zimmermann (2016), as well as a consistency check across RQs (i.e., reasons, impacts, and coping strategies) and emergent dimensions.

3.2 Results

We discuss the application of the research method in practice (Section 3.2.1), and analyze similarity between the responses received at each one of the survey iterations (Section 3.2.2). Then, we present the demographics results from the survey (Section 3.2.3), and discuss reasons for confusion (RQ1, Section 3.2.4), its impact (RQ2, Section 3.2.5), and the strategies employed to cope with it (RQ3, Section 3.2.6).

3.2.1 Implementation of Approach

The implementation of the approach designed in Section 3.1 is shown in Fig. 2. Individuals involved in the card sorting are graduate students in computer science or researchers.

Fig. 2 Implementation of the approach: three survey rounds, general and inline comments, the triangulation, and finalization rounds (Ebert et al. 2019)

Following the iterative approach, we performed three survey iterations, at which point saturation was reached. Among the 4,645 emails sent during the first iteration, 880 bounced; hence, the 17 valid responses correspond to a response rate of 0.45%. Such a low response rate was unexpected and might have been caused by the presence of inactive members or one-time contributors (Lee et al. 2017). For the second and the third survey rounds, the numbers of responses are 24 and 13, respectively; the response rate could not be computed.

The open card sorting of the first survey resulted in 52 topics related to the reasons (25), impacts (14) and coping strategies for confusion (13). The closed card sorting of the second survey resulted in three additional topics: two for impacts and one for the coping strategies. Finally, the closed card sorting of the third survey resulted in no new topics. The open card sorting on the general comments resulted in 16 topics related only to the reasons for confusion, i.e., no topics related to the impacts and coping strategies appeared. Then, the closed card sorting on the inline comments resulted in no new topics.

During the triangulation, we verified that what developers said about the reasons for confusion (survey) has little agreement with what developers did in the code review comments. Only 6 topics were found both among the survey answers and the code review comments; 19 topics appeared only in the survey and 10 topics only in the code review comments. Thus, we decided to conduct another card sorting on the 29 divergent topics. Since this round was an open card sorting, we identified 42 topics from the cards belonging to the divergent topics. As the last step, we finalized the framework and obtained a total of 57 topics related to reasons (30), impacts (14), and coping strategies (13). After finalizing the topics we observe that 70% (21/30) of the reasons have cards both from the surveys and from the review comments. Moreover, the shared topics cover the lion’s share of the cards: 94.9% of the survey cards and 90.7% of the code review comments’ cards.

As explained above, using axial coding we identified the following dimensions, common to the answers to the three RQs:

  • review process (18 topics): the code review process, including issues that affect the review duration;

  • artifact (15 topics): the system prior to the change, the code change itself and its documentation, or the system after the change;

  • developer (15 topics): topics regarding the person implementing or reviewing the change;

  • link (9 topics): the connection between developers and artifacts, e.g., when a developer indicates that they do not understand the code.

Examples of topics of different dimensions can be found in Sections 3.2.4, 3.2.5 and 3.2.6.

3.2.2 Analysis of Similarity of the Surveys’ Results

First, we verified the similarity of the second and third surveys. Since both were published on Facebook and Twitter, we expect the respondents to represent the same population. Using both ANOSIM (R = − 0.0171 and p-value = 0.542) and PERMANOVA (p-value = 0.975) we could not observe statistically significant differences between the groups, i.e., the answers can be grouped together. Then, we checked the similarity between the answers to the first survey (Android developers) and the answers to the second and the third surveys taken together. The results of the ANOSIM analysis, R = 0.126 and p-value = 0.01, showed that the difference between the groups is statistically significant. However, the low R means that the groups are not very different (values closer to 1 mean a larger difference between samples), i.e., the overlap between the surveys is quite high. This observation is confirmed by the outcome of the PERMANOVA test: the p-value of 0.191 is above the commonly used threshold of statistical significance (0.05). Based on these results, we conclude that the respondents represent the same population of developers, and we report the results of all three surveys together.

3.2.3 Demographics of the Survey Respondents

The respondents are experienced code reviewers: 80% (38 of the 47 respondents who answered the questions about demographics) have more than two years of experience reviewing code changes. The experience of our population as developers, i.e., authoring code changes, is even higher: 93% (44 respondents) have been developing for more than two years. The number of years of experience as developers is higher than the number of years of experience as reviewers: this is expected because reviewing tasks are usually assigned only to more experienced individuals (van Wesel et al. 2017). Respondents are active in submitting changes for review, and even more active in reviewing changes: almost 49% (23 developers) submit code changes for review several times a week, while for reviewing this percentage reaches 72% (34).

The frequencies with which code change authors and code reviewers experience confusion are summarized in Fig. 3. On the one hand, when reviewing code changes, about 41% (20) of the respondents feel confusion at least half of the time, and only 10% (5) do not feel confusion. On the other hand, when authoring code changes only 12% (6) of the respondents feel confusion at least half of the time, and 35% (17) of the respondents do not feel confusion. Comparing these figures we conclude that confusion when reviewing is very common, and that developers are more often confused when reviewing changes submitted by others than when authoring changes themselves. We also applied the χ2 test to check whether experience influences the frequency with which confusion is experienced. The test was not able to detect differences between more and less experienced developers in terms of the frequency of confusion experienced as a developer, nor between more and less experienced reviewers in terms of the frequency of confusion experienced as a reviewer (p ≈ 0.26 and 0.09, respectively).
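The χ2 test mentioned above is a standard independence test on a contingency table; the following sketch shows how such a check could be run, with made-up counts that are not the actual survey data.

```python
# Hypothetical illustration of the chi-squared independence test:
# reviewing experience vs. reported frequency of confusion.
from scipy.stats import chi2_contingency

# Rows: less vs. more experienced reviewers (hypothetical counts).
# Columns: reported confusion frequency (never, rarely, sometimes, often/always).
observed = [
    [2, 5, 8, 4],
    [3, 12, 9, 4],
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}, dof = {dof}")
# A p-value above 0.05 means the test cannot reject independence, i.e., it
# detects no relation between experience and frequency of confusion.
```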

Fig. 3 Frequency of confusion for developers and reviewers

3.2.4 RQ1. What Are the Reasons for Confusion in Code Reviews?

We found 30 reasons for confusion in code review (see Table 2). They are spread over all the dimensions, with the artifact and review process being the most prevalent.

Table 2 The reasons, impacts, and coping strategies developers use to deal with confusion; the numbers in parentheses are the numbers of cards

There are seven reasons for confusion related to the code review process. The most common is organization of work, which comprises reasons such as an unclear commit message (e.g., “when the description of the pull request is not clear”, R50), the status of the change (e.g., “I’m unsure about the status of your parallel move changes. Is this one ready to be reviewed? [...]”), or the change addressing multiple issues (e.g., “change does more than one things”, R31). The second and third most cited reasons are, respectively, confusion about the tools, e.g., “I don’t know why the rebases are causing new CLs”, and the need for the code change, e.g., “If I understand correctly, this change might not be relevant any more”.

The artifact dimension is the largest group, with 11 topics related to the reasons for confusion. The most popular is the absence of the change rationale, e.g., “I do not fully understand why the code is being modified” (R20). Discussion of the solution related to non-functional aspects of the artifact is the second largest topic, and it comprises reasons such as poor code readability (e.g., “Poorly implemented code”, R43) and performance (e.g., “is this true? i can’t tell any difference in transfer speed with or without this patch. i still get roughly these numbers from “adb sync” a -B build of bionic: [...]”). The third most frequent reason indicates that developers experience confusion when unsure about the system behavior, e.g., “what is the difference between this path (false == unresolved) and the unresolved path below. [...]”.

Six reasons for confusion are related to the developer dimension. Disagreement among developers is the most prevalent topic, e.g., “[...] If actual change has a big difference from my expectation, I am confused.” (R11). The second most cited reason is the misunderstanding of the message’s intention, e.g., “Sometimes I don’t understand general meaning (need to read several times to understand what person means)” (R13).

Six reasons are related to the link between the developer and the artifact. The most popular one is the lack of familiarity with existing code, e.g., “Lack of knowledge about the code that’s being modified.” (R37), followed by the lack of programming skills, e.g., “sometimes I’m confused because missing some programming” (R13), and the lack of understanding of the problem, e.g., “I’m embarrassed to admit it, but I still don’t understand this bug.”


3.2.5 RQ2. What Are the Impacts of Confusion in Code Reviews?

The total number of topics related to the impacts of confusion is 14 (see Table 2). They are related to the dimensions of the review process, artifact, and developer. There was no topic related to the link between the developer and the artifact.

We found seven impacts of confusion related to the code review process. Delaying the merge decision is the most popular impact, e.g., “The review takes longer than it should” (R46). The second and third most cited impacts are a decrease in code review quality, e.g., “Well I can’t give a high quality code review if I don’t understand what I am looking at” (R5), and an increase in the number of messages exchanged during the discussion, e.g., “Code reviews take longer as there’s additional back and forth” (R1). One interesting impact of confusion is the blind approval of the code change by the developer, even without understanding it, e.g., “Blindly approve the change and hope your coworker knows what they’re doing (it is clearly the worst; that’s how serious bugs end up in production)” (R16). Confusion may also lead developers to simply reject a code change, e.g., “I’m definitely much more likely to reject a ’confusing’ code review. Good code, in my experience, is usually not confusing” (R36).

There are only two impacts of confusion related to the artifact itself. First, the developer may find a better solution because of the confusion, e.g., “It has not only bad impact but also good impact. Sometimes I can encounter a better solution than my thought” (R11). Second, the code change might be approved with bugs, as the reviewer is not able to review it properly due to confusion, e.g., “Sometimes repeated code is committed or even a wrong functionality” (R24). The incorrect solution impact is related to the decrease in review quality; however, the perspective here is that of the code change containing a bug in production rather than that of the reviewing process.

Finally, there are four impacts of confusion related to the developer. The most quoted impact is the decrease of self-confidence, either by the author, e.g., “I can’t be confident my change is correct” (R38), or by the reviewer, e.g., “I feel less confident about approving it” (R48). Another impact is the developer giving up, either abandoning a code change instead of accounting for the reviewer’s comments, e.g., “other times I just give up” (R14), or leaving the project, e.g., “dissociated myself a little from the codebase internally” (R14). We also found emotions being triggered by confusion, such as anger (e.g., “It pissed me off”, R3) and frustration (e.g., “Cannot be an effective reviewer—can replace me with a lemur”, R40). Finally, confusion can be contagious, e.g., “It often causes confusion spreading to other reviewers” (R12).


3.2.6 RQ3. How Do Developers Cope with Confusion?

We found 13 topics describing the strategies developers use to deal with confusion in code reviews. Four of them are related to the review process. The most common is to improve the organization of work, such as making clearer commit messages, e.g., “Leave comments on the files with the main changes” (R50). It is followed by spending more time and delaying the code review, e.g., “I need to spend much more time” (R13). Assigning other reviewers is also a strategy adopted by developers, e.g., “Sometimes I completely defer to other reviewers” (R48). Interestingly, blind approval is also a strategy developers use to cope with confusion, i.e., it is not just an impact, e.g., “assume the best, (of the change)” (R34).

Two strategies are related to the artifact. Developers make the code change smaller, e.g., “Also I ask large changes to be broken into smaller” (R31), and clearer, e.g., “Try to make the actual code change clear” (R12). They also improve the documentation by adding code comments, e.g., “A good description in the commit message describing the bug and the method used to fix the bug is also helpful for reviewers” (R5).

The dimension with the most quotes is related to the developers themselves. Requesting information via the code review tool itself is the strategy most quoted by developers, e.g., “Put comment and ask submitter to explain unclear points” (R15). Developers also take the discussions off-line, i.e., using other means to reach their peers, e.g., “schedule meetings” (R50) or “ask in person” (R1). Providing and accepting suggestions is also mentioned as a good way to cope with confusion. It includes strategies such as being open-minded to the comments of their peers, e.g., “Being open to critical review comments” (R12), and providing polite criticism, e.g., “Trying to be ’a nice person’. Gently criticizing the code” (R3). The use of criticism by developers in code reviews was also found by Ebert et al. (2018), but their study focused on the intention of questions in code reviews. Disagreement resolution is also a good strategy to cope with confusion, e.g., “I try to explain the reasoning behind the decisions/assumptions I made” (R31).

Regarding the link between the developer and the artifact, there are three strategies developers use to cope with confusion. Firstly, to study the code or the documentation, e.g., “It forces me to dig deeper and learn more about the code module to make sure that my understanding is correct (or wrong)” (R12), and “Read requirements documentation” (R24). Secondly, to test the code change, e.g., “play with the code” (R9). Finally, developers also use external sources to improve their knowledge about the technology, e.g., “Sometimes further research on the web [...]” (R25).


3.3 Threats to Validity

As with any empirical study, our work is subject to several threats to validity. We identified three kinds of threats: construct, internal, and external validity, all of which are discussed below.

Construct validity is related to the relation between the concept being studied and its operationalisation. In particular, it is related to the risk of respondents misinterpreting the survey questions. To reduce this risk we included our own definition of confusion and requested the respondents to confirm that they understood it. For the same reason, we always anchored the frequency questions and adhered to well-known survey design recommendations (Groves et al. 2009; Kitchenham and Pfleeger 2008; Singer and Vinson 2002; Steele and Aronson 1995).

Internal validity pertains to inferring conclusions from the data collected. The card sorting adopted in our work is inherently subjective because of the necessity to interpret text. To reduce subjectivity, every card sorting step has been carried out by several researchers. Moreover, to assure the completeness of the topics related to the reasons, impacts, and coping strategies for confusion, we conducted several survey iterations until data saturation was achieved, and augmented the insights from the surveys with those from the code review comments.

External validity is related to the generalizability of the conclusions beyond the specific context of the study. Our first survey targeted only a single project: Android. However, the second and the third ones targeted a general software developer population. Statistical analysis has not revealed any differences between the respondents of the different surveys, suggesting that the answers obtained are likely to reflect the opinions of code review participants in general. To complement the surveys we consider 307 code review comments from Gerrit. While the functionality of Gerrit is typical of most modern code review tools, developers using more advanced code review tools do not necessarily experience confusion in the same way. For instance, Collaborator supports custom templates and checklists that, if properly configured, might require the change authors to indicate the rationale of their change, reducing the importance of “missing rationale” from Table 2.

4 Which Reasons for Confusion are Most Frequent? A Preliminary Study

The long-term goal of our research is to help developers combat confusion in code reviews. The main contribution of the study discussed in Section 3 is a framework for confusion in code reviews, presented in Table 2, including 30 reasons, 14 impacts, and 13 coping strategies. The difference in numbers between the reasons, on the one hand, and impacts and coping strategies, on the other hand, suggested a gap between the way confusion is experienced and the ways it impacts software development and is addressed. However, many of these reasons for confusion have been extensively studied in the scientific literature (Bacchelli and Bird 2013; Tao et al. 2012; Kononenko et al. 2015). Hence, we decided to complement the results in Table 2 by investigating the solutions proposed by literature for the most frequent reasons for confusion, as well as the impacts of those reasons. As a preliminary step towards this goal, we survey developers to gauge the frequency with which the 30 reasons for confusion from our framework typically occur in practice. The results of the survey allow us to prioritize the reasons for confusion, i.e. to identify the reasons for confusion to focus on in the literature review discussed in Section 5.

In the remainder of this section, we present the aforementioned survey, involving 62 developers. More specifically, we aim at answering the following research question:

  • RQ4. Which reasons for confusion do developers perceive as occurring most frequently?

We describe the methodology in Section 4.1. Section 4.2 presents the results, and threats to the validity are discussed in Section 4.3.

4.1 Methodology

We start by discussing the design of our survey (Section 4.1.1). Then, we present the participants selection (Section 4.1.2). Finally, we discuss the data analysis process we used (Section 4.1.3).

4.1.1 Survey Design

We designed a survey to ask code reviewers how often they experience each of the 30 reasons for confusion included in our framework (see Table 2). We design the survey in line with established best practices (Groves et al. 2009; Kitchenham and Pfleeger 2008; Singer and Vinson 2002; Steele and Aronson 1995). We start by explaining the goal of this survey and our research goals, disclose the sponsors of our research, and inform the respondents that the information provided will be treated confidentially. We also inform the respondents about the estimated time needed to finish the survey and then obtain their informed consent.

The questions of the survey are presented in Table 3. The survey starts with the same definition of confusion used in the former study and presented in Section 2. Then, we ask the respondents to confirm their understanding of this definition (Q1). Next, Q2–Q29 ask how often the respondents feel confused when reviewing changes due to each of the reasons for confusion from Table 2, i.e., we focus on code reviewers. Frequency is measured on a Likert scale: not at all, less than once a month, once a month, once a week, once a day, and more than once a day. For the sake of readability, we split the questions corresponding to the reasons for confusion from Table 2 according to the four dimensions defined in Section 3.2: review process, artifact, developer, and link between the developer and the artifact. We do not include two of the reasons for confusion in this survey, code ownership and community norms, since they relate only to the code change author and not to the reviewer.

Table 3 Survey questions

Before deploying the survey, we discussed it with other software engineering researchers and clarified it when necessary: e.g., we replaced “unnecessary change” by “a change which is unnecessary for the project”.

4.1.2 Participants

As the target population, we consider developers who have reviewed code changes. We sent the survey to two different groups. The first group comprises 33 developers who answered the survey of our first study (cf. Section 3.1.1) and indicated that they would like to be informed about the results of that study. Ten days after the first email, we sent a reminder. The email message included a personalized salutation, a brief discussion of the results of our first study (Ebert et al. 2019), an explanation of this new study, and the link to the new survey. The second group consists of developers recruited via social media: we published the survey on Facebook and Twitter and asked developers to answer it. We left the survey open until we received no more responses for two weeks (cf. the surveys conducted by German et al. (2018) and Kononenko et al. (2018)).

4.1.3 Data Analysis

Similarly to the analysis of Section 3.1.1, we have a survey with two different groups of respondents. Thus, a priori it is not clear if the responses can be seen as representing the same population. We used the same statistical methods, ANOSIM (Clarke 1993) and PERMANOVA (Anderson 2001; McArdle and Anderson 2001), to perform the similarity check. Again, if the groups of respondents can be said to be similar, we can consider them as representing the same population, and then merge the responses. Otherwise, we would treat the groups separately.

To further analyze the responses of our survey, we applied the Scott-Knott Effect Size Difference (ESD) test (Tantithamthavorn et al. 2017) to group the 28 reasons for confusion into statistically distinct ranks according to their Likert scores in terms of frequency. Scott-Knott ESD is a variant of the Scott-Knott test (Scott and Knott 1974) that makes no normality assumption about the data. The Scott-Knott ESD test merges any two statistically distinct groups that have a negligible effect size into one group. Scott-Knott ESD has been successfully applied in the software engineering context (Calefato et al. 2019; Catolino and Ferrucci 2019; Tantithamthavorn et al. 2017).
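The actual Scott-Knott ESD test is distributed as an R package by Tantithamthavorn et al.; the sketch below only illustrates the underlying intuition with a much simpler, greedy grouping by Cohen's d on made-up Likert data. It is not the test used in our analysis.

```python
# Simplified, hypothetical illustration of effect-size-based ranking:
# reasons are sorted by mean Likert score, and adjacent reasons stay in the
# same rank while their difference from the rank's first reason has a
# negligible Cohen's d (< 0.2). The real Scott-Knott ESD test additionally
# uses hierarchical partitioning and statistical testing.
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return abs(np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

def rank_by_effect_size(scores, threshold=0.2):
    """Greedily group items whose effect size w.r.t. the first item of the
    current group is negligible. scores maps item -> array of Likert answers."""
    ordered = sorted(scores, key=lambda k: np.mean(scores[k]), reverse=True)
    ranks, current = [], []
    for item in ordered:
        if current and cohens_d(scores[current[0]], scores[item]) >= threshold:
            ranks.append(current)
            current = []
        current.append(item)
    ranks.append(current)
    return ranks

# Hypothetical Likert answers (1-6) for three reasons for confusion.
rng = np.random.default_rng(0)
answers = {
    "long or complex change": rng.integers(3, 7, 62),
    "lack of documentation": rng.integers(3, 7, 62),
    "language issues": rng.integers(1, 4, 62),
}
for rank, group in enumerate(rank_by_effect_size(answers), start=1):
    print(f"rank {rank}: {group}")
```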

4.2 Results

In this section, we present the results of our survey. We start by explaining how we conducted the survey (Section 4.2.1). Then we present the results of the similarity analysis (Section 4.2.2). Finally, we present the results of RQ4 using Scott-Knott ESD test (Tantithamthavorn et al. 2017) (Section 4.2.3).

4.2.1 Implementation of the Survey

The first emails were sent on July 15th, 2019. Among the 33 emails sent to the first group, four bounced. We received 13 responses, i.e., a response rate of 44%. Seven developers answered the survey on the first day, while the remaining six answered after the reminder. The survey was published on Facebook and Twitter on the same day we sent the emails. The response rate could not be computed for this group. We closed the survey on August 21st, after two weeks without a new response; the last response we received was on August 7th. We received 50 responses via social media, but one respondent did not indicate their consent, leaving us with 49 valid responses.

4.2.2 Analysis of Similarity of the Surveys’ Results

The results of the similarity check with ANOSIM, R = − 0.06928 and p-value = 0.792, did not show any statistically significant differences between the two groups. The results for the PERMANOVA method, p-value = 0.506, also did not show any statistically significant differences. Based on those results, we conclude that the two groups of respondents represent the same population of developers, and subsequently we merged their responses and report the results pertaining to the combined group. Hence, we have a total of 62 valid responses considered in our analysis.

4.2.3 RQ4. Which Reasons for Confusion do Developers Perceive as Occurring Most Frequently?

The results of the frequency of reasons for confusion are presented in Table 4. Since our goal is to define the most frequent reasons for confusion, we need a fair measure to order them. One possibility is to consider as more frequent the reasons that more developers classified as “More than once a day”, normalized by the overall number of classifications for each reason. A similar approach has been employed by previous work (Begel and Zimmermann 2014). However, in our case, every reason has been classified the same number of times, unlike previous work. Furthermore, we do not think that a reason classified just once as “More than once a day” but not as “Once a day” is really more frequent than one that has not been classified as “More than once a day” but received, e.g., ten classifications as “Once a day”.

Table 4 The 28 reasons for confusion ranked according to the Scott-Knott Effect Size Difference test in terms of frequency, and the mean and median Likert scores

Thus, we used the Scott-Knott Effect Size Difference (ESD) test (Tantithamthavorn et al. 2017) to group reasons with similar frequencies. Table 4 shows the 28 reasons for confusion organized into seven different groups. The first group contains the most frequent five reasons for confusion. Additionally, Table 4 also shows the mean and median Likert scores for the 28 reasons for confusion in terms of frequency, and their respective dimensions.

We can see that the most frequent reasons for confusion are either related to the artifact (i.e., the code change itself) or to the review process. They are: long or complex code change, organization of work (e.g., an unclear commit message, the status of the code review, a change addressing multiple issues), dependency between different code changes, lack of documentation, and missing code change rationale. The least frequent reasons for confusion according to developers are related to the developers themselves and to the link between developer and artifact: propagation of confusion, language issues in the communication, lack of programming skills, and lack of knowledge about the development or code review process.

We conjecture that the most frequent reasons for confusion are top ranked because they are related to processing a large amount of information that is spread across different places. For example, a long or complex code change can be related to many different files (or many places in the same file); organization of work can refer to the same code change addressing multiple issues; and dependency between different code changes involves several changes. As for the least frequent reasons for confusion, we conjecture that they are related to self-admission of confusion by developers themselves, as they pertain to the dimensions related to the developer (and the link between the developer and the artifact), such as lack of knowledge about the development or code review process, lack of programming skills, language issues in the communication, and propagation of confusion.


4.3 Threats to Validity

Similarly to our first study (see Section 3), this survey is subject to three kinds of threats to validity:

Internal validity relates to how conclusions are inferred from the data analyzed. In our survey, this threat relates to how developers recollect past events, i.e., when and how they felt confused in code reviews. We acknowledge that the frequency of confusion might also depend on how often the survey respondents perform code review activities (e.g., daily, weekly, and so on). However, we believe that there is no reason to assume that some reasons for confusion are remembered more easily than others, which mitigates this threat.

Construct validity relates to the concept being studied and its operationalisation, i.e., the degree to which we actually measure what we intend to. One threat to the validity of this study is that survey respondents can misinterpret the questions. We followed the same approach presented in Section 3.3 to reduce this threat. Specifically, we presented our definition of confusion and requested the respondents to confirm whether they understood it. Additionally, we designed our survey based on well-known recommendations (Groves et al. 2009; Kitchenham and Pfleeger 2008; Singer and Vinson 2002; Steele and Aronson 1995). Another threat to construct validity pertains to the measure we employed to rank the reasons for confusion in terms of their frequency. To reduce this threat, we used a dedicated test, the Scott-Knott ESD test (Tantithamthavorn et al. 2017), for measuring, comparing, and clustering the frequency of the responses for the reasons for confusion. One last threat to construct validity is the use of a survey itself, since it relies on developers’ perceptions. Our reason for adopting this approach is the possibility to scale it up, since we can gather information about all the reasons for confusion described in Section 3.2.4 from many developers.

External validity is related to the generalizability of the conclusions of the study. The first group of our survey population targeted Android developers, while the second group targeted a more general software developer population. Thus, we used statistical analysis to verify the similarity between these different populations. The results suggest no difference between the first and the second group, indicating that the responses can be treated as one group.

Another external threat is related to volunteer bias, i.e., the possibility that the subjects who volunteered to participate in a research project differ in some ways from the target population. We tried to reduce this threat by recruiting participants both through personal invitations and via social media. Furthermore, since the likelihood of volunteer bias increases as the refusal rate increases, we ensured the anonymity and confidentiality of volunteers in order to increase participation and, thus, decrease volunteer bias.

5 A Systematic Mapping Study of Solutions and Impacts of Confusion in Code Reviews

The main contribution of the preliminary study, as reported in the previous section, is an ordered list of the most frequent reasons for confusion according to developers (cf. Table 4). As mentioned before, many of the factors we have identified as possible reasons for confusion have been studied in software engineering literature (Bacchelli and Bird 2013; Tao et al. 2012; Kononenko et al. 2015). To contextualize our findings, we perform a literature review. Based on the results of our survey presented in Section 4, we selected the top five most frequently occurring reasons for confusion, as a starting point to conduct a systematic mapping study of the scientific literature. Our goal is to identify their impacts on code reviews, beyond confusion, and the solutions and mitigation strategies researchers have proposed to cope with them. Such strategies might be beneficial for developers facing confusion and complement the currently employed coping mechanisms.

As such, we designed and ran a systematic mapping study that aims to answer the following research questions:

  • RQ5. What are the solutions proposed by researchers for the most frequent reasons for confusion?

  • RQ6. What relationships has previous research established between the most frequent reasons for confusion and their impacts?

The results of this mapping study allow us to complement the framework presented in Section 3 in three ways:

  i. by identifying new coping strategies to address confusion;

  ii. by establishing links between the reasons for confusion and the coping strategies proposed by researchers and employed by developers, as identified by previous studies; and

  iii. by determining how the reasons for confusion and impacts of confusion are connected.

Section 5.1 describes the methodology of the systematic mapping study. In Section 5.2, we present the results of this study, and threats to its validity are discussed in Section 5.3.

5.1 Methodology

The goal of the mapping study is to identify, classify, and understand the solutions proposed by the research community to the most frequent reasons for confusion in code reviews (RQ5), according to the survey described in Section 4. Furthermore, we aim to identify the links between the most frequent reasons for confusion and their impacts on the code review process (RQ6). Based on the results of RQ4, we chose the most frequent reasons for confusion for the mapping study. Then, we conducted the mapping study following the guidelines by Petersen et al. (2008) and Petersen et al. (2015).

To perform the systematic mapping study, we used Parsifal, an online tool supporting systematic literature reviews and mapping studies within the context of software engineering. It provides support for all phases of a mapping study: planning, conducting, and reporting.

Kitchenham and Charters (2007) developed PICO (Population, Intervention, Comparison, and Outcomes), a guideline to identify keywords and formulate search strings from research questions in systematic literature reviews. The guidelines of Petersen et al. (2015) suggest that only P (population) and I (intervention) should be used for systematic mapping studies. In our context, the population is code reviewers, and the intervention is the most frequent reasons for confusion. Due to the large number of reasons for confusion in our framework (30), on the one hand, and the effort required for the mapping study, on the other hand, we considered only the most frequent reasons, i.e., the five topics from the first group in Table 4:

  • Reason #1: Long or complex code change;

  • Reason #2: Organization of work (e.g., an unclear commit message, the status of the code review, or a change addressing multiple issues);

  • Reason #3: Dependency between different code changes;

  • Reason #4: Lack of documentation;

  • Reason #5: Missing code change rationale (e.g., in the commit message, or in code comments).

Since we have five different reasons for confusion, we created five different search strings to simplify the process. Firstly, we defined the string related to code reviews by including several synonyms: (code review OR code inspection OR ((peer code review OR peer review) AND software)). After a few queries, we decided to add the term software as a way to exclude secondary studies from other areas, since the string peer review is also associated with systematic literature reviews. Then, we combined this string with terms related to the specific reason for confusion; for instance, one of the reasons resulted in the search string ((lack OR missing OR omission OR absence OR absent OR unclear OR “not clear” OR bad OR misunderstanding) AND (documentation OR comment OR license)). We did this for each one of the reasons. Table 5 shows all five search strings:

Table 5 The search strings for all the five reasons for confusion
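To make the construction of these strings concrete, the sketch below shows how such boolean queries could be assembled programmatically. The keyword sets and the exact formatting are illustrative assumptions; the authoritative strings are those listed in Table 5.

# Illustrative sketch: assembling search strings by combining the shared
# code-review clause with reason-specific keywords (keyword sets are examples,
# not the exact ones used in the study; see Table 5 for those).
REVIEW_CLAUSE = (
    '("code review" OR "code inspection" OR '
    '(("peer code review" OR "peer review") AND software))'
)

REASON_KEYWORDS = {
    "long or complex code change": ["long", "large", "complex", "complicated"],
    "lack of documentation": ["documentation", "comment", "license"],
}

def build_query(reason: str) -> str:
    """Combine the code-review clause with one reason's keyword disjunction."""
    keywords = " OR ".join(REASON_KEYWORDS[reason])
    return f"{REVIEW_CLAUSE} AND ({keywords})"

for reason in REASON_KEYWORDS:
    print(f"{reason}: {build_query(reason)}")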

We searched for articles in IEEE Xplore, ACM DL, Scopus, and SpringerLink. All the searches were conducted on September 3, 2019, and also included plural forms of the words. For ACM DL, Scopus, and SpringerLink, we could group all five search strings into one and run it once. IEEE Xplore imposes a limit on the length of the search string; hence, we needed to run eight search strings (as the search string related to the organization of work had to be split into three). SpringerLink allows us to filter the articles by discipline, e.g., only computer-science-related articles. However, we decided not to do so because we wanted to include as many scientific papers as possible at this step and could not trust the disciplines as recorded by SpringerLink. Additionally, in the ACM DL query we used the ACM Guide to Computing Literature option, which is the “most comprehensive bibliographic database focused exclusively on the field of computing” and “includes all of the content from The ACM DL Full-Text Collection along with citations, and links where possible, to all other publishers in computing”. Table 6 shows the number of articles returned by each library.

Table 6 Number of articles per library

In all digital libraries except SpringerLink, the search was conducted on the title, abstract, and keywords. Since SpringerLink does not allow one to restrict the search to title, abstract, and keywords only, we initially performed a full-text search, which retrieved 30,128 articles. Thus, we created a script to query the HTML pages of each of the 30,128 articles to extract the title, abstract, and keywords, and then conducted another search round on those fields only (a minimal sketch of such a filtering script is shown after the criteria below). This step resulted in 19 articles. Next, the 427 identified articles were reviewed based on the following criteria:

  • Inclusion criteria:

    • Articles available in full-text;

    • Articles discussing code reviews;

    • Articles subject to peer-review.

  • Exclusion criteria:

    • Books, chapters, proceedings, and gray literature;

    • Duplicate articles;

    • Articles not in the field of software engineering;

    • Articles not written in English;

    • Secondary studies (e.g., systematic literature reviews).
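As mentioned above, SpringerLink results were re-filtered with a script. A minimal sketch of such a script might look as follows; the metadata selectors and the matching logic are assumptions for illustration, since the actual Springer page structure and the exact matching used in the study may differ.

# Sketch of a script that, for each SpringerLink result, extracts title, abstract,
# and keywords from the article's HTML page and keeps the article only if the
# search terms occur in those fields. Selectors are assumptions for illustration.
import requests
from bs4 import BeautifulSoup

def extract_metadata(url: str) -> str:
    """Return a single lowercase string with the article's title, abstract, and keywords."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(" ", strip=True) if soup.title else ""
    abstract = soup.find("meta", attrs={"name": "dc.description"})
    keywords = soup.find_all("meta", attrs={"name": "dc.subject"})
    parts = [
        title,
        abstract.get("content", "") if abstract else "",
        " ".join(k.get("content", "") for k in keywords),
    ]
    return " ".join(parts).lower()

def keep(url: str, terms: list) -> bool:
    """True if any search term appears in the title/abstract/keywords fields."""
    metadata = extract_metadata(url)
    return any(term.lower() in metadata for term in terms)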

The first author started by applying the exclusion criteria, removing the duplicate articles with the aid of the Parsifal tool. In total, we found 155 duplicated articles. Next, the same author applied the remaining exclusion criteria and removed 95 additional articles. To determine whether an article belonged to the field of software engineering, was written in English, or constituted a secondary study, titles and abstracts were used. In order to diminish researcher bias, the 95 articles excluded at this stage were reviewed and confirmed by the remaining authors. Hence, by applying the exclusion criteria, 250 (= 155 + 95) articles were removed, leaving 177 (= 427 - 250). Then, the first author verified the inclusion criteria on the remaining articles: 49 of them did not pass the inclusion criteria, leaving 128 (= 177 - 49) articles for the last step, the full-text reading. Once again, the remaining authors reviewed and confirmed the articles excluded by the inclusion criteria. For the full-text reading step, we looked for any of the five reasons for confusion being mentioned in the articles. We split the 128 articles among the four authors such that each paper was reviewed by two authors, i.e., each author reviewed 64 papers. All disagreements were resolved in online meetings between the authors. Finally, a total of 38 articles were identified as discussing at least one of the five reasons. The number of included and excluded articles is shown in Fig. 4.
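The resulting selection funnel can be summarized with the counts reported above:

# Selection funnel with the counts reported in the text above.
retrieved = 427
duplicates = 155                 # removed with the aid of Parsifal
other_exclusions = 95            # removed by the remaining exclusion criteria
after_exclusion = retrieved - duplicates - other_exclusions    # 177
failed_inclusion = 49
full_text_candidates = after_exclusion - failed_inclusion      # 128
included = 38                    # discuss at least one of the five reasons
assert (after_exclusion, full_text_candidates) == (177, 128)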

Fig. 4 Number of included articles during the study selection process

We developed a simple template to extract data from the articles, as shown in Table 7. Each data extraction field has a data item and a value. The extraction was performed by each author during the selection phase. The items ID, title, publication year, and venue were extracted automatically by the Parsifal tool. The remaining items were extracted manually by the authors.

Table 7 Data extraction form

5.2 Results

In this section, we present the results of our systematic mapping study, which aimed at answering RQ5 and RQ6. Firstly, we provide some general data about the articles selected in the mapping study (Section 5.2.1). Then we discuss the results of RQ5 (Section 5.2.2) and RQ6 (Section 5.2.3), respectively.

5.2.1 General Information About the Selected Articles

In Fig. 5 we show the distribution of the 38 articles per year and kind of venue. We can observe a trend of increasing numbers of studies related to the most frequent reasons for confusion in code reviews (the size and the color gradient of the circles increase with the number of articles). The data for 2019 is incomplete because the study only considered articles published until September.

Fig. 5 Distribution of the articles per year according to the kind of venue. The data for 2019 is incomplete

We also see that the papers investigating the reasons for confusion cover a broad spectrum of venues, including journals (e.g., TSE, EMSE, and JSS), magazines (e.g., IEEE Software), conferences (e.g., ICSE, SANER, MSR, FSE, and ICSME), and workshops (e.g., CSD and MUD). Moreover, we see that these reasons have been discussed at broad-spectrum venues targeting the entire domain of software engineering (e.g., ICSE, APSEC, and FSE), at focused events targeting specific activities within software engineering such as maintenance (e.g., ICSME and SANER), and at venues dedicated to specific techniques used to analyze software data (e.g., MSR, MUD, and PROMISE). Table 8 provides the complete list of the 38 articles resulting from the mapping study, grouped by venue.

Table 8 Articles included in the literature study

5.2.2 RQ5. What are the Solutions Proposed by Researchers for the Most Frequent Reasons for Confusion in Code Reviews?

In Fig. 6, we present the number of articles that address each of the five reasons for confusion. The most common reason is long or complex code change, discussed by almost all of the articles (a total of 31). The remaining reasons for confusion were addressed by a much lower number of articles: organization of work by eight, lack of documentation by five, dependency between different code changes by four, and missing code change rationale by three.

Fig. 6 Number of articles that mention each reason for confusion

In the remainder of this section, we discuss the solutions found in the scientific literature for each one of the five reasons for confusion. It is worth noting that not all articles presented solutions for the reasons for confusion they address.

Long or complex code change:

We found a total of five solutions for this reason for confusion proposed by eight different articles in the literature:

  1. Make the change short and simple: This is the most commonly repeated advice to deal with code changes that are long or complex (Gousios et al. 2014; MacLeod et al. 2018; Sadowski et al. 2018). In fact, Gerrit, a popular code review system, has an option “Show Change Sizes As Colored Bars”: when this option is enabled, the size of the bar indicates the number of changed lines.

  2. Make use of salient files: Not all files affected by a complex change are equally important, and automatic identification of the most important files might reduce the reviewers’ effort. Pascarella et al. (2019) propose an automatic just-in-time identification of defective files in a complex change, while the work of Huang et al. (2018a) introduces the notion of “salient classes”, i.e., the most important class in, and the main reason for, the code change, and builds a classification model to automatically identify them (an illustrative sketch of such a classifier is shown after this list).

  3. Improve code review tools: Code review tools could be extended to provide functionality that is already present in modern IDEs, such as jumping to the definition of an identifier, finding references, or exploring a caller/callee tree (Tao et al. 2012).

  4. Make use of “super reviews”: Allocate the task of reviewing long or complex code changes to the most experienced developers in the team (Kononenko et al. 2015).

  5. Order the changes within the code change: Another way to support developers reviewing long or complex changes is to provide a suggested order of the parts of the code change, so as to reduce the overall cognitive load (Baum et al. 2017).
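As an illustration of the kind of model mentioned in solution 2, the sketch below trains a classifier to rank the files of a change by "salience". The features, labels, and model choice are assumptions for illustration only, not the actual approaches of Pascarella et al. (2019) or Huang et al. (2018a).

# Illustrative "salient file" classifier: features, labels, and model are assumptions.
from sklearn.ensemble import RandomForestClassifier

# One row per file of a (labeled) past change:
# [lines added, lines deleted, methods touched, file name mentioned in description].
X_train = [
    [120, 30, 8, 1],
    [3, 1, 1, 0],
    [45, 10, 4, 1],
    [2, 0, 1, 0],
]
y_train = [1, 0, 1, 0]  # 1 = salient, 0 = not salient

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Score the files of a new change so reviewers know where to start.
new_change_files = [[80, 25, 6, 1], [5, 2, 1, 0]]
salience_scores = model.predict_proba(new_change_files)[:, 1]
print(salience_scores)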

Organization of work:

This is the second most often discussed reason for confusion in the literature. It is a broad topic that gathers different situations related to how work is organized and conducted in a software development project. Even so, we only found two solutions, proposed by seven articles in the literature, for the different aspects of organization of work that may lead to confusion, described below:

  1. Describe the code change: One of the confusing aspects of the organization of work is lack of clarity in the commit message, which may be unclear for a number of reasons: because it is too short, because it does not include rationale, or because it is poorly written. To address this problem, MacLeod et al. (2018) stress the importance of describing code changes in an informative way, particularly emphasizing the motivation for the change and the tests associated with it.

  2. Decompose composite code changes: This is also a common solution proposed for situations when confusion is related to how the code change is organized, i.e., changes addressing multiple issues. Several tools have been proposed to automatically split composite code changes into separate changes (Luna Freire et al. 2018; Guo et al. 2019; Barnett et al. 2015; Guo and Song 2017; Tao and Kim 2015; Konopka and Navrat 2015), e.g., a change that implements a new functionality and fixes a bug is split into two changes, one for the new functionality and one for the bug fix (a minimal sketch of the underlying idea is shown after this list).
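The sketch below illustrates the underlying idea of such decomposition tools under a deliberately simple assumption: the files of a composite change are grouped into connected components of a "relatedness" graph. Real tools rely on much richer program analyses.

# Minimal sketch of change decomposition: two changed files are considered related
# if one file's content mentions the other's module name; each connected component
# could then be submitted as a separate, smaller code change. Purely illustrative.
from collections import defaultdict

def decompose(changed_files: dict) -> list:
    """changed_files maps file path -> new file content; returns groups of related files."""
    module = {path: path.rsplit("/", 1)[-1].rsplit(".", 1)[0] for path in changed_files}
    related = defaultdict(set)
    for a, content in changed_files.items():
        for b, name in module.items():
            if a != b and name in content:
                related[a].add(b)
                related[b].add(a)
    groups, seen = [], set()
    for start in changed_files:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(related[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups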

Lack of documentation:

Of the five articles discussing this reason for confusion, three proposed two solutions for it:

  1. Document the change well: Developers should properly describe their changes and ensure that all decisions made during the implementation and review are also well-documented (MacLeod et al. 2018).

  2. Support the placement of code comments: Code review tools could be extended to assist developers by suggesting appropriate locations to place comments in the source code. Huang et al. (2018b) propose such an approach, which helps developers decide where to add code comments by analyzing code context information. Gousios et al. (2014) also suggested that code review tools should provide automated improvement of documentation.

Dependency between different code changes:

We identified, in three articles, three solutions to address this reason for confusion, which is the third most frequently mentioned in our survey:

  1. Cluster related code changes: Clustering code changes that are related to each other is a simple solution; however, developers need to be careful to avoid submitting different issues in the same code change (MacLeod et al. 2018). This is the trade-off between clustering changes and making them composite.

  2. Create tools to summarize similar code changes: Code review tools could be extended to find similar changes and detect potential mistakes (based on previous changes) in order to support reviewers in understanding the impact of related changes. Zhang et al. (2015) developed a tool that does exactly this.

  3. Use commit-then-review: In order to avoid longer cycle times when there are dependencies between different code changes, such that one has to be committed before another can be started, Baum et al. (2016) suggest using a commit-then-review process instead of review-then-commit.

Missing code change rationale:

This reason for confusion was addressed in three different papers. Of those, only one proposes a solution for the absence of rationale:

  1. Provide the motivation for the code change: This is the most basic solution to resolve confusion due to missing rationale in code reviews (MacLeod et al. 2018).


5.2.3 RQ6. What Relationships has Previous Research Established Between the Reasons for Confusion and Their Impacts?

The results of RQ6 are shown in Table 9. We can observe that long or complex code change and organization of work have the largest number of impacts described in the literature (four each). For the remaining reasons for confusion (dependency between different code changes, lack of documentation, and missing code change rationale) we found only one related impact each in the literature. It is also worth noting that all impacts found in the literature are related to the review process dimension of our framework, except for frustration, which is related to the developer.

Table 9 Relationships between reasons for confusion and their impacts

We believe that the discrepancy in the number of relationships between reasons for confusion and their impacts can be explained by the number of articles addressing each reason in the literature: long or complex code change and organization of work are addressed by the largest number of articles. Below we discuss each of the impacts.

  • Delaying of the code review, i.e., the merge decision, is one of the impacts with the largest number of reasons related to it: long or complex code change (Zhang et al. 2012; Gousios et al. 2014; Pascarella et al. 2019; Baysal et al. 2013, 2016; Sadowski et al. 2018; Tao and Kim 2015; Huang et al. 2018a), organization of work (Guo and Song 2017), and dependency between different code changes (Baum et al. 2016; Zhang et al. 2015; Izquierdo-Cortazar et al. 2017);

  • Decreased review quality relates to the number of problems identified in the code change during the review, i.e., the review is less effective and potentially identifies fewer bugs or non-adherences to project guidelines. The literature shows this is caused by long or complex code change (Baum et al. 2019; Pascarella et al. 2019; Barnett et al. 2015; Kononenko et al. 2015; Faragó 2015; An et al. 2018; Bosu et al. 2014; Yang et al. 2017) and organization of work (Barnett et al. 2015). Some studies also reported that long or complex code changes can cause the introduction of vulnerability issues (Bosu et al. 2014; Yang et al. 2017);

  • Increased development effort is related to long or complex code change and organization of work: the reviewer has to invest more effort to finish the review (Mishra and Sureka 2014; Huang et al. 2018a; Baysal et al. 2013), the code change author needs to submit additional revisions if their code change is long or complex (Baysal et al. 2013), and the reviewer may not know in which part of a long or complex code change to begin the review (Huang et al. 2018a);

  • Review rejection was related to three different reasons for confusion: long or complex code change (Rigby and Storey 2011; Norikane et al. 2017; Gerede and Mazan 2018; Hellendoorn et al. 2015), organization of work (Tao and Kim 2015), and lack of documentation (Norikane et al. 2017);

  • Frustration of the developer is reported in the literature as related to missing code change rationale (Sadowski et al. 2018).


5.3 Threats to Validity

Following Petersen et al. (2015), the following types of validity should be considered for systematic mapping studies: descriptive validity, theoretical validity, and generalizability.

Descriptive validity is related to the extent to which the observations are described accurately and objectively. We designed a data collection form to support the recording of data, and hence, reduce this threat. We used a spreadsheet to record the data, from which some of the data points were automatically extracted with the aid of the Parsifal tool.

Theoretical validity is related to the ability of the authors to capture what they intend to capture during the study. Researcher bias might appear during the application of the inclusion and exclusion criteria, the selection phase, and the extraction of data. Application of the inclusion and exclusion criteria was conducted by the first author, and all excluded articles were reviewed by the remaining authors. The articles remaining for the selection and data extraction phases were split among the four authors in such a way that each paper was reviewed by two authors. The authors checked and resolved all disagreements in online meetings. Furthermore, to reduce bias in the data extraction phase, all the extracted data was recorded in a spreadsheet with pre-established fields. The first author reviewed all the data extracted by the other authors and, when necessary, the extracted data was discussed by two or more authors.

Generalizability concerns the extent to which the study conclusions generalize beyond this study. Our results may not apply to systematic literature reviews, as these are different in their goals.

6 Discussion and Implications

The main contribution of this study is fourfold:

  i. an improved framework for confusion in code reviews (Section 6.1),

  ii. a guideline for developers on how to cope with confusion during code reviews (Section 6.2),

  iii. actionable implications for tool builders (Section 6.3), and

  iv. a research agenda for researchers to provide support for dealing with confusion (Section 6.4).

6.1 Improved Framework for Confusion in Code Reviews

In this section, we revise the framework for confusion in code reviews presented in Section 3 and augment it with the results of the systematic mapping study (Section 5). The results of RQ6 did not reveal any new impact related to the most frequent reasons for confusion: the five impacts we found in the literature review are already described in the original framework. Of those, all except one are related to the review process. This result suggests that the literature should also investigate the remaining impacts identified in our first study (Section 3).

Based on the results of RQ5, we could improve our framework, as we found new solutions in the literature. Of the 13 solutions for confusion we identified in the literature, eight are new to our framework. The final improved framework for confusion in code reviews is presented in Table 10 (the new solutions are shown in italics). We can observe that all new solutions are related either to the review process or to the artifact itself, i.e., the code change. We believe these results highlight the need for more research on the other dimensions, related to the developer and to the link between developer and artifact.

Table 10 The improved framework for confusion in code reviews

6.2 Implications for Developers

We found that long or complex code change is the most frequently experienced reason for confusion in code reviews according to developers, followed by a change addressing multiple issues. These results highlight that, to avoid confusion, patch authors should aim for changes that are simpler, smaller, and non-composite. Based on the preceding discussion, we propose the following guidelines for developers on how to deal with confusion in code reviews.

  1. Before submitting different commits, developers should check and cluster related code changes to diminish the chances of creating dependencies between different code changes (MacLeod et al. 2018), which is the third most frequent reason for confusion.

  2. Long or complex code change is the most frequent reason for confusion in code reviews. Even though this is fairly obvious, developers should keep in mind that making changes short and simple will be beneficial for reviewers and also for themselves, as it improves the chances of their changes being accepted (Gousios et al. 2014; MacLeod et al. 2018; Sadowski et al. 2018). One twist to this formula is that, if changes are simple and strongly related, they should probably be committed together, to reduce reviewing overhead.

  3. Developers should also provide the motivation for their code changes, as it is important to avoid confusion due to missing rationale (MacLeod et al. 2018).

  4. Developers should describe their code changes to avoid submitting unclear commit messages (MacLeod et al. 2018). This will ease the job of reviewers and avoid unnecessary, frustrating, and time-consuming requests for additional information.

We believe that our guidelines are complementary to the guidelines proposed by Rigby et al. (2008), as our results derive from different developers of different projects (Android and others) and add new specific instructions on documentation. For instance, Rigby et al. (2008) described Apache code reviews as: “(a) early, frequent reviews (b) of small, independent, complete contributions (c) conducted asynchronously by a potentially large, but actually small, group of self-selected experts (d) leading to an efficient and effective peer review technique”. Thus, we can observe that their guideline (b) relates to two of our guidelines: making the changes short and simple, and clustering related code changes. The remaining guidelines, we would say, complement each other.

6.3 Implications for Tool Builders

Code reviews are supported by tools such as Gerrit. Currently the only feature of Gerrit that we can relate to confusion reduction is flagging large code changes. Indeed, long or complex code changes are among the most popular reasons for confusion in our framework.

Several problems related to organization of work can also be addressed by the tools supporting code reviews. For instance, Collaborator supports custom templates and checklists that, if properly configured, might require change authors to indicate the rationale of their change. Similarly, decomposition of composite code changes (Luna Freire et al. 2018; Guo et al. 2019; Barnett et al. 2015; Guo and Song 2017; Tao and Kim 2015; Konopka and Navrat 2015) can be integrated into code review tools: e.g., we envision a bot checking the pull request submitted by a developer, decomposing it when necessary, and submitting several pull requests on the developer’s behalf (a rough sketch of such a bot is shown below). If such an intervention proves not to be acceptable for developers, the functionality of the bot can be restricted to the automatic identification of composite changes. Another possibility for code review tools is to present the parts of a code change in a specific order to reduce the overall cognitive load of reviewers (Baum et al. 2017). Finally, the Upsource code review tool of JetBrains is capable of automatically recommending code reviewers for a given change (Kovalenko et al. 2018). Similar techniques might be integrated in other code review tools. In the same vein, different heuristics to find the best group of reviewers can be integrated into these tools.
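As a rough illustration of the envisioned bot, the sketch below uses the PyGithub library to flag pull requests whose size suggests they may be composite. The size threshold and the comment text are assumptions, and actual decomposition would require the analyses cited above.

# Sketch of a bot that flags potentially composite (large) pull requests.
# The threshold and the wording are assumptions; splitting the change itself
# would require change-decomposition analyses such as those cited above.
from github import Github  # PyGithub

def flag_large_pull_request(token: str, repo_name: str, pr_number: int,
                            max_changed_files: int = 10) -> None:
    pr = Github(token).get_repo(repo_name).get_pull(pr_number)
    if pr.changed_files > max_changed_files:
        pr.create_issue_comment(
            f"This pull request touches {pr.changed_files} files "
            f"({pr.additions} additions, {pr.deletions} deletions). "
            "Consider splitting it into smaller, single-purpose changes."
        )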

6.4 Implications for Researchers

The first item in the agenda for researchers is to invest more in the least addressed reasons for confusion in code reviews: organization of work, dependency between different changes, missing code change rationale, and lack of documentation. These are all important reasons for confusion. For example, in the study of Section 3, where we investigated real code reviews and also obtained responses from developers, missing rationale was the most common reason for confusion; notwithstanding, it is rarely addressed in the scientific literature. These four reasons are in the top five most frequent according to developers. Researchers should aim at exploring these topics further, e.g., by creating automatic approaches to extract the rationale of a change based on code comments or on source code elements.

On the one hand, our findings make it clear that developers should not compose different issues (such as a bug fix and a refactoring) in the same code change, i.e., they should decompose composite code changes (Luna Freire et al. 2018; Guo et al. 2019; Barnett et al. 2015; Guo and Song 2017; Tao and Kim 2015; Konopka and Navrat 2015), since long or composite changes are one of the most frequent reasons for confusion. On the other hand, developers should cluster related changes together (MacLeod et al. 2018) to avoid dependencies between different code changes. This is not an easy trade-off to balance. There has been much investigation into how to break composite changes; however, to the best of our knowledge, there are no papers proposing solutions that balance clustering simple, related changes together against producing a change that addresses multiple issues and becomes too complex to understand.

Since we found several studies focusing on the decomposition of code changes and only one about dependency between different code changes (MacLeod et al. 2018), we believe more research is needed to help developers cluster related changes. For instance, researchers can investigate approaches that analyze code changes before they are integrated and suggest combinations of related commits, thus freeing developers from having to submit multiple small, strongly connected changes in separate commits.

Another avenue for researchers is related to the solution of making use of salient files, which targets long or complex code changes. We found two articles (Huang et al. 2018a; Pascarella et al. 2019) arguing that highlighting the important files within the code change can help reviewers in the process of conducting reviews by indicating where they should start and how to proceed when reviewing long or complex code changes. In a similar vein, we envision the use of the task context (LaToza et al. 2006) of the code change author. This context consists of the set of changed files as well as the files and methods the author accessed during the implementation. These information elements can be presented together with the file diffs to the reviewer. This approach reduces the need for navigation by providing the reviewer with information that is likely to be necessary to understand the code change.

7 Related Work

In this section, we discuss the related work. Studies related to code reviews are presented in Section 7.1, while studies related to confusion are discussed in Section 7.2.

7.1 Code Review

Code review has been the focus of a plethora of studies (Bavota and Russo 2015; Bacchelli and Bird 2013; Tao et al. 2012; Kononenko et al. 2015; Hentschel et al. 2016; Mukadam et al. 2013; Hamasaki et al. 2013; Thongtanunam et al. 2014; Yang et al. 2016; van Wesel et al. 2017).

Bacchelli and Bird (2013) introduced the term modern code review which is supported by tools, is informal, and which happens frequently. They explored the motivations, challenges, and outcomes of code reviews by observing, interviewing, and surveying software developers. Their study shows that finding defects is not the only benefit of code reviews, knowledge transfer and team awareness are also advantages coming from reviews. They also show that the main challenge of code review is understanding the code change and its context.

Tao et al. (2012) investigated how the understanding of code changes affects the development process. They conducted surveys and follow-up emails with software designers, testers, and software managers at Microsoft. They showed that rationale is the most important information for understanding a code change; however, respondents mentioned that code changes can be easily understood if a good description is provided. They also discovered that reviewers could benefit more from the code-exploration features provided by common IDEs (e.g., the call hierarchy from Eclipse) when exploring the change context and estimating its risk.

Bavota and Russo (2015) investigated how code reviews influence the chance of inducing bug fixes, and the quality of the code changes as measured by coupling, complexity, and readability. They showed that unreviewed commits are twice as likely to introduce defects as reviewed commits. Furthermore, reviewed code changes have substantially higher readability than unreviewed code changes.

Kononenko et al. (2015) investigated the quality of code reviews in an OSS project by exploring the factors that might affect the reviews. They used the SZZ algorithm to find code changes that introduce defects and then related them to the code review information. They showed that 54% of the code changes that went through the review process introduced defects into the system. Furthermore, personal metrics (reviewer experience and workload) and participation metrics (number of reviewers) are associated with the quality of the code review process. Another interesting result is that the technical properties of the code change (its size, the number of files changed, etc.) have a significant impact on the chance of inducing defects in the system.

Pascarella et al. (2018) investigated, by analysing code review comments, what information reviewers need to perform a proper code review. They analysed threads of comments that started with a reviewer’s question, drawn from a total of 900 code reviews. Additionally, semi-structured interviews and one focus group with developers were conducted to understand developers’ perceptions of code review needs. They found seven high-level information needs, such as the suitability of an alternative solution, the correct understanding of the code change, its rationale, and the context of the code change.

Paixão and Maia (2019) conducted an empirical study to understand the frequency of rebasing operations and their impact on the code review process by performing a large-scale investigation of more than 28,000 code reviews of 11 systems. They found that rebasing operations happen in about 75.35% of code reviews and that about 34.21% of rebasing operations tend to tamper with the reviewing process. The authors also propose a methodology to handle rebasing operations in empirical studies that employ code review data.

As for work related to secondary studies, i.e., systematic literature reviews and systematic mapping studies, we found two articles focused on code reviews. Coelho et al. (2019) focused on refactoring-aware code reviews, in which the reviewers are informed that the code change being reviewed contains a refactoring. They conducted a systematic mapping study in order to gather evidence from studies related to refactoring-aware code reviews in terms of actual support, research trends, and open research topics. Their findings show a lack of proper support for reviewing code changes with different types of refactorings, and a need for more empirical investigation of the effectiveness of refactoring-aware solutions for code reviews (both in open source and industrial scenarios).

Schettino et al. (2019) conducted a systematic mapping study focusing on code reviewer recommendation, with emphasis on application contexts, input data, and empirical validations. They found that several researchers try to validate their work with open source datasets, with GitHub being the most used. Furthermore, the literature proposed the following data as input for the recommendation systems: social relationships, revision expertise, and development. These inputs were evaluated with Top-k and review activeness metrics.

7.2 Confusion

Confusion has been studied before, also in relation to complex cognitive tasks (D’Mello and Graesser 2014; D’Mello et al. 2014). Approaches to the automatic identification of confusion have recently been developed, based on natural language processing (Yang et al. 2015; Jean et al. 2016; Ebert et al. 2017). Yang et al. (2015) used the textual content of comments from a forum and its clickstream data to automatically identify posts that express confusion. Their model to identify confusion comprises questions, users’ click patterns, and users’ linguistic features based on LIWC words. They tried to identify the reasons why users are confused by looking at their recent click behavior. Jean et al. (2016) proposed an approach to detect uncertain expressions based on the statistical analysis of syntactic and lexical features. Ebert et al. (2017) assessed the feasibility of automatic recognition of confusion in code review comments based on linguistic features. They assessed the performance of several classifiers based on supervised training, using a gold standard of 800 comments manually labeled as indicating or not indicating a developer’s confusion.
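For illustration, a simple supervised pipeline of the kind explored in this line of work could look as follows; the tiny example data, features, and model are assumptions and do not reproduce the exact setup of Ebert et al. (2017).

# Illustrative confusion classifier for review comments (toy data and model;
# not the exact classifiers or linguistic features used by Ebert et al. 2017).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = [
    "I don't understand why this method was removed.",
    "Looks good to me, thanks for the fix.",
    "What is the rationale for changing this constant?",
    "Nit: please rename this variable.",
]
labels = [1, 0, 1, 0]  # 1 = expresses confusion, 0 = does not

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(comments, labels)

print(clf.predict(["Not sure I follow what this change is doing."]))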

Confusion-related phenomena have also been investigated in code reviews. Uwano et al. (2006) proposed the use of eye tracking to characterise the performance of developers performing code reviews. They developed a system that captures the source code line number the reviewer’s eyes are looking at. It is also able to record the transition from one line to another when the reviewer’s eyes move, as well as the time spent on each line. Their system was used to perform an experiment with five students reviewing code changes. As a result, they identified a specific pattern in reviewers’ eye movements: the “scan”. This pattern is characterised by the reviewer reading the entire code before investigating each line in detail. Furthermore, reviewers who did not spend sufficient time on the scan tended to take more time to find defects.

Ram et al. (2018) aimed to obtain an empirical understanding of what makes a code change easier to review. They empirically defined reviewability in terms of how well the code change is: i) explained (e.g., in the change description), ii) properly sized and self-contained (e.g., small changes), and iii) aligned with the coding style of the project. They examined academic literature as well as blogs and white papers, interviewed professional developers, and evaluated a tool to rate the reviewability of code changes. They found that reviewability is affected by several factors, such as the change description, size, and a coherent commit history.

Barik et al. (2017) conducted an eye tracking study to understand how developers use compiler error messages. They found that the difficulty experienced by developers while reading error messages is a significant predictor of task correctness and that it also increases the overall difficulty of resolving a compiler error.

Gopstein et al. (2017) introduced the term atom of confusion for the smallest code pattern that can reliably cause confusion in a developer. Through a controlled experiment with developers, they studied the prevalence and significance of atoms of confusion in real projects. They showed that the 15 known atoms of confusion occur millions of times in programs like the Linux kernel and GCC, appearing on average once every 23 lines. They reported a strong correlation between these confusing patterns and bug-fix commits, as well as a tendency for confusing patterns to eventually be commented on.
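The atoms catalogued by Gopstein et al. (2017) are C patterns; purely as an illustration of the general idea, the snippet below shows a confusing Python micro-pattern and a clearer rewrite with the same behaviour.

# Illustrative analogue of a confusing micro-pattern (the actual atoms of
# confusion are C patterns). Both functions return the same result.

def status_confusing(count, limit):
    # Old-style "and/or ternary": correct here, but easy to misread.
    return not count > limit and "ok" or "over limit"

def status_clear(count, limit):
    # Explicit conditional expression makes the intent obvious.
    return "ok" if count <= limit else "over limit"

assert status_confusing(3, 5) == status_clear(3, 5) == "ok"
assert status_confusing(7, 5) == status_clear(7, 5) == "over limit"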

The work presented in this paper is complementary to the ones discussed so far. To the best of our knowledge, our two studies are the first that aim at building a framework of what makes developers confused during code reviews, the impacts of that confusion, and the strategies developers adopt to overcome it. Additionally, we conducted the first systematic mapping study focused on the reasons for confusion in code reviews.

8 Conclusion

The omnipresence of code reviews calls for careful attention to the obstacles and problems developers experience when reviewing source code or authoring code being reviewed. In this paper, we describe two empirical studies that we conducted to understand the reasons for confusion, its impacts, and the strategies available to deal with it.

We built a confusion framework with 30 reasons for confusion, 14 impacts, and 13 coping strategies adopted by developers. To this aim, we used a concurrent triangulation strategy combining a developer survey and the content analysis of code review comments in Gerrit. Furthermore, we surveyed developers and identified which of the 30 reasons for confusion are experienced most frequently. We found that the most frequent reasons for confusion are: long or complex code changes, poor organization of work, dependencies between different code changes, lack of documentation, missing code change rationale, and lack of tests.

We conducted a systematic mapping study of the scientific literature, which revealed 13 solutions to the most frequent reasons for confusion in code reviews. Moreover, we found that the literature has established relationships between such reasons for confusion and five of the impacts in our framework.

Based on our findings we formulated guidelines for developers on how to deal with confusion, suggestions for tool builders on how to support the code review, as well as an agenda for researchers interested in studying code reviews.