Investigating the effectiveness of peer code review in distributed software development based on objective and subjective data

Code review is a potential means of improving software quality. To be effective, it depends on different factors, and many have been investigated in the literature to identify the scenarios in which it adds quality to the final code. However, factors associated with distributed software development, which is becoming increasingly common, have been little explored. Geographic distance can impose additional challenges to the reviewing process. We thus in this paper present the results of a mixed-method study of the effectiveness of code review in distributed software development. We investigate factors that can potentially influence the outcomes of peer code review. The study involved an analysis of objective data collected from a software project involving 201 members and a survey with 50 practitioners with experience in code review. Our analysis of objective data led to the conclusion that a high number of changed lines of code tends to increase the review duration with a reduced number of messages, while the number of involved teams, locations, and participant reviewers generally improve reviewer contributions, but with a severe penalty to the duration. These results are consistent with those obtained in the survey regarding the influence of factors over duration and participation. However, participants’ opinion about the impact on contributions diverges from results obtained from historical data, mainly with respect to distribution.

Witter dos Santos and Nunes Journal of Software Engineering Research and Development (2018) 6:14 Page 2 of 31 required formal meetings and checklists (Kollanus and Koskinen 2009). Today, such a practice is more informal, being referred to as Modern Code Review (MCR) (Bacchelli and Bird 2013). It is often assisted and enforced by tools, such as Gerrit (Google 2017a). The effectiveness of code review depends on different factors and, when it cannot provide expected benefits, it becomes a costly and time-consuming task (Czerwonka et al. 2015; Thongtanunam et al. 2016a). For example, if there is a time gap between the completion of a change and its review by a peer, the author may have its work partially blocked, possibly affecting the whole software release (Thongtanunam et al. 2015b). This lack of dynamism in the code review activity increases the work in progress of teams, as new tasks are started while waiting for the pending reviews. Furthermore, the context switching between coding tasks and reviews may also have a negative impact on developers' work.
To understand the factors that positively and negatively affect the effectiveness of code review, previous studies were performed, e.g. (Thongtanunam et al. 2015a;Baysal et al. 2016;Yang 2014;Bosu et al. 2015). Examples of investigated factors are the patch size, the nature of the change, or author's company-that is, both technical and nontechnical factors have been investigated. Moreover, to evaluate effectiveness, different criteria have been adopted, such as the review duration and the number of defects found after code review. As a result, relevant conclusions regarding code review have been reached. For instance, developers from other teams provide fewer but more useful feedback than those from the same team (Bosu et al. 2015). Despite all the significant results obtained so far, code review has been investigated only to a limited extent in the context of geographically distributed software development (Sengupta et al. 2006), which is becoming increasingly common over the last decades. In the late 90s, researchers focused on enabling formal code inspections, which involve meetings, in distributed scenarios (Perpich et al. 1997;Stein et al. 1997). In modern code review, in contrast, tool support and asynchronous communication help deal with geographic distribution. However, the effects of geographic distribution on the outcomes of code review (such as duration or reviewer engagement) have not been explored. Recent studies of code review in distributed software development are limited to experience reports on code inspection (Meyer 2008).
We thus in this paper focus on exploring how both technical and non-technical factors influence a set of metrics that are indicators of the effectiveness of code review in the context of Distributed Software Development (DSD). We present the results of a mixed-method study in which we investigated the relationship between four influence factors-namely number of changed lines of code, involved teams, involved locations and active reviewers-and the effectiveness of code review. As there is no single objective metric that captures whether a review is effective, we measured and analyzed different review outcomes that can be seen as an indication of the review effectiveness, such as reviewer participation and number of comments. The study involved (1) an analysis of objective data collected from a software project; and (2) a survey with 50 practitioners with experience in code review. This study is an extension of our previously presented work (Witter dos Santos and Nunes 2017), which was complemented by the survey that allows us to compare the results obtained with both research methods.
The first part of our study, referred to as repository mining, is based on a large amount of data (8329 commits and 39,237 comments) extracted from the code review database

Related work
Since the pioneering work of Fagan (1976) on formal code inspections, many researchers proposed approaches to improve this well-structured and phased form of code review (Parnas and Weiss 1985;Bisant and Lyle 1989;Martin and Tsai 1990). With the popularity of DSD, other researchers investigated how to make code inspections feasible when the involved people cannot physically meet in a particular location (Perpich et al. 1997;Stein et al. 1997). Despite its popularity among researchers and practitioners, formal code inspection and its variations have received less attention since the early 2000s (Kollanus and Koskinen 2009).
More recently, much work focusing on modern code review has been done, ranging from studies that investigate what leads to successful code review to approaches that recommend suitable reviewers. For example, in Balachandran (2013)'s approach, recommended reviewers are those that made the most recent changes in the portion of code to be reviewed. His approach was improved by Thongtanunam et al. (2014), for projects with specific characteristics, using the File Path Similarity (FPS), which takes into account previous changes with similar paths or file names. These approaches were extended by also considering similarity among past commit messages (Xia et al. 2015) and recent activity of the possible reviewers (Zanjani et al. 2016). Viviani and Murphy (2016) took another direction by prioritizing pending reviews for each reviewer instead of finding the best candidate reviewers for a given change. This is motivated by the fact that several projects have a high concentration of review requests in a small group of contributors (Yang 2014).
Despite all these significant contributions to the field of code review, it is crucial to understand the factors that influence the effectiveness of code review to, for example, provide foundations to improvements while making reviewer recommendations. Therefore, many studies focus on providing a deeper understanding of code review, and its influence factors (e.g. number of changed lines of code and experience of individuals) and outcomes (e.g. duration and discussion among reviewers). Although such studies are similar to ours, they do not focus on DSD. We next discuss technical and non-technical influence factors investigated in existing studies.

Investigation of technical factors
Different correlations were studied involving technical factors. Thongtanunam et al. (2015a)'s study provided evidence that reviewers are less rigorous and find fewer defects on files with a high incidence of defects in the past, focusing on superficial aspects, such as coding standards rather than on functional aspects. In a more recent study (Thongtanunam et al. 2016a), the same authors identified that bug fixes typically receive the first feedback faster than implementations of new features. Moreover, they reported that changes with detailed and explanatory commit messages have lower stale rates, while those that are poorly described receive less attention of reviewers.
Focusing on the code review duration (in working days), a few influence factors were investigated. Bosu et al. (2015) concluded that the patch size affects the duration in most of the analyzed cases, while task priority in the release plan and the affected software components have only occasionally influenced some of the projects analyzed by Baysal et al. (2016).

Investigation of non-technical factors
Non-technical factors also received attention recently. As stated by Czerwonka et al. (2015), the social network that naturally emerges inside the companies or projects should be considered as well as the specific reviewers' skills and their availability and willingness to review. An analysis of the social network of three open source projects (Yang 2014) revealed that the most active reviewers have central roles in the social network of those projects and are frequently some of the most important contributors. Bosu et al. (2015) observed, in a particular organization, that 75% of the code review feedbacks come from members of the author's team, but are slightly less useful than those from other teams. Baysal et al. (2016), in turn, pointed out that when multiple organizations contribute to the same project, the code review can take more time to be completed and have higher rejection rates depending on which organization is authoring or reviewing a patch, based on the analysis of several case studies.
The experience of authors of the code under review has also been pointed out as relevant in code review. Senior members of the company and those with recognized expertise usually receive more priority, faster and more detailed feedback, enabling a faster code review with better results for the quality (Baysal et al. 2016;Rahman et al. 2016). The experience of the reviewers is relevant as well, based on results of the investigation of a large company (Bosu et al. 2015)-the quality of provided feedback increased during the first year in the company and then stabilized in a plateau.
In a study involving three large open source projects, Thongtanunam et al. (2016a) also investigated non-technical factors, focusing on how the code review was affected by prior events on the files under review. Their conclusions are (1) files that received a slow initial feedback in the past will also likely receive slow feedback in the future; (2) files with more authors and reviewers in the past receive more attention; and (3) the number of changed files, directories and the length of the commit message are also important.
Summary Given that many factors that influence code review have been investigated, we summarize what each previous study analyzed in Table 1. Rows in this table consist of the examined influence factors, while columns represent the analyzed outcomes associated with code review. In cells, we list the studies that focused on the relationship between a given influence factor and outcome. Some of these studies analyzed MCR targeting FLOSS (Free, Libre and Open Source Software) projects, such as OpenStack, Qt, and LibreOffice, which present DSD characteristics. However, we emphasize that most of these studies did not investigate the impact of distribution: factors associated with distribution were random variables rather than independent variables. For instance, Baysal et al. (2016) reported that some analyzed companies had co-located groups, while others used DSD, without treating this issue as a dependent variable. Similarly, Bosu et al. (2015) found that comments from other teams are slightly more useful, but without considering co-location or the number of involved teams.
As can be seen, different combinations of influence factor and outcome have been analyzed. Differently from previous work, our study focuses on DSD and, therefore, we focus on other influence factors, such as the number of involved cities and teams. Some of our investigated factors, e.g. patch size (LOC), have already been studied, but not in a DSD scenario. Moreover, we analyze four different outcomes of code review, which are described in next section together with other details of our study settings.

Study subject
Our study is based on the analysis of data collected from a (commercial) software project and developers from a single software development company. Due to the project size, we were able to collect a large amount of information regarding its code review. We next describe the code review process of the project, provide details about the collected data, and characterize the participants of our survey. No further information can be given due to a confidentiality agreement.

Code review process
We overview the code review process followed in the target project in Fig. 1. First, authors send a piece of code to be reviewed. Anyone can, at any point in time, invite reviewers or add itself as a reviewer, what would allow any (interested) developer to contribute. Moreover, in our target project, Gerrit is configured with the reviewers-by-blame plugin (Google 2017c), which automatically adds reviewers based on the last changes made on the files to be reviewed, as proposed by Balachandran (2013). The code is then analyzed by automated reviewers that check several quality criteria, such as compilation, cyclomatic complexity, lack of documentation, failed unit tests, among other static analysis and runtime verifications. This automated verification usually takes less than 15 min to execute and rejects the change if any critical test fails, so that the author can fix the reported issues. Human reviewers and authors can discuss, ask and provide suggestions for each line of code. Moreover, each reviewer can vote to summarize its feedback using one of the following values.
Veto The reviewer considers that the change cannot be integrated without fixing the reported issues or answering questions made. This prevents the commit to be merged. Rejection The reviewer recommends fixes before the change is merged. Neutral The reviewer typically asks easy questions to be answered. Acceptance The reviewer considers that, though the change it adequate, it needs more reviews from other developers. Approval Only maintainers of the module associated with the commit have this kind of vote, as they are responsible for the module quality. Maintainers can perform technical reviews, but must also verify that relevant developers are not missing in the list of invited reviewers and that the overall state of the code review is adequate.
It is important to note that all invited reviewers, but the maintainer, are not obliged to provide feedback. Before approving the change, the maintainer of each module should consider if the most important reviewers already reviewed the code. In the end, the piece of reviewed code is submittable if all the following conditions are satisfied: (i) there is no rejection from automated reviewers; (ii) there is no veto; and (iii) the maintainer has approved the change. If all these conditions hold, the maintainer is able to merge the change into the destination branch.

Analyzed data
The target project of this study involves the development of an operating system for embedded systems of routers and switches, using the C, C++, and Yang (Internet Engineering Task Force (IETF) 2017) languages. This project has a total of 269 repositories, from which 63 are dedicated to test automation, using Python, Vagrant, and Ansible. We consider that the operating system code and its tests are part of the same project, as the developers implement both firmware and tests for each task. All repositories are configured to reject the merges without code review.
The mined data refers to a period of 72 weeks, starting in October 2014. In the collected data, we had a total of 11,109 code reviews. After filtering these data (see next section), we obtained 8329 code reviews associated with 39,237 comments (an average of 4.7 comments by review). Such code reviews are associated with: (i) 201 experienced developers; (ii) 4 development locations in 4 different cities in the same country and time zone; and (iii) 21 different teams. Members of a given team work on the same location, i.e. there are no distributed teams. All teams are organized as feature teams and use Scrum with three-week sprints to release new software versions every three months. A continuous integration pipeline is used to run functional tests on several test environments that contain network topologies with real products and emulators.

Survey participants
Our goal by conducting a survey with software developers of the same company is to be able to compare the perceptions of developers with concrete objective data. The survey was conducted in January 2018, when we randomly invited 80 developers to participate. From them, 50 voluntarily provided a response within a week (response rate of 62.5%). However, 5 participants reported (very) low experience in code review (as author or reviewer) and these were discarded because they could provide unreliable answers. Because our survey involved other projects and a different time period of the first part of the study, many participants were not developers of our target project during the period in which we collected data. However, the projects of which participants are members use Gerrit to implement and reinforce modern code review practices. Although there may be divergences in the software development process as a whole, all these projects adopt the code review workflow described in Fig. 1. Table 2 provides detailed information about 45 participants, including age, education and experience. Answers that indicate the participant experience-used solely for the purpose of characterizing our sample-were self-reported based on the participant's subjective view. We can see that 97.8% of the participants reported medium to very high experience with projects with multiple teams, whereas 93.3% reported medium to very high experience with projects with multiple locations, suggesting that they have experience in DSD.

Study settings
After discussing our target project, we now proceed to detailing our study. We first state our goal and research questions, then describe collected metrics and finally the procedure of the two parts of our study, namely repository mining and survey.

Goal and research questions
To design our study, we followed the goal-question-metric (GQM) paradigm (Basili et al. 1986). Therefore, we first specify our goal using the GQM template and derived research questions. Our goal is detailed next. To understand the factors that influence code review in distributed software development, characterize and evaluate the relationship between different influence factors and code review effectiveness from the perspective of the researcher as code review is performed by software developers in a single project study.
Based on this goal, we derived a set of research questions, each associated with one of the influence factors investigated in our study. There are both technical and non-technical factors. As said, although some have been investigated in the past, it is our goal to analyze them in the context of DSD. For short, we refer to our investigated scenario as distributed code review. Our research questions are listed as follows.  We investigate the number of teams and locations separately because the former captures distribution among teams, allowing us to analyze the impact of involving reviewers that have different project priorities and goals (possibly conflicting) and limited interaction, while the latter additionally captures the impact of geographic distribution.

Influence factors and outcomes
Each research question is associated with an influence factor to be investigated, with respect to their impact on the effectiveness of distributed code review. However, there is no unique metric to measure review effectiveness. Therefore, we consider a set of outcomes of code review, which are measured. They can be used as indicators of the Page 10 of 31 review effectiveness. Before detailing these outcomes, we next further specify our influence factors-which are listed following the order of our research questions-detailing how they are measured.
Patch Size (LOC) The patch size (LOC) is used to refer to the number of lines of code added or modified in a commit and thus need to be reviewed. This lines of code considered are those present in the final version of the code, after going through the reviewing process. Teams Teams refer to the number of distinct teams associated with the author and invited reviewers. If the author and all reviewers belong to the same team, the value associated with this influence factor is 1. Locations Locations refer to the number of distinct geographically distributed development sites associated with the author and invited reviewers. If the author and all reviewers work in the same development site, the value associated with this influence factor is 1. Active Reviewers Actives reviewers are those that actually participate in the reviewing process-with comments or votes-from those invited. Although this can be seen as an outcome of the review, given that there is no control of how many of the invited reviewers will actually participate, we aim to explore if the number of active reviewers influence other outcomes, such as duration. Therefore, active reviewers are investigated as an influence factor, consisting of the number of reviewers that contributed to the review. Now we focus on describing the analyzed code review outcomes that indicate the review effectiveness. Code review is effective when it achieves its goals, which can be untimely to identify defects in the code, issues related with code maintainability and legibility, or even to disseminate knowledge. However, these goals might include constraints regarding the impact in the development process and invested effort.
It is not trivial to evaluated whether these goals are achieved. For example, Bosu et al. (2015) created a model to evaluate whether the comments of a code review are useful based on the text of the given comments. This measurement, however, may not be precise. In our work, we focus on measurements that are more objective.
We thus selected four objective outcomes, described as follows. The first is related to project time constraints, while the remaining three are related to the input from other developers (reviewers) leading to possibly less failures, code quality improvement and knowledge dissemination.

Duration (DUR)
Duration counts how many days the code review process lasted, from the day that the source code is available to be reviewed to the day that it received the last approval of a reviewer.

Participation (PART)
Participation consists of the fraction of invited reviewers that are active, ranging from 0% (no invited reviewer participates) to 100% (all invited reviewers participate). Automated reviewers are not taken into account.
Comment Density (CD G ) Instead of simply counting the number of review comments, we take into account the amount of code to be reviewed. Therefore, comment density refers to the number of review comments divided by the number of groups of 100 LOC under review, thus giving the average number of review comments for each 100 LOC. Review comments can be any form of interaction, e.g. approval, rejection, question, idea or other types of comments made by any reviewer-votes count as comments because they are a form of input and have a particular meaning. A multiplying factor of 100 is used to avoid small fractioned numbers, which are harder to compare and less intuitive. Comments from automated reviewers are ignored, as this type of feedback is a constant, regardless of human interactions.

Comment Density by Reviewer (CD R )
It is expected that the higher the number of reviewers or teams, the higher the number of comments. Therefore, CD G alone can lead to the wrong conclusion that discussions were productive when many reviewers are involved. We thus also analyze comment density by reviewer, given by the division of the comment density by the number of active reviews (without taking into account automated reviewers).
As we are interested in the effectiveness of code review, we next describe what we consider an effective code review based on the outcomes considered in the study.
Too short or too long code review. There are studies (Kemerer and Paulk 2009;Ferreira et al. 2010) that suggest time constraints for code review activities, limitation on the number of lines reviewed per hour and also the total amount of hours spent doing code review in a single day. Such limitations are imposed because the code review may become error-prone or even consume more time and resources to be finished due to tiredness. Moreover, if the review takes too long (i.e. high duration) to be completed, developers may be prevented to continue their work and also work does not get done. Therefore, shorter code reviews are preferred. However, if such review is too short, it may also mean that reviewers have not properly analyzed the change.
Low reviewer participation. When reviewers are invited to participate in the review, it is expected that they contribute. However, not all participate. Therefore, the higher the participation of reviewers, the better. Nevertheless, we do not expect that participation is 100%, given that there are developers that are invited automatically and may not be relevant reviewers anymore.
Few contributions from reviewers. Reviewers may contribute in different ways, ranging from a simple vote to long discussions. We assume that the higher the number of comments made by reviewers, the more fruitful the discussion and consequently the more effective the review. However, as explained, we do not consider the absolute number of comments, but its density considering the amount of code to be reviewed. Moreover, we consider the amount of contribution generally (CR G ) and by reviewer (CR R ). For both, the higher, the better.
Although in some situations a low number of comments (either generally or by reviewer) is enough-for example, when a low number of comments helped to improve the code, or the change to be reviewed is minor-note that these might be not the usual case. Because we analyze a high number of code reviews, these exceptional cases do not significantly impact the results. Moreover, votes count as comments; consequently, even if there is no need for long discussions, it is important to have at least the acknowledgement of the reviewer in the form of a vote, i.e. a comment. Finally, we also analyze duration and participation, which complement the analysis of the effectiveness of code review.

Procedure: repository mining
In short, the procedure of the first part of our study consists of the extraction of information from a code review database and analyses of the relationship between influence factors and outcomes. In this section, we detail how we operationalized this.
Our data is extracted from Gerrit 1 , a tool that provides the management of Git repositories with fine-grained control over the permissions for users and groups. It also provides a mechanism to implement code review, with its associated approvals, allowing votes, comments, and edition of the source code. Every interaction among authors and reviewers is recorded, including the comments and votes of the review bots, which are automated reviews. It also provides a sophisticated query mechanism to get information about all open and closed reviews. We next describe the steps taken to obtain, process and filter the data for this study using Gerrit.
Getting raw data from the code review database First, we fixed a time frame in the past so we can get data completed code reviews. Gerrit provides a query mechanism (Google 2017b) that can be used to get structured information about code reviews in JSON format. One query for each week had to be made due to the limitation of obtaining at most 500 results per query.
Parsing and filtering code review information The retrieved JSON files provided part of our required data. The remaining data had to be computed from the raw data obtained from the internal Gerrit database model. The resulting data was filtered, discarding some reviews of some types of modules, described next.
1 Documentation. Some repositories are used exclusively for internal documentation of the project (processes and products), and have a different workflow and time constraints. 2 Third-party software. Some repositories are maintained by open source communities or component vendors. Local internal copies of these repositories exist due to traceability and to avoid downloading them multiple times. Reviewers do not review code in these repositories. 3 Binary artifacts. Some repositories contain binary files, e.g. images and libraries.
These files are not reviewed and, if considered, would (incorrectly) increase the patch size.
Representing and analyzing data Given that we have four research questions with four associated influence factors as well as four outcomes, there is a large amount of data to be analyzed. Our data consists essentially of continuous or discrete positive numbers, with different scales and ranges. For example, there are only four involved locations while the patch size can be up to approximately 4 KLOC. To deal with these discrepancies, we adopted an approach similar to that of Baysal et al. (2016). We clustered data in groups, representing the variance of outcomes in each group using box plots. Additionally, we performed statistical tests to identify groups that are significantly different from each other.

Procedure: survey
Our survey collected anonymous data from participants. The questionnaire they were given includes four main parts: (i) presentation of the study and consent to participate; (ii) demographic data of participants; (iii) participant background and experience; and (iv) questions about how outcomes of code review are affected by influence factors.
In the first part of the questionnaire, we briefly introduced modern code review and stated the goal of our survey. We explicitly declared that the participation was voluntary and informed participants that any information that could reveal the their identity would be kept confidential. Next, we presented the adopted terminology to ensure that terms used in the questions would be correctly understood. The introduced terms are: teams, locations, authors, reviewers and active reviewers. Next, participants were asked about their demographic information as well as background and expertise, as shown in Table 2.
Finally, we asked participants to assess how they evaluate the relationship between influence factors and review outcomes. The considered influence factors are the same as above. However, as outcomes, we considered duration, participant and number of comments. The reason to not distinguish total comments and comments by reviewer is due to our assumption that it is hard for a participant, without data, to assess these separately. Questions associated with each outcome followed the prototype presented in Table 3, replacing X by the name of the outcome.
For each influence factor, we used a 5-point Likert scale, so participants could provide one of six possible answers: (1) much worse, (2) worse, (3) no influence, (4) better, (5) much better, or "I am unable to inform. " Our questions assumed that there is a directly or inversely proportional relationship between influence factors and outcomes. If participants believe it was not the case, they were asked to state that they were unable to inform and describe their opinion in an open-ended question. Our questionnaire was validated with a pilot study.

Results
Having described our study settings, we proceed to the presentation of obtained results. They are presented according to our research questions, and in each of them, we discuss results associated with each of our investigated outcomes.

Repository mining
In this section, we present the results of the part of our study that is based on objective data, which was obtained by mining the software repositories of the analyzed project.

RQ-1: patch size (LOC)
The first influence factor we analyze is the patch size in terms of LOC. We split our data into 13 groups according to this factor, listed in the first column of Table 4. This table shows our obtained results regarding this influence factor-mean (M) and standard deviation (SD)-for each group, considering each review outcome. We also show the number of code reviews in each group as well as values associated with all reviews, in order to be able to compare overall values with values of each group. For better visualizing the results, they are also presented in Fig. 2. We analyze the influence of patch size in review outcomes by comparing the means across the different groups. Given that our data does not have a normal distribution, we use the Kruskal-Wallis test (nonparametric) to verify whether there is a statistically significant difference among the means. This is also the case for the other analyzed influence factors and outcomes. When there are significant differences, we use Dunn's test for post hoc tests, because the compared groups have different sizes.
Our results show that there are significant differences across the different groups (H = 1761.32, p < 0.05). More specifically, larger patches take longer to have their review completed-which is expected. Kemerer andPaulk (2009) andFerreira et al. (2010) pointed out that there must be a limit of LOC reviewed per hour by a single reviewer and a maximum number of hours of code review per day, in order to achieve good coverage and final quality. However, the relationship between patch size and duration is nonlinear; a linear least-squares regression model produced an r-squared value of 2.9%, which means that a linear model has low explanatory power for the data. For example, the average time to review patches of 601-800 LOC is two times greater than the time to review patches of 61-80 LOC, which are ten times smaller. Among some groups, e.g. 21-40 LOC and 41-60 LOC, there is no statistically significant difference. Due to space restrictions, we do not report all significant differences among groups. They can be seen elsewhere together with more information of our results, such as medians 2 .
Considering participation, we observed that the proportion of invited reviewers that actually provide feedback during the review process decreases when the patch size increases. The difference among the patch size groups is also statistically significant (H = 709.58, p < 0.05), mainly due to differences between groups with 600 LOC or less and larger groups. A possible explanation to this is that larger patches likely require more effort from reviewers, discouraging engagement in the process. When reviewers participate in the code review, the amount of contribution is measured by the overall comment density and comment density per reviewer. Our data shows that in both cases the larger the patch, the lower the comment density. Regarding overall comment density, there are statistically significant differences (H = 709.58, p < 0.05). According to the post hoc tests, this is due only to smaller groups. There is a significant difference only among few groups with more than 601 LOC, but among groups with less LOC, there are significant differences in most cases. Similarly, the comment density by reviewer decreases as patches are larger (H = 3579.57, p < 0.05), showing similar results in post hoc tests. This indicates that the amount of contribution is highly affected as the patch size increases up to a certain point. Then, the amount of contribution is limited but does not decrease after the patch reaches a certain size (> 601 LOC in our study).
One possible explanation for the results regarding patch size is that the patch size has an intimidating effect on invited reviewers, because the time required to provide significant contributions increases. This invested time, in our target project, is not explicitly recorded and is not associated with deliverables considered more relevant, such as produced code. Conclusions of RQ-1: The patch size negatively affects all outcomes of code review that we consider as an indication of effectiveness. Reviewers are less engaged and provide less feedback. Moreover, the duration is not linearly proportional to the patch size, which may affect the quality of code review.
As discussed in the related work section, other studies investigated the impact of the patch size in code review. Bosu et al. (2015) showed that for some projects the proportion of relevant comments decreased by 10%, when they compared changes in 40 files with changes in a single file, while Baysal et al. (2016) showed that changes with more LOC need more iterations to be concluded, but without considering the time interval. Each iteration is typically the result of an accepted feedback or comment. This indicates that results with respect to patch size in non-distributed scenarios also hold for our investigated scenario.

RQ-2: teams
Each code review has a list of involved people, authors, and reviewers, and each one works in a single team. Consequently, every code review has also a list of involved teams based on the list of involved people. The results regarding the number of involved teams vs. review outcomes are shown in Table 5 and Fig. 3. With respect to the groups with 6 and 7 involved teams, we have only two occurrences of code reviews associated with each of them. We thus omitted them from Fig. 3, for legibility.
According to our results, the duration of code review is considerably higher if more teams are involved, with higher mean and also standard deviation values. The latter means more dispersion, as can be seen in the corresponding box plots, having more durations that are outliers. This can be partially explained by the working dynamics of teams, which have different goals, tasks, and managers.
Furthermore, technical divergences are often extensively discussed. When only one team is involved, issues are addressed faster, usually mediated or decided by a senior team member or even by the team manager. However, when more teams are involved, the implicit hierarchy among reviewers becomes flattened and reaching a consensus becomes harder. In this case, divergences usually reach managers and are resolved after meetings, conferences or e-mail discussions, what slows down the review. In our results, there is a statistically significant difference across the different groups of numbers of involved teams (H = 586.72, p < 0.05). Comparing groups in a post hoc analysis, we observed that this is due to the differences among groups with five or less involved teams, except the difference between code reviews involving 4 and 5 teams. Similarly, there are also statistically relevant differences with respect to participation (H = 226.72, p < 0.05) and the post hoc analysis showed that the difference is only significant among reviews with four or fewer teams. However, the results indicate only a small negative influence on this review outcome.
Considering the effect on contributions, differences are also significant (H = 184.71, p < 0.05). Post hoc tests showed that this is due to the difference between reviews involving one team and the others. Although Fig. 3 indicates that the overall comment density increases together with the number of involved teams (except in the case of 5 involved teams), we can see in Table 5 that the standard deviation is high, indicating that results vary a lot, justifying the not significant differences. This can be explained by the specific teams involved, whether they are in the same location or not (an issue that is investigated in RQ-3). In our target project, there is an internal team rotation over the years, as new teams are created, merged or split, with knowledge sharing when teams change, reducing the diversity of skills between author and reviewers and affecting the number of questions, doubts or different opinions. Surprisingly, the comment density by reviewer is higher when two teams are involved, followed by reviews involving one or three teams. The differences among teams are indeed significant (H = 91.94, p < 0.05), with post hoc tests showing that if more than three teams are involved, it actually makes no difference.
Conclusions of RQ-2: We found evidence that code review with more involved teams have lower effectiveness considering duration and participation, but higher effectiveness with respect to the overall comment density. Comment density by reviewer is slightly higher when two teams are involved when compared to reviews involving one or three teams.

RQ-3: locations
People involved in the code review are not only associated with a single team, but also with a single working site. We now investigate the influence of the number of involved locations on review outcomes. Results associated with this influence factor are shown in Table 6   As can be seen, the duration of code review is considerably higher if more locations are involved. With a further analysis of our data, we observed that with two involved locations, reviews that started in the second half of the sprint sometimes were not finished on time, causing a performance penalty to the author's team-as said, code review is mandatory. This can be explained by the natural isolation of people working in different places, which requires daily effort to synchronize priorities and state the importance of every patch under review. Within the same team and location, this communication happens on a daily basis in the Scrum daily meetings or other activities that promote interaction. There is a statistically significant difference among the groups (H = 158.0, p < 0.05), in fact, among all groups, as shown in the post hoc analysis.
There is also a positive influence on comment density (H = 134.05, p < 0.05) and comment density by reviewer (H = 56.12, p < 0.05). However, there is a negative impact on participation (H = 86.69, p < 0.05). Post hoc tests show that for these outcomes the differences actually exist only between code reviews with one and two locations, probably because there are few occurrences involving three locations. One possible interpretation of these results, in addition to the geographical distance barrier, is that code reviews with more involved locations have more diversity of technical skills, which is plausible because teams are organized based on groups of related features and technologies. Moreover, there are few rotations of team members among different locations, creating some form of local technical specialization on each location. This diversity promotes feedback, questions, and comments, but requires more time to complete the review process. Consequently, reviewers from other locations should be invited if there is a good technical reason to do so. Otherwise, the higher duration is not compensated by a higher level of contributions. We also observed that the results with respect to comment density by reviewer have large differences when compared to those discussed in the previous sections. Results show that: (i) the average review duration in the same location is 32% greater than in the same team; (ii) the average duration with two locations is 38% greater than with two teams; and (iii) the average density of review comments with two locations is 24% higher than with two teams.
Conclusions of RQ-3: We found evidence that code reviews with more involved locations have lower effectiveness with respect to duration and participation, but higher effectiveness considering contributions. The overall comment density and comment density by reviewer are considerably higher with more involved locations. The participation is slightly lower with multiple involved locations.

RQ-4: active reviewers
As explained in Section 3, in the code review process, the only mandatory reviewer is the maintainer of the module. Other reviewers are invited, but their contribution is optional. Moreover, anyone can invite reviewers. Those that actually contribute are the active reviewers. The number of active reviewers might influence how other reviewers engage in the discussion, and this is what we investigate in RQ-4. Obtained results regarding the relationship between active reviewers and review outcomes are shown in Table 7 and Fig. 5-the group with 10-12 active reviewers are omitted in this figure because they have only one review occurrence each.
Considering duration, we intuitively expect higher values because more reviewers provide more feedback and comments and need more time to reach a consensus. This is in fact confirmed by a statistical test (H = 1652.94, p < 0.05), with post hoc tests showing that in most cases reviews with less than ten reviewers take more time to complete than with more active reviewers.
We highlight that there are reviews, more specifically 1313, involving one active reviewer. This occurs when the author is the module's maintainer, and thus is the only mandatory reviewer. Consequently, the duration is low in these cases, lasting 1.3 days on average. Exceptional cases occur when the code being reviewed is related to hardware  platforms and their infrastructure modules, where typically one or two developers work on for several months. Reviewer participation is almost the same with more active reviewers. Although there are statistically significant differences among groups (H = 268.49, p < 0.05), the post hoc tests show that this is due to a few groups that have nearly no significant differences, indicating that the more invitees, the more active reviewers.
Considering the overall comment density, there is a statistically significant difference (H = 660.89, p < 0.05) when reviewers contribute. However, the post hoc tests show that the presence of more than two active reviewers does not significantly improve the comment density. Moreover, the comment density by reviewer is actually lower with three or more active reviewers (H = 275.55, p < 0.05). This suggests that a number of two active reviewers seems to be the optimal case considering a trade-off between duration and contributions from reviewers.
Conclusions of RQ-4: We found evidence that code review with more active reviewers has lower effectiveness considering the duration. The participation is slightly lower with more active reviewers. Moreover, having more than two active reviewers does not improve the overall comment density and negatively affects the comment density by reviewer.

Results and analysis: survey
Given that we presented the results of the part of our study that is based on objective data, now we focus on subjective data collected from our survey. This allows us to understand developers' perceptions of the relationship between influence factors and outcomes as well as contrast results obtained from objective and subjective data. As above, each influence factor is analyzed individually, with the aim to answer our research questions. Before we focus on our first influence factor, namely Patch Size (LOC), we present an overview of the answers provided by participants in Fig. 6. Each box shows the proportion of the different answers provided by participants with respect to each pair of influence factor and outcome. Boxes that are concentrated on the left-hand side of the chart indicate that participants consider that there is an inversely proportional relationship between the influence factor and outcome. Boxes concentrated on the right-hand side, in contrast, indicate a directly proportional relationship among them.
In next sessions, we present for each research question a table with statistics about participants' responses, presenting minimum, maximum, median, average and standard deviation values. As Likert scales provide ordinal data, researchers have divergent opinions (Jamieson et al. 2004;Norman 2010) about the use of average and standard deviation values to summarize these data, but we present them as a complementary description of data. Moreover, median and average values are similar considering our data, supporting the use of average to describe our data.

RQ-1: patch size (LOC)
The first three rows of Fig. 6 show that 96% of the participants believe that duration is negatively affected by patches with a higher number of LOC. Similar results are associated with participation, as 80% of participants stated that it becomes (much) worse. However, only 29% of the participants believe that the number of comments decreases for patches with more LOC, while 49% have a different opinion, indicating that more comments are provided. Table 8 provides complementary details of the responses, including the number of participants that were unable to evaluate the influence in certain outcomes. For both duration and participation, average and median values are close to worse (2), with a high concentration of responses around this value. The number of comments, in contrast, has average and median values closer to no influence (3), but the standard deviation is higher, with a wider range of values. This suggests that some of the code review outcomes are more affected than others. To analyze differences among the effects on different outcomes, we conducted a Friedman's test, a non-parametric test as the distribution of values is not normal. The test revealed a statistically significant difference among groups (χ 2 = 54.47, p < 0.05). We then used Nemenyi's test for post hoc tests, which indicated a significant difference among all groups. Consequently, we conclude that duration is the most affected outcome in the opinion of the participants, while the number of comments is the least affected.

RQ-2: Teams
Similarly to the results above, duration is negatively affected by an increased number of teams, as shown in Fig. 6. However, with respect to participation and number of comments, participants have divergent opinions. For participation, their opinions are divided between negative influence and no influence, while for number of comments there is a similar number of answers for all possible relationships (negative, positive or no influence).
These divergences among opinions can also be seen in Table 9, which shows the number of participants who answered each alternative. While duration have both average and median values close to worse (2), participation and number of comments have these values closer to no influence (3).
To verify if there is a code review outcome that is the most affected, we conducted similar statistical tests as above. A Friedman's test revealed a significant difference among groups (χ 2 = 47.73, p < 0.05), with the post hoc tests showing differences among all    outcomes. This suggests that the duration is, once again, the most affected outcome in the opinion of the participants.

RQ-3: Locations
By comparing rows 4-6 to rows 7-9 in Fig. 6, it is possible to observe that results regarding the influence of teams and locations present the same pattern of answers. Table 10, in fact, shows that the number of answers for each alternative as well as average and median values are similar to those presented in Table 9. Consequently, in summary, participants also believe that a higher number of involved locations has a negative influence on duration, but have divergent opinions regarding the other two influence factors. Statistical tests also indicated the same results. A Friedman's test (χ 2 = 43.29, p < 0.05) revealed a significant difference among groups, with the post hoc tests also pointing out duration as the most affected outcome.

RQ-4: active reviewers
Finally, the results associated with active reviews suggest that the higher the number of active reviewers, the worse the duration and participation, being the average associated with the latter closer to no influence. This can be seen in Fig. 6. Differently from the previous analyzed influence factors, participants believe that the number active reviewers have a positive influence on the number of comments (Table 11). By conducting the same statistical tests used to identify significant differences among outcomes, we concluded that there is a significant difference among groups (χ 2 = 37.21, p < 0.05). Moreover, as it is the case of patch size, this is due to differences across all groups. Consequently, duration is the most negatively affected outcome, while the number of comments is positively affected.

Discussion
In this section, we present insights and lessons learned from both studies and compare them. Finally, we discuss the threats to validity.

Lessons learned and comparison of results
Our study involves two parts (repository mining and survey) that aimed to answer the same research questions. The goal of performing the two analyses is to have two sources of evidence (one objetive and other subjective) to answer our research questions. Moreover, the contrast between the opinion of developers and our conclusions from the repository mining might reveal that developers may be unaware of the consequences of inviting reviewers from another team or location. Before we start our discussion, we summarize in Table 12 the derived conclusions regarding how our investigated factors influence review outcomes, considering our investigated scenario in both studies. We highlight conflicting results.

Patch size
As can be seen, the amount of LOC to be reviewed affects all considered outcomes based on both sources of data. Considering the analysis of duration based on our objective data, we found that it is not linearly proportional to the patch size, as depicted in Fig. 7, which uses a scatter plot and a heat map to illustrate the concentration of data. This suggests that the rate of 200 LOC/hour proposed by Kemerer and Paulk (2009) is not followed, potentially making code review less effective and more error prone. A lower review coverage is another possible explanation to justify these results, which may lead to more post release defects (Shimagaki et al. 2016;McIntosh et al. 2014).
The results of the two parts of our study are consistent, because they both indicated negative effects of LOC on patch size on duration and participation. However, the perceived effect on the number of comments is close to neutral. This diverges from our conclusions from the repository mining, which indicated that fewer comments are provided for larger patches, so participants underestimate the possible harmful effects of larger patches in the discussion. Interestingly, one of the participants pointed out that after a certain number of LOC, the duration would likely decrease, causing reduced review quality. This was actually observed in our objective data for patches with more than 3 KLOC.
This suggests that large patches should be avoided and that further investigation is necessary to determine which kinds of tasks are likely to produce larger patches, and possibly use other forms of code review, like pair programming, to reduce the severe penalty over the duration.

Teams and locations
Regarding the effects of the number of involved teams, both objective and subjective data indicate an agreement of the negative effects on duration. This agreement also holds for participation. Nevertheless, the average perception (i.e. no influence) of the effect over the number of comments is different from that obtained with the repository mining, which suggested that more comments are provided when more teams are involved. The effects of the number of locations, considering both parts of the study, on duration and participation are also negative. However, concerning the number of comments, a higher number of comments is expected when more locations are involved, according to the study based on repository mining, as opposed to no influence reported by participants of the survey. When comparing the effects of teams and locations, the study based on repository mining suggested that discussions are more fruitful with multiple locations involved, while the results of our survey suggest that there is no influence of teams and location on the number of comments.
The consistent effects of teams and locations over duration and participation might be associated with the lack of awareness of other teams' priorities, dependencies and schedules. The definition of collocation adopted by Olson and Olson (2000) is that teams are collocated if its members can reach each other with a short walk of 30 meters, suggesting that even people working in the same building or on the same floor are subject to the same issues. This shows that developers should carefully consider who from other teams and locations they invite as reviewers, and that they should put extra effort to ensure that external reviewers are available to do their job in reasonable time; collaboration readiness was in fact reported as a major factor on DSD (Olson et al. 2008;Olson and Olson 2013;Bjørn et al. 2014).

Active reviewers
The results obtained based on the mined project data give evidence that code review with more involved (active) reviewers are less effective, with very significant drawbacks on the average duration and density of review comments per reviewer. Results obtained based on our survey are consistent with this finding considering duration and participation. However, the results of our survey showed that participants often overestimate the positive effects of active reviewers to the total number of comments when more active reviewers are present, as 68% participants reported that the number of comments is (much) higher when more reviewers are active, while the study based on repository mining suggested that having more than two reviewers does not improve discussion. This suggests, again, the importance of choosing adequate reviewers and calls for sophisticated code reviewer recommenders.

Contributions
On the contributions of this paper, considering the related work on modern code review that we presented in Table 1, this work analyzed the influence of different influence factors, such as the number of teams, locations and active reviewers. The same holds for the code review outcomes, as these studies did not analyze the participation and comment density. Moreover, our mixed-method study evidentiated the dissonance between participants' perception and metrics extracted from code review databases in some situations. By collecting the opinion of participants using a form that is symmetric with the research questions, our study presented a significant degree novelty when compared to related work, as only Bosu et al. (2015) used insights obtained from interviews with developers create a definition for the usefulness of code review comments.
Code review practitioners can benefit from the discussion section, which was based on the analysis of results from both parts of this study and should foster critical thinking on simple, daily decisions inside the teams. Although some of the suggestions are relatively simple to adopt, the existence of the problem itself is not always evident. For instance, having more invited reviewers does not mean that the code will be reviewed faster or that more comments will be provided, so authors should carefully select reviewers. When reviewers from other teams or locations are involved, the likelihood of having extra delays is something that the teams should be aware of, and eventually adopt countermeasures instead of just assuming that review is in progress. Similarly, delivering smaller patches improves the code review process, so smaller tasks are preferable.

Threats to validity
Construct validity As mentioned in Section 3, our target project involves many programming languages, including Yang, which is a way to represent data. When counting and analyzing lines of code, code written in all these languages are treated equally. Reviewing the same amount of code in one language may require more time than in another. However, considering the developers' expertise, they do not state more difficulty in reviewing code in particular languages, and also the involved languages are not largely different with respect to verbosity. Furthermore, many medium and large software projects use many programming languages. Therefore, the amount of code in different languages is considered a random variable rather than a confounding variable of the study. Considering our survey, we adopted a single variable, total number of comments, instead of the complementary metrics of the first part of the study, namely Comment Density per 100 LOC and Comment Density per 100 LOC by Reviewer. Although complex metrics help the analysis of influence of factors on comments, we assumed that a single variable would be more adequate for developers to evaluate in the survey. The adequacy of this choice was confirmed by our pilot study. However, this may cause the comparison between the results of the repository mining and the survey to be less accurate.

Internal validity
We identified five internal threats. First, given that we analyzed an extensive period of our target project, its developers changed over time. However, as the number of developers and analyzed reviews is large, individual developers' behavior and expertise have a low impact on the obtained results. Moreover, this change in development teams is expected in any software project.
Second, in most of the cases, authors and reviewers communicate using Gerrit to provide feedback, even when they are on the same team or location. However, there is no explicit obligation in the target project to record in Gerrit feedback given by means of other forms of communication, such as telephone or informal meetings. Nevertheless, this is very unusual for this project-developers tend to use the available tools to ensure that relevant questions will not be forgotten by the authors. Isolated occurrences thus do not significantly affect the results.
Third, the participation outcome may have been affected due to the automatic addition of reviewers by Gerrit's reviewers-by-blame plugin. The plugin may add, as reviewers, developers that no longer work in the same team or even in the company. Consequently, their participation was not expected. As what matters is the relative comparison of participation for groups of each outcome, this likely has not affected the results. The probability of having reviewers that fall into this category is the same for the different reviews.
Fourth, our survey was conducted after the publication of the shorter version of this work (Witter dos Santos and Nunes 2017). If participants had access to this work before taking part of the survey, they may have been influenced by our previous results. Although we cannot completely confirm that no participant became aware of the work, no results were intentionally disclosed within the company in which the participants work. Moreover, the time gap between the earlier version of this work and the collection of survey data is only of three months.
Finally, our survey was conducted with participants from the same company on which we collected data for the objective study, but from potentially different projects. Moreover, participants of the survey did not necessarily participate of the project that had its data analyzed. However, all participants use the same code review process and the same tools, including the automated reviewers.
External validity Generalizing the results of empirical studies is always an issue, because the collected and analyzed data may not generally represent software projects. Although we focused on a single project, our results are based on a large amount of data of a large project. Therefore, we were able to identify trends and statistically significant results. However, we emphasize that our results are potentially generalizable only for distributed development environments similar to that of our target project. Although geographically distributed, development locations occur in the same country and mostly involve developers of the same nationality. Therefore, further studies should investigate whether our results hold to globally distributed development environments, which may impose additional barriers to MCR, such as different time zones, communication languages, and culture.

Conclusion
Code review is an important static verification technique for improving software quality as well as promotes knowledge sharing within a software project. To identify the scenarios in which code review in fact succeeds, many studies investigated the relationship between different factors and the review outcomes. However, there is limited investigation of the situations in which modern code review is effective in the context of distributed software development when developers and reviewers are spread into geographically distant development locations.
In this paper, we presented the results of a mixed-method study, composed of two parts. In the first part, repository mining, we extracted a large amount of code review information from a software project whose aim is to develop an operating system for embedded systems. This project involves 201 developers, spread into 21 teams located in 4 different cities. We investigated how the patch size (in terms of lines of code), the number of teams, the number of locations and the number of active reviewers influence the duration, reviewer participation and comment density (general and by reviewer) of the review. We found evidence that the duration of the code review is highly affected by all investigated factors-the higher they are, the longer the review process. Similarly, the participation of reviewers is negatively affected in all cases, but mainly by the number of lines of code to be reviewed. The density of review comments is higher when a relatively small patch size is reviewed by other reviewers of teams or locations other than that of the author. The density of review comments per reviewer is positively affected by the number of involved locations and negatively affected by the other factors.
In the second part of the study, we conducted a survey to collect data about the perceived effects of the four investigated influence factors over code review outcomes (duration, participation and total number of comments). We obtained 50 responses from software developers with relevant professional experience in DSD projects with modern code review practices. We found evidence that higher values of the influence factors have similar effects on the analyzed code review outcomes. Duration and participation are negatively affected; the total number of comments is negatively affected by patch size, teams and locations, but is positively affected by the number of active reviewers.
Due to the large amount of data investigated in our study, we could not identify particular occurrences of code review that could help us to make other analyses and further explain our data. Even if this was possible, given that we used data from the past to have complete reviews, developers would potentially not remember specific cases. Our study had, however, gave us insights for future investigations. First, we aim to perform an observational study involving developers and managers that will allow us to verify if our conclusions based on the present study hold. Second, further analyses can be made using code review data. For example, the proportion of votes (vetoes, rejections, approvals and neutral feedback), the influence of the number of contributions as author or reviewer (overall and in the same module or file) and other reviewers' characteristics are interesting issues to be investigated. As the patch size demonstrated to be a prominent influence factor, we also plan to analyze other forms of complexity and effort during code review, assuming that reviewing ten lines added to a complex module requires more effort than reviewing ten lines added to a simple module. It is possible to analyze the influence of other indications of complexity, such as the total number of classes, files and LOC as well as the total cyclomatic complexity.
For modular systems, some influence factors arise from the relations among the modules and from the role of each module. For instance, the number of dependent modules could influence the participation in or the duration of the code review, and critical infrastructure modules might have different code review dynamics when compared to modules that implement user interfaces. Therefore, we plan to analyze the influence of architectural aspects of the modules in the code review.
Finally, we considered many metrics to indicate the effectiveness of the review and aim to investigate whether it is possible to derive a single metric that captures review effectiveness by combining different review outcomes.

Availability of data and materials
The original data used in repository mining phase of this study cannot be disclosed due to a confidentiality agreement; even anonymized or obfuscated data may be used to deduct more than we are allowed to disclose. Regarding the survey with software developers about code review, the answers of each individual participant are not available also due to the term of agreement signed by each participant that participated in the survey. The scripts used for repository mining can be provided upon request to the authors to allow further replications. However, this is conditioned to the approval from the company involved in the study to ensure that no sensible data is accidentally disclosed.