How Different Are Different diff Algorithms in Git? Use --histogram for Code Changes

Automatic identification of the differences between two versions of a file is a common and basic task in several applications of mining code repositories. Git, a version control system, provides a diff utility, and users can select a diff algorithm ranging from the default Myers to the advanced Histogram algorithm. From our systematic mapping, we identified three popular applications of diff in recent studies. Regarding the impact on code churn metrics in 14 Java projects, we obtained different values in 1.7% to 8.2% of commits depending on the diff algorithm. Regarding bug-introducing change identification, between 6.0% and 13.3% of the identified bug-fix commits in 10 Java projects yielded different bug-introducing changes. For patch application, our manual analysis found that Histogram is more suitable than Myers for presenting code changes. Thus, we strongly recommend using the Histogram algorithm when mining Git repositories to analyze differences in source code.


Introduction
The diff utility calculates and displays the differences between two files, and is typically used to investigate the changes between two versions of the same file. Since understanding and measuring changes in software artifacts is essential in empirical software engineering research, diff is commonly used in various topics, such as code churn (Nagappan and Ball, 2005; Shin et al, 2011), code authorship (Rahman and Devanbu, 2011; Meng et al, 2013), process metrics (Hata et al, 2012; Madeyski and Jureczko, 2015; Kamei and Shihab, 2016), clone genealogy (Kim et al, 2005; Duala-Ekoko and Robillard, 2007), and empirical studies of changes (Barr et al, 2014; Ray et al, 2015).
Along with the growth of GitHub, recent studies analyze software changes in Git repositories by using the git command. Git offers four diff algorithms, namely Myers, Minimal, Patience, and Histogram. Without specifying an algorithm, Myers is used as the default. Because these diff algorithms differ, different diff results can be obtained, and this can affect empirical studies. To study this impact, we focus on the Myers and Histogram algorithms and empirically analyze their effect on software engineering research. To the best of our knowledge, empirical comparisons of the diff algorithms in the git diff command have never been undertaken. In this paper, we carry out two sequential analyses: a systematic mapping and empirical comparisons.
For the systematic mapping, we collect papers from three high ranking journals and four top international conference proceedings published from 2014 to 2017. We then map the 51 identified papers along four aspects: frequency of diff algorithms, analyzed software artifact, purpose of mining Git repositories, and data origins. The results of the systematic mapping revealed that the advanced diff algorithms had not been considered in previous studies. In terms of the focus of the git command, 50 out of 51 papers centered on mining code changes. We also found that the purpose of using the git command was to get patches in more than half of the collected papers (54.9%), followed by metrics collection (25.5%) and bug-introducing change identification (the SZZ algorithm) (15.7%). Regarding the dataset, most papers investigated OSS projects (94.1%), while the remaining work analyzed industrial data.
In our empirical analyses, we conduct three comparisons based on the most popular usages of git diff found in our mapping study: collecting metrics, identifying bug introduction, and getting patches. We investigate the disagreement between the two diff algorithms, Myers and Histogram, and manually assess the quality of the diff lists they generate. Based on previous related studies, we investigate the code changes in files from 14 OSS projects that employ Continuous Integration (for metrics collection) and 10 Apache projects (for bug introduction identification) to quantify the differences between the outputs of the two diff algorithms. We analyze the quality of patches derived from Myers and Histogram by manually comparing their diff outputs for 377 changes, a statistically representative sample of the 21,590 changes identified in the above two comparisons. Our findings show that using different diff algorithms in the git diff command produces unequal diff lists. In each CI-Java project, this affects the number of files that have dissimilar added and deleted lines of code. The differences in these added and deleted lines, distinguished by their number and position, range from 0.8% to 6.2% and from 1.4% to 7.6%, respectively. The divergent diff outputs also affect the number of identified files in bug introduction identification: the percentage of files that have different deleted lines of code ranges from 2.4% to 6.6%. Regarding the patch analysis, we found that, in code changes, Histogram is better in 40.3% of files, while Myers is better in 10.9% of files. However, both diff algorithms have equally good quality in generating the lists of non-code changes.
In sum, the contributions of this work are:
- A systematic survey of studies that use diff;
- An analysis of metrics collected from diff outputs produced by Myers and Histogram;
- An analysis of Myers and Histogram outputs in identifying potential bug-introducing changes;
- A manual comparison between Myers and Histogram to investigate their output quality.
The remaining parts of this paper are structured as follows. Section 2 presents a brief explanation of the diff algorithms used in the git command and explains the differences between the two diff algorithms in generating the list of changes. Section 3 describes how we conducted the systematic mapping study and presents the results of the survey. Section 4 gives an overview of the comparisons and research questions. Sections 5, 6 and 7 report our procedures and discuss the results of the three comparison studies, namely collecting metrics, identifying bug introduction, and getting patches, respectively. In Section 8, we discuss threats to validity, and finally we conclude in Section 9.
We have made the datasets used in this paper publicly available on the Web 1 .

Diff Algorithms in Git
Diff is an automatic comparison program used to find the differences (including insertions, deletions, renamings, and movements) between the older and the newer version of the same file. The diff utility extracts changes line by line from one file relative to the other and reports them in a list. The core of the diff operation is the longest common subsequence (LCS) problem; Hunt and McIlroy (1976) built an efficient solution, in both time and space, for the kinds of input that appear in software archives. Since its first appearance on the Unix operating system in the 1970s, the diff command has been widely used in many studies. The git diff command has numerous options for extracting code changes 2 , including extracting changes related to the index and commits, paths on a filesystem, or the original contents of objects, and even quantifying the number of changes for each object. Researchers and practitioners can use these options depending on their data extraction needs, not to mention the choice of diff algorithm. The essence of a diff algorithm is to contrast two sequences and obtain the transformation of the first into the second through a series of ordered deletions and insertions; a subsequence is flagged as a change if a deletion and an insertion occur in the same area. The diff algorithm can be selected with the option --diff-algorithm=<algorithm>.
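The LCS-based core of the diff operation can be sketched as follows. This is a minimal dynamic-programming illustration of the idea, not Git's actual implementation; the algorithms discussed below use more efficient strategies.

```python
def lcs_diff(a, b):
    """Minimal LCS-based diff over two line sequences: returns a list of
    ('-', line) deletions, ('+', line) insertions, and (' ', line) matches."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table, emitting deletions and insertions between LCS matches.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append((' ', a[i])); i += 1; j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            out.append(('-', a[i])); i += 1
        else:
            out.append(('+', b[j])); j += 1
    out.extend(('-', line) for line in a[i:])
    out.extend(('+', line) for line in b[j:])
    return out
```

Every algorithm below solves this same problem; they differ in which common lines they choose to anchor on, which is exactly what makes their outputs diverge.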
Git provides four diff algorithms, namely Myers, Minimal, Patience, and Histogram, which are used to obtain the differences between two versions of the same file located in two different commits. The Minimal and Histogram algorithms are improved versions of Myers and Patience, respectively. Each algorithm has its own procedure for finding the items present in the original file but absent in the second one and vice versa; as a consequence, different outputs may be produced. Because the basic ideas of the Minimal and Histogram algorithms are similar to those of their precursors, in this paper we contrast only the two diff algorithms Myers and Histogram.

Myers
The Myers algorithm was developed by Myers (1986) and is the default algorithm in the git diff command. It recursively finds the shortest edit script by tracing the longest sequences that are identical in both versions. Since Myers matches only subsequences that are actually equal in both files, the comparison of the preceding and following subsequences is executed repeatedly for all the remaining sequences. Figure 1 shows code changes from the first to the second version of the same file (GuiCommonElements.java) taken from the Openmicroscopy project 3 . As can be seen in the figure, the code between lines 673 and 689 in the first version was transformed into the newer version between lines 673 and 693. Inside the method block, there is an insertion of one new if block (lines 676-680 in Version 2) and a modification of another if block (lines 678-685 in Version 1 changed into lines 684-690 in Version 2). If we run the diff command for Figure 1, Myers produces the diff list illustrated in Figure 2.
The Minimal algorithm is an extended version of Myers. Its procedure for finding the changes between two objects resembles Myers, but extra effort is made to keep the patch size as small as possible 4 . As a result, the diff lists created by this algorithm are often identical to those of Myers. If we apply the Minimal algorithm to the code in Figure 1, the diff output is the one shown in Figure 2 as well.
A major limitation of the Myers algorithm is that it frequently matches blank lines or lines containing only parentheses, instead of matching lines that are unique, such as a function declaration or an assignment. Consequently, Myers sometimes produces unclear diff lists that do not describe the actual code changes: the changed code and the code that replaces it are often written far apart on inappropriate lines, or located separately on a line that does not represent the modification. Additionally, there is occasionally a conflict in the identification of changed code; for example, the code in lines 4 and 15 in Figure 2. In fact, these lines were derived from the same unique line, which was unmodified. Using the Myers algorithm, this unique line is detected as changed code even though it was not altered. This can cause misidentification of a code change.

Histogram
The Histogram algorithm is an enhanced version of Patience, which was created by Bram Cohen, renowned as the BitTorrent developer 5 . It supports low-occurrence common elements, which are used to improve efficiency. Histogram was initially built in JGit 6 and was introduced in git 1.7.7 7 .
Patience marks the important lines within the text by focusing on lines that have the smallest number of occurrences but are essential. This automated diff procedure is also LCS-based, but it uses a different technique: Patience considers only the longest common subsequence of the marked lines, obtained from lines that appear uniquely within a specific range and are written identically in both files. This implies that lines containing a single bracket or a newline are usually disregarded, whereas distinctive lines such as a function definition are retained.
The Histogram strategy works similarly to Patience by building a histogram of the occurrences of every line in the first version of a file. Every element in the second version is then matched against the first sequence in order, to find the existing elements and count their occurrences. If an element exists and occurs no more often than in the first sequence, it is considered a potential LCS element. Once this scan of the second sequence is finished, the common element with the lowest occurrence count is marked as the separator. The two sections resulting from the partition (section 1 is the area before the separator, section 2 the region after it) are then processed recursively using the same procedure. This means that Histogram behaves like Patience when a unique common element exists in both files; otherwise, it selects the element with the fewest occurrences. Nevertheless, in comparison with the other two diff algorithms (Myers and Patience), Histogram is reported to be much faster 8 .
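The anchoring idea can be sketched as follows. This is a simplified illustration, not JGit's actual implementation (which, among other things, caps the occurrence counts it will consider for efficiency):

```python
from collections import Counter

def histogram_diff(a, b):
    """Sketch of the Histogram strategy: recursively split on the common
    line with the fewest occurrences, then diff each side the same way."""
    if not a or not b:
        return [('-', x) for x in a] + [('+', x) for x in b]
    # Histogram of every line in the first sequence.
    counts = Counter(a)
    # Common lines, in order of first appearance in the second sequence.
    common = [line for line in dict.fromkeys(b) if line in counts]
    if not common:
        return [('-', x) for x in a] + [('+', x) for x in b]
    # Rarest common line acts as the separator (the "unique" anchor).
    anchor = min(common, key=lambda line: counts[line] + b.count(line))
    i, j = a.index(anchor), b.index(anchor)
    return (histogram_diff(a[:i], b[:j])
            + [(' ', anchor)]
            + histogram_diff(a[i + 1:], b[j + 1:]))
```

Because rare lines (function headers, assignments) win over frequent ones (blank lines, lone braces), the recursion tends to align on semantically meaningful code, which is the behavior contrasted with Myers below.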
In contrast to Myers, the Histogram algorithm provides diff results that are easier for miners of software archives to understand, as Histogram separates the changed code lines more clearly. The algorithm uses a unique line of code as a benchmark to match the sequences of changed lines between the two files. This reduces the occurrence of conflicts (i.e., a line of unchanged code identified as changed, so that the diff list writes this code twice, as both deleted and inserted code). For example, if we extract the differences between the two versions of the same file in Figure 1 using Histogram in the git diff command, we obtain the output depicted in Figure 3. The unique line of code in line 10 of Figure 3 is not detected as changed code because of its role as the benchmark for matching lines, whereas Myers identifies this line as changed. This influences the sequences of the other changed code. The additional if block is written between lines 4 and 9, where it should be placed; this block is clearly understood as new code inserted before the assignment statement (the code in line 10, used as one of the unique lines for matching). It is also obvious that the code between lines 12 and 16 was replaced by the single line of code in line 17, that the closing curly brace in line 20 was removed from the file, and that three new lines of code (lines 23, 24 and 25) were added at the end of the code in Figure 3.

Systematic Mapping: How Previous Studies Used Git Diff?
To understand how previous studies used diff, we conducted a systematic mapping of papers that used the git diff command. As described by Petersen et al (2008), a systematic mapping study can provide and visualize statistical insight into a study domain by classifying and quantifying the publications related to the research interest within that domain. The main activity of the method is searching the relevant literature across a wide range of publications, including journal articles, books, documented archives, and scripts.
We performed a systematic mapping because we intend to: (i) draw an overview of the research area through quantification in a structured way (Kuhrmann et al, 2017), and (ii) confirm the knowledge in the currently published studies (Petersen et al, 2015). A systematic mapping is reliable because the findings are repeatable and consistent over time (Wohlin et al, 2013), and it supports better reporting of the empirical findings of the primary studies (Budgen et al, 2008).
To understand how recent studies used git diff, we prepared the following research questions for this systematic mapping.
- Which diff algorithm is used?
- What kind of software artifact is analyzed, code or other documents?
- What is the purpose of using diff?
- Where does the data source come from, OSS or industry?
3.1 Procedure

Figure 4 illustrates an overview of our systematic mapping procedure, which is divided into an initial stage and an advanced stage. The first stage has three steps: digital library selection, paper collection, and search string definition and initial search execution. The second stage begins with repeated manual exclusion, narrowing the search terms and reading the full papers, followed by paper classification and statistical analyses.
Step 1: Digital Libraries Selection. The selection of appropriate literature is essential to guarantee high-quality papers and to grasp the state-of-the-art issues in the software engineering field (Kavitha, 2009). We specifically targeted papers published in high ranking journals and conference proceedings of the software engineering area. To maximize the probability of finding highly relevant, good quality articles, we used three specific digital resources: ACM Digital Library 9 , IEEE Xplore 10 , and SpringerLink 11 . Table 1 shows the list of the publication sources used in our survey, including their impact factors (IF) 12 and their rankings in the 2018 CORE Rankings 13 . We gathered papers published in these three digital sources between 2014 and 2017.

Fig. 4: Design of the Survey Procedure

Fig. 5: Number of collected papers from each source
Step 2: Papers Collection. To reduce bias in the context of the study, we only collected technical papers. Papers that did not meet our criteria (i.e., papers shorter than 10 pages, editorials, panels, poster sessions, and opinions) were excluded. As depicted in Figure 5, by applying our criteria we sourced 1,801 papers in total from the three digital sources over a 4-year time span.
Step 3: Search String Definition and Execution. In this step, we formulated search keywords to filter the targeted papers down to works that use the git diff command. We defined three search terms related to the command, namely git, log and diff. Papers that contained one of the three words as an exact match, without prefixes or suffixes (thus excluding, e.g., github, blog, logarithm, logging, different, and difficult), were collected. The git log command was also targeted because it can produce diff output with specific options. Using these search terms, all papers extracted from the databases were manually scanned in full text, and only works containing these search strings were included. As a result of Step 3, we identified 108 papers.
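The exact-match filter above amounts to whole-word matching. A sketch of the idea (our illustration; the paper does not specify the tooling used):

```python
import re

# Whole-word patterns for the three search terms; the \b word boundaries
# prevent matches inside words such as "github", "logarithm" or "different".
TERMS = [re.compile(r'\b%s\b' % t) for t in ('git', 'log', 'diff')]

def matches_search_terms(text):
    """True if the text contains 'git', 'log', or 'diff' as exact words."""
    return any(p.search(text) for p in TERMS)
```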
Step 4: Full Text Reading. To ensure the collected studies are relevant to our objectives, we then performed a full text reading of the papers. This process was undertaken by the first and second authors to avoid ambiguity and to separate the primary studies more exhaustively based on their contents. We applied the inclusion and exclusion criteria described in Table 2 to the full papers. Papers that fit the inclusion criteria were kept for further processing, while papers that met the exclusion criteria were excluded from the study. After this step, 51 papers remained.

Which Diff Algorithm Is Used?
Across the 51 primary studies, we examined which diff algorithms were applied in the command when extracting changes. Of particular note, even though most instructions applied different options in the use of the git command to extract the required data, none of the selected works considered different diff algorithms. This shows that all of the collected studies used Myers, the default algorithm.

To understand which components were extracted using the git command in the previous studies, two main targets emerged as our parameters to classify the documents, namely code changes and license changes, as depicted in Figure 7. As can be seen in the figure, code changes were by far the dominant focus when researchers mined software repositories using the git command over the four years.

What Is the Purpose of Using Diff?
By reading the papers manually, we summarized the purposes of extracting software development records and grouped them into five categories, as shown in Figure 8.
From the figure, we see that the most common purpose is to get patches, with as many as 28 studies, followed by collecting metrics and identifying bug introductions, which covered 13 and 8 studies, respectively. A few studies addressed authorship identification and merge investigation. This finding motivated us to carry out a further investigation of the impact of different diff algorithms on the extraction of added and deleted lines for metrics collection, bug-introducing change identification, and getting patches.

Where Does the Data Source Come From?
Our intention is to provide a comprehensive understanding of the different outcomes generated by different diff algorithms; thus, we need to run a set of tests of the algorithms' implementations in the git diff command. The survey results confirm that studies conducted between 2014 and 2017 did not use different diff algorithms to extract the differences between two versions of the same file. In mining the diff lists, they applied the standard commands, using the default diff algorithm with some additional options but without considering other diff algorithms. We also found that the information most sought after in prior studies was code changes in open source projects. The code changes were mostly used to count the number of changed lines and record them as metrics, to locate the origin of a bug with a specific method (the SZZ algorithm), and to analyze patches. The results of these analyses clearly rely on the diff records produced by the diff algorithm applied in the git diff command. Thus, different diff algorithms for extracting changed lines of code might change the final results of a study, and its conclusions as well.

Overview of Comparisons and Research Questions
The findings from our systematic mapping revealed the three most common purposes for using the git diff command. This encouraged us to undertake comparison analyses between the Myers and Histogram algorithms in three applications: metrics, the SZZ algorithm, and patches. Our intention is to investigate the level of difference between the two diff algorithms in these three applications and the possibility of this affecting the results of studies. To achieve these goals, we address the following research questions:

RQ 1 : Can the values of diff-related metrics become different because of different diff algorithms?

For metrics (Section 5), equal and unequal changed lines in the files identified by the two diff algorithms were calculated based on two factors: the quantity and the position of the changed lines of code. We then compared the number of files that have the same and different added and deleted lines of code to understand the significance of the differences between the two algorithms in providing the diff records.

RQ 2 : Are the results of bug-introducing change identification different because of different diff algorithms?
The result of locating bug-introducing changes using the SZZ algorithm relies on the diff results. In Section 6, we applied the Myers and Histogram algorithms in the git diff command to examine whether the diff lists affect the results of bug-introducing change identification.

RQ 3 : Which diff algorithm is better at generating a good diff?
Lastly, we compared the quality of the identified patches manually. In Section 7, we investigate 377 changes, a statistically representative sample of the 21,590 changes identified in the above two comparisons.
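The sample size of 377 can be reproduced with Cochran's formula and a finite-population correction, assuming the conventional 95% confidence level and 5% margin of error (an assumption on our part; the parameters are not stated here):

```python
import math

def representative_sample(population, z=1.96, e=0.05, p=0.5):
    """Cochran's sample-size formula with finite-population correction.
    z: z-score for the confidence level (1.96 for 95%);
    e: margin of error; p: expected proportion (0.5 is most conservative)."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)  # infinite-population sample size
    return int(round(n0 / (1 + (n0 - 1) / population)))
```

With these parameters, representative_sample(21590) evaluates to 377, matching the sample studied in Section 7.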
5 Comparison: Metrics (RQ 1 )

RQ 1 : Can the values of diff-related metrics become different because of different diff algorithms?

Analysis Design
As illustrated in Figure 10, we investigate the following two basic diff-related metrics with the two diff algorithms Myers and Histogram.

NLA: The number of added lines in a file.
NLD: The number of deleted lines in a file.
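Both metrics can be derived from unified diff output by counting added and deleted lines; a simple sketch (git diff --numstat reports the same per-file numbers directly):

```python
def churn_metrics(unified_diff):
    """Return (NLA, NLD) for one file's unified-diff text: count lines
    starting with '+'/'-', excluding the '+++'/'---' file headers."""
    nla = nld = 0
    for line in unified_diff.splitlines():
        if line.startswith('+') and not line.startswith('+++'):
            nla += 1
        elif line.startswith('-') and not line.startswith('---'):
            nld += 1
    return nla, nld
```

Note that these counts can coincide for two algorithms even when the positions of the counted lines differ, which is exactly the insensitivity discussed below.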
For our empirical analysis, we collected the Git repositories of 14 projects used in the previous study (Rausch et al, 2017), which are identified in our systematic mapping as a study utilizing git for collecting metrics. The targeted 14 projects are OSS that employ Continuous Integration (CI) and are written in Java. The descriptions of the projects and the number of commits in the master branches are shown in Table 3.
We investigated all modified files in all commits on the master branches. We considered the results the same if the values of both NLA and NLD were the same with the two algorithms; otherwise, the results were considered different. We also investigated the agreement of the identified change locations. File-level and commit-level results are discussed to see how different results can appear at different granularities. Table 4 summarizes the result of the comparison between the two diff algorithms in the 14 projects. From the total number of modified files identified by both algorithms, we counted the number of files in each commit that have the same or different values of the NLA and NLD metrics. Similarly, the numbers of same and different results in change locations are shown in the table. We see that the percentages of different metric values are between 0.8% and 6.2%. Considering the different results in change locations, ranging from 1.4% to 7.6%, we find that in quite a few cases the metric values are the same even though the identified locations differ.

Results
To further explore the disagreements between Myers and Histogram, we calculated the number of commits influenced by differences in the number or location of code changes in the diff outputs of files. In each project, we counted the files that have the same and different quantities and positions of inserted and removed lines in each commit across the project. A single commit may contain more than one modified file. If a commit contained at least one file with unequal changed lines of code, either in their number or their location, we classified the commit as 'different'. On the other hand, if all files in a commit had identical changed lines, we categorized the commit in the 'same' class. In this process, we only consider the files that have an unequal number or location of changed lines. Since several differing files can belong to the same commit, we grouped such files into a single commit. We then summarized the percentage of commits that have a different number and position of changed lines of code resulting from the use of the Myers and Histogram algorithms in the git diff command, as described in Table 5.
In general, our comparisons revealed that data extraction using the two diff algorithms produced identical diff lists for most files in all commits. However, even though the output was dominated by the same results for each file in a commit, the diff outputs from Myers and Histogram included several files with different added and deleted lines. These disagreements led to different numbers of commits containing files with differing changed lines. The level of difference in the number of commits affected by the number of changed lines is relatively high, ranging from 1.7% to 8.2%, while unequal line locations affect between 2.8% and 13.9% of commits.

Summary
The finding from the metrics comparison provides clear evidence that using different diff algorithms can produce different diff lists. Since the metrics are insensitive to differences in change locations, the same values can be obtained even when the identified change locations differ. However, we see that different metric values were obtained in 0.8% to 6.2% of cases at the file level and 1.7% to 8.2% at the commit level. These differences can have impacts on studies using diff-related metrics.

Fig. 11: SZZ: Locating bug-introducing changes
6 Comparison: SZZ Algorithm (RQ 2 )

RQ 2 : Are the results of bug-introducing change identification different because of different diff algorithms?

SZZ Algorithm
The SZZ algorithm proposed by Śliwerski et al (2005) is an approach to identify bug-introducing changes. SZZ uses a bug-tracking system (e.g., Bugzilla) as the reference to link bug reports to archived versions of the software (e.g., in CVS). Figure 11 depicts the basic idea of the SZZ algorithm.
The SZZ algorithm first identifies bug-fixing commits by searching for bug report identity numbers (bug IDs) in log messages, which developers write when they fix bugs. The commit ID of a bug-fixing commit is subsequently used to track the previous commit (parent commit). The code changes are extracted by applying diff to find the differences between the older version of a file in the parent commit and the newer version of the same file in the bug-fix commit. The identified deleted lines are considered candidates for bug-related lines. To identify bug-introducing commits, the cvs annotate command is used to find out when the lines were added. Among the candidate bug-related lines, lines that were created before the bug reporting time are considered validated bug-related lines. The commits that introduced those validated bug-related lines are identified as bug-introducing commits.
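The first step, flagging candidate bug-fixing commits by keywords in log messages, can be sketched as follows. The keyword list is the one quoted in Section 6 from Śliwerski et al (2005); the exact matching rules of the various SZZ variants differ, so this is only an illustration:

```python
import re

# Keywords used to flag candidate bug-fixing commits in log messages.
BUG_KEYWORDS = re.compile(r'\b(bug|fix|defect|patch)\b', re.IGNORECASE)

def is_bugfix_candidate(log_message):
    """True if a commit log message contains one of the bug-fix keywords."""
    return BUG_KEYWORDS.search(log_message) is not None
```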
A study undertaken by da Costa et al (2017) evaluated the output of five SZZ procedures in discovering the bug-introducing changes. The study on 10 Apache projects used three criteria for measurement, i.e. (1) the disagreement ratio of SZZ on identifying the appearance of the first bug, (2) the quantity of the prospective bugs and the number of days between the first and the last bug caused by the same bug-introducing changes, and (3) the difference in the days between the earliest and the latest bug-introducing changes to ascertain the root cause of an issue. The results of the investigation showed that the proposed SZZ improvements from the prior work (Davies et al, 2014) leads to an increase in the misidentification of bug-introducing changes. The study by da Costa et al (2017) also recognized that many upcoming bugs may be induced by only one bug-introducing change. In addition, the authors reported that most defects were caused by bug-introducing changes that are separated by more than a twelve-month period.
The SZZ algorithm has also been studied by Rodriguez-Perez et al (2018). The authors conducted a literature review of published articles that focus on the SZZ algorithm's functionality and its reproducibility. The study also investigated the impact of the SZZ improvements over time. Through an elimination process, the authors reduced the number of publications to 187 papers. The results show that the SZZ algorithm plays an important role in the software engineering research area: the number of citations has kept increasing since 2005, and the research has frequently been disseminated in high-quality publications. The study also found that the drawbacks of SZZ were discussed in almost half of the selected publications. However, only a few papers provide detailed descriptions or offer replication material that makes the study reproducible. Additionally, more than 30% of the analyzed works still use the first version of SZZ developed by Śliwerski et al (2005).

For our empirical analysis, we studied 10 open source Apache projects used in the previous study (da Costa et al, 2017), which was identified in our systematic mapping as a study using Git to identify bug introduction via the SZZ algorithm. The descriptions of the projects and the number of commits on the master branches are shown in Table 6. We analyzed the impact of using different diff algorithms on the original SZZ algorithm, studying the disagreement between Myers and Histogram in the diff-based results of the SZZ algorithm. Figure 12 describes the validation process of our analysis. First, bug report IDs in the commit messages are searched with specific keywords (i.e., "bug", "fix", "defect", and "patch" (Śliwerski et al, 2005)); the identified commits are then marked as candidate bug-fixing commits. In each candidate bug-fixing commit, we focus on the modified files.
Table 7: Summary of valid bug-related lines, valid files, valid bug-introducing commits, and valid bug-fix commits resulting from Myers and Histogram

The two diff algorithms are used to identify deleted lines using the command: git diff -w --ignore-blank-lines --diff-algorithm=<algorithm> <parent commit ID> <bug-fix candidate commit ID>. By fetching files in the parent commit ID, we subsequently applied the git blame command (similar to cvs annotate) to locate the origin of the deleted lines. Those deleted lines are considered to be candidates of bug-related lines.
Similar to the procedure of da Costa et al (2017), the next step is to find the software versions affected by a bug. We extract bug reports and their affected versions from the JIRA issue tracking system 14 . If a single bug ID affects more than one version, the earliest version is chosen, since the SZZ algorithm targets the initial appearance of a bug. From the collection of affected versions, we compare the introduction dates of the candidate bug-related lines with the release dates of the versions. If the release dates of the affected versions are later than the introduction dates of the candidate bug-related lines, we classify them as valid bug-related lines; otherwise, we classify them as invalid.
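This date-based validation rule can be written compactly. The version names and dates below are hypothetical; the rule itself follows the procedure described above:

```python
from datetime import date

# Hypothetical release dates for the versions affected by one bug report.
affected_versions = {"1.2.0": date(2015, 6, 1), "1.3.0": date(2016, 1, 15)}

# SZZ targets the initial appearance of a bug, so the earliest affected
# version is the reference release.
earliest_release = min(affected_versions.values())

def is_valid_bug_related_line(introduced_on):
    """A candidate line is valid only if it was introduced before the
    release of the earliest affected version."""
    return introduced_on < earliest_release
```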
With these sets of valid bug-related lines, we validate bug-introducing commits, bug-related files, and bug-fixing commits. The validation processes are performed in the opposite direction to the above procedure. A valid bug-introducing commit is a commit that initially adds valid bug-related lines. Files containing valid bug-related lines are considered to be valid bug-related files. From the candidates of bug-fixing commits, if there is at least one valid associated bug-introducing commit, we consider the candidate bug-fixing commit to be valid; otherwise, it is invalid. Table 7 presents the outputs of the Myers and Histogram algorithms in the numbers of valid bug-related lines, files, bug-introducing commits, and bug-fix commits. The two algorithms produced a different number of valid bug-related lines in all 10 projects, which then led to different numbers of files, bug-introducing commits, and bug-fix commits. Similar to the analysis of metrics in Section 5, the differences in the quantities of changes are relatively small or nonexistent for some projects, because these counts are insensitive to change locations.
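The backward propagation of validity, from lines to files, introducing commits, and finally fixing commits, can be sketched as follows. The blame mapping and commit IDs are hypothetical placeholders:

```python
# Hypothetical mapping produced by git blame: each candidate bug-related
# line is attributed to the commit that introduced it, together with the
# result of the date-based validation.
blame = {
    ("Foo.java", 42): ("c1", True),   # (introducing commit, line is valid)
    ("Foo.java", 57): ("c2", False),
    ("Bar.java", 10): ("c1", True),
}

valid_lines = {key for key, (_, valid) in blame.items() if valid}
valid_files = {path for path, _ in valid_lines}          # valid bug-related files
valid_introducing = {blame[key][0] for key in valid_lines}

def bug_fix_commit_is_valid(associated_introducing_commits):
    """A candidate bug-fixing commit is valid if at least one of its
    associated bug-introducing commits is valid."""
    return any(c in valid_introducing for c in associated_introducing_commits)
```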

Results
Since investigating the locations of bug introduction is also important, we perform a comparison of files that have the same and different locations of bug-related lines. Table 8 shows this result. The proportion of files in which the changed code is reported at different locations is considerable in each project, ranging from 2.4% to 6.6%. This means that some files can contain suspicious bug-related lines solely because of the choice of diff algorithm.
Bringing these data into further analysis, we then summarized the number of valid bug-fixing commits. As shown in Figure 13, all studied projects have a different number of valid bug-fixing commits caused by the different positions of valid bug-related lines resulting from Myers and Histogram. The percentage of differing results is between 6.0% and 13.3%, or 9.7% on average. This analysis found evidence that nearly 10% of bug-fixing commits do not guarantee success in locating bug-introducing changes, since some deleted lines suspected as candidate bug-introducing changes differ when different diff algorithms are applied in the git diff command. This is because a valid bug-related line in a file may be identified by one diff algorithm but remain undetected by the other.
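One simple way to quantify such disagreement is the share of commits identified by either algorithm on which the two results differ. The commit sets below are hypothetical, and the paper's exact counting of differing bug-introducing results may be more involved:

```python
# Hypothetical sets of valid bug-fixing commits found with each algorithm.
myers_fixes = {"a1", "a2", "a3", "a4", "a5"}
histogram_fixes = {"a1", "a2", "a3", "a6"}

# Commits on which the two algorithms disagree, as a share of all
# commits identified by either algorithm.
disagreement = myers_fixes ^ histogram_fixes
disagreement_pct = 100 * len(disagreement) / len(myers_fixes | histogram_fixes)
```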

Summary
The results from the SZZ algorithm confirm that different diff algorithms can generate different results, ranging from 6.0% to 13.3% of the identified bug-fix commits. Myers and Histogram sometimes produced a different number and location of the deleted lines (bug-related lines) in several files. These differences affected the number of files with disagreeing bug-related lines, the number of bug-introducing commits, and the number of bug-fixing commits that actually have bug-containing files. This indicates that prior studies that used the SZZ algorithm to locate bugs may have produced inaccurate analyses.
7 Comparison: Patches (RQ 3 )

RQ 3 : Which diff algorithm is better in generating a good diff?

Analysis Design
From the previous two comparisons, we showed that different diff algorithms can have impacts on the results of metrics collection and bug-introduction identification (SZZ algorithm). In this section, we analyze the quality of diff to clarify which results are appropriate and which algorithm we should use. For this analysis, we used the same dataset that had been used in Section 5 and Section 6, shown in Table 9. From the CI-Java projects, we considered all modified files in all commit IDs to be targeted, while for the Apache projects, files changed in all bug-fix commit candidates are targeted. In each project of the first group, we analyzed the files that have different locations of the inserted and removed lines from the execution of the two diff strategies, while in the second group, only the files that have a different location of the deleted lines were analyzed. We divided the comparison into two categories: (i) in-code diff and (ii) in-non-code diff. The first category means that the differing diff lists generated by the two algorithms are lines of code or a block of code in a source code file. The second category implies that the disagreement between the two algorithms concerns something other than a line of code, for example a change of comments, or a change in a non-code file, such as a modification in a text file.

Table 10: Categories to specify the comparison result between two diff algorithms
Histogram: The output of Histogram is better.
Myers: The output of Myers is better.
Both: The outputs of both algorithms have small differences and both of them are reasonably good.
Both partially: One algorithm is better in some parts of the output and the other algorithm is better in the other parts of the output.
None: None of the algorithms produce good output.
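A rough automatic approximation of the in-code versus in-non-code split might look as follows. This is a heuristic sketch of our own; the classification in the study itself was done manually:

```python
import os

CODE_EXTENSIONS = {".java"}  # the studied projects are Java

def diff_category(path, changed_lines):
    """Classify a differing diff as 'in-code' or 'in-non-code'.

    In-code: the file is source code and at least one changed line is
    actual code; comment-only changes and non-code files (e.g. text
    files) fall into in-non-code.
    """
    if os.path.splitext(path)[1] not in CODE_EXTENSIONS:
        return "in-non-code"
    for line in changed_lines:
        stripped = line.strip()
        # Crude Java comment detection; block comments spanning lines
        # would need a real parser.
        if stripped and not stripped.startswith(("//", "/*", "*")):
            return "in-code"
    return "in-non-code"
```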
Qualitative analysis of the two diff algorithms was performed manually by the first two authors in multiple steps. Initially, the first author made a list of all files from the two project groups. From this list, the sample size was computed using the tool provided in a survey system 15 to obtain a statistically representative sample of the files in each project, so that the conclusions about the quality of the diff algorithms would generalize to all files in all projects with a confidence level of 95% and a confidence interval of 5. As can be seen in Table 9, the total number of files summarized from all project groups is 21,590. From this population, we selected a random sample of 377 files. In the second step, the first two authors of this paper independently compared the diff outputs of the Myers and Histogram algorithms for the top 30 files in the sample. We defined five categories to specify the comparison result between the two diff algorithms, as described in Table 10. The comparison results of the two authors on these 30 files were then used to compute the kappa agreement 16 . We obtained 76.67%, which is categorized as 'substantial agreement' (Viera and Garrett, 2005). Based on this agreement, the remaining sample data were investigated by the first author.

Table 11 shows how well both diff algorithms work in presenting the changes of code. Histogram outnumbered the other results in the in-Code diff category, which emphasizes that this algorithm is substantially better at differentiating changes of code. Figure 14 shows how the Histogram algorithm provides better output of code changes compared with Myers. We extracted the diff from the file AmqpMessage.java 17 in commit f56ea45e5 of the ActiveMQ project. In this example, the two algorithms produced different results. Neither algorithm is incorrect in describing the changes.
However, the Histogram algorithm provides a reasonable diff output that better describes the human change intention: the if-statement is moved to a new method and a new method call is added. From the result of Myers, in contrast, it is not clear how the developer changed the code. Lines that were not modified were identified as removed from their original positions (lines 18 and 19) and added at new positions (lines 6 and 7).
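The sample of 377 files used above is consistent with Cochran's sample-size formula with a finite-population correction. This is our own reconstruction; the survey tool 15 may round slightly differently:

```python
def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's formula with finite-population correction.

    z: z-score for a 95% confidence level; margin: confidence interval
    of 5 (i.e. 5%); p: assumed population proportion (0.5 is the most
    conservative choice).
    """
    n0 = z * z * p * (1 - p) / (margin * margin)
    return n0 / (1 + (n0 - 1) / population)

# 21,590 files across all project groups, as reported in Table 9.
n = sample_size(21590)  # about 377
```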

Results
This manual investigation also highlighted that the Myers and Histogram algorithms have almost the same ability to extract diffs from non-code changes. As shown in Table 11, their percentages are nearly equal in the in-Non-Code diff category (4.8% of files are better with Histogram and 5.3% are better with Myers). This is further strengthened by the high percentage of files, 23.1%, for which both diff algorithms produced output of the same quality (see the example in Figure 15, a modification of a license header in which both algorithms identify the same changes). This quantification reveals that we can use either of these algorithms to produce diffs of non-code changes.

Due to the different procedures of Myers and Histogram in identifying the changed lines of code, they can generate different diff results. Our manual comparison found that their differences concerned the number of changes, the order of the changed lines, or even the detected added and deleted code.
They certainly affect the readability of the diff outputs, in other words, the quality of the diff results produced by the two diff algorithms were different. Importantly, our results provide evidence that Histogram frequently produced better diff results compared to Myers in extracting the differences in source code.

Threats to Validity
Threats to the construct validity appear in the SZZ application. We used a small number of keywords to detect commit messages that describe fixing bugs, which limited our ability to extract all potential candidate bug-fixing commits. Conversely, commits that should not be identified as bug-fixing commits could also be collected, as long as they included the keywords in their log messages. However, since our focus is to investigate the level of difference between the diff lists produced by Myers and Histogram, the impact of such incorrect commits on the study result is small. Threats to the external validity emerge regarding the repositories used in our experiments. Although we analyzed 24 OSS Java projects mined from Git repositories, we cannot generalize our study results to other open source projects or to industry.
To reduce the threats to reliability, we make our dataset publicly available. We provide lists of our collected files identified by the Myers and Histogram algorithms which were used in the three empirical analyses (available on GitHub 19 ).

Conclusion
To understand the impact of using different diff algorithms, Myers and Histogram, we first clarified applications of diff by conducting a systematic mapping of papers published between 2014 and 2017. We then empirically analyzed the impact in three major applications: (i) code churn metrics, (ii) the SZZ algorithm, and (iii) patch extraction.
Our quantitative analyses have shown that different diff algorithms can report different amounts of changed lines and identify different change locations. Our qualitative investigation revealed that Histogram is better for describing code changes. Since diff is a fundamental tool for various software engineering tasks, considering the limitations and advantages of each algorithm is important.