1 Introduction

Software development teams use issue-tracking systems (ITS), or issue trackers, to manage and maintain software products and their requirements. Popular ITS in practice are Bugzilla, GitHub Issues, and JIRA, with common features like issue creation and tracking, communication around issues, release planning, issue triaging, and issue dependency management. As a central entity, issues have many properties, including their type [1] (e.g., requirement, user story, feature request, or bug report), priority (e.g., low, high, critical), and assignee. ITS and issue management in general have been the focus of software engineering and requirements engineering research for over a decade, with promising automation approaches, for instance, on issue type prediction (and correction) [2,3,4,5,6], priority (and escalation) prediction [7,8,9,10], or triaging and assignment [11, 12].

Issues are often interconnected via links [1, 13], which enable stakeholders to capture the relationships between the issues and to structure the overall project knowledge [14]. These links can have different types depending on the issue tracker and the project. Popular link types are Relate for capturing a general relation; Subtask and Epic for capturing issue hierarchies; as well as Depend or Block links for capturing causal or workflow dependencies. Also, Duplicate links are particularly popular in open-source projects, where many stakeholders and users independently report issues. This specific link type has attracted much attention from software engineering research in recent years [15,16,17,18].

In general, research has found that linking helps reduce issue resolution time [13] and prevent software defects [19]. Missing or incorrect links can be particularly problematic for requirements analysis and release planning [20]. For instance, a missing Depend or Block link to an issue assigned to a specific release might be crucial for that release. Similarly, missing Duplicate links might lead to missing additional context information [21], which can be particularly relevant for reproducing and resolving the issue.

With the evolution of software products over time, a project can quickly accumulate thousands of issues, any of which could be linked to newly submitted issues. Correctly identifying and connecting issues quickly becomes difficult, time-consuming, and error-prone [22, 23]. This is even worse in popular issue trackers like those of Apache, RedHat, or Qt, which are open to users and other stakeholders. The problem becomes more complex still when link types are taken into account: the issue creator or maintainer not only has to decide whether a link between two issues exists but also what the correct link type is.

Our work aims at paving the way for link (type) recommenders that support stakeholders in issue linking and issue management in general. This paper extends our work published at RE22 [24] by adding an in-depth analysis of linking practices, repository structures, and inconsistencies in issue linking to further understand the intricacies of linking in practice. We extend our machine learning experiments by evaluating four prediction strategies. We rerun all experiments and extend the correlation analysis with characteristics extracted from the repositories. Our contribution is threefold. First, we compare the characteristics of the various repositories and link types, and how linking is used in practice. Second, we compare multiple state-of-the-art, end-to-end deep-learning models to predict and classify issue links in different settings. Third, we investigate potential correlations between the model performance and the characteristics of the repositories and issues.

There are barely any limitations to the applicability of our model, as the link types are not predefined but learned from data. Our evaluation shows that a BERT model using the title and description of the two involved issues is viable for classifying typed links with a macro F1-score of 0.63. The results are encouraging, particularly since up to 13 link types are predicted. Restricting the prediction to the mere absence or existence of a link improves the macro F1-score to 0.95. The results reveal insights into link type prediction (errors) and how to design and apply link prediction in practice. In addition to the recently published dataset of JIRA repositories [1], we share our code and analysis scripts to ease replication.

The remainder of the paper is structured as follows. Section 2 outlines our research questions, method, data, and the various classification models used. In Sect. 3, we explore the dataset and quantify repository, link type, and linking characteristics. In Sect. 4, we investigate linking inconsistencies in the dataset. Section 5 presents the performance results of the BERT model for five different (typed) link prediction tasks across 15 different issue repositories. In Sect. 6, we correlate the results of Sect. 5 with the characteristics from Sect. 3 to explain model performance. In Sect. 7, we discuss our findings, their limitations, and implications. Finally, we discuss related work in Sect. 8 and conclude the paper in Sect. 9.

2 Research setting

We outline our research questions and data, as well as the research method and machine learning models used.

2.1 Research questions and methodology

Motivated by (a) the importance of typed link prediction in issue management [13, 19, 20], (b) recent work on issue duplicate prediction [16], (c) recent advances of transformer-based machine learning techniques for natural language processing [25], as well as (d) the availability of large issue datasets [1], we studied typed link prediction in issue trackers focusing on four main questions.

  • RQ1. How do stakeholders use issue links? How do the links differ regarding the properties of issues they connect?

  • RQ2. What linking inconsistencies exist and how prevalent are they?

  • RQ3. How well can state-of-the-art machine learning predict different links in issue trackers?

  • RQ4. What prediction errors can be observed and what leads to wrong predictions?

For answering the research questions, we used a large dataset originally consisting of 16 public JIRA repositories [1]. This dataset matches our research goals. Not only is JIRA one of the most popular tools for issue management in practice; it is also well-known for its support for various link types, which can be customized depending on the project needs. Other available datasets often only focus on the specific link type Duplicate, which has been studied intensively in the software engineering literature so far [15, 16, 26]. Bugzilla also supports multiple link types such as Relate and Depend, but JIRA repositories usually include a larger variety of default link types such as Epic and Subtask.

RQ1 aims to understand how stakeholders use linking and link types in issue trackers. First, we looked at the different link types used across repositories. Then, we explored the project and maintainer boundaries. For example, suppose two different sets of maintainers are responsible for different issues in a repository; they might use link types and linking differently, which would create a boundary resulting in two issue graph components. Finally, we explored the properties of the link types, including the text similarity, the text length, and the difference in text length of linked issues for a specific type.

For RQ2, we mined and analyzed possible inconsistencies of specific link types, which we deduced from the definitions of the link types. The results of RQ2 inform us about underlying data quality concerns that could influence the machine learning model performance. For instance, if the dataset misses a large number of certain links, the model might struggle to predict that link type.

For answering RQ3, we built, trained, and compared multiple deep learning end-to-end models that predict the existence of a link with its type for an issue pair. This is our baseline prediction strategy. We then used the top-performing model and trained it for different prediction strategies to evaluate if we can improve the accuracy:

  • No Relate: Predict the link type (excluding the ambiguous Relate type) or the absence of a link for an issue pair.

  • Only Link Types: Predict only the link type and disregard Non-Links and Relate.

  • Categories: Manually group similar link types into categories [27], which we then predict instead of the more fine-grained types.

  • Only Linked: Predict the presence or absence of a link.

These different strategies still support stakeholders in their link management but might show different performances across the repositories.

For RQ4, we calculated the Pearson correlation between the characteristics studied in RQ1 and RQ2 and the prediction performance of each strategy (RQ3). This gives us insights into which characteristics (such as repository structure or the occurrence of inconsistencies) influence the model performance and when a specific strategy might be applicable. For instance, if a repository contains many inconsistencies for a specific link type, the prediction performance can suffer. Fixing the inconsistencies could then increase the performance. The results help design and adapt link recommender systems for different contexts.

2.2 Research data

Table 1 summarizes the analyzed issue repositories. The table shows the year of creation, number of issues, number of links, unique link types, coverage, number of projects, and the share of cross-project links. Coverage represents the share of issues having at least one link. The share of cross-project links is the share of links that connect issues of different projects in a repository.

Table 1 Studied JIRA repositories in alphabetical order collected in January 2022

The investigated repositories vary along the reported properties. The number of reported issues ranges from 1,867 to 1,014,926, while the number of links ranges from 44 links in Mindville to 255,767 links in Apache. The repositories also vary in terms of link types: Mindville uses 4 unique link types, while Jira (corresponding to the development of the JIRA issue tracker) and Apache use 16 unique link types. As 44 training points are too few for a stratified split, we excluded Mindville from the analysis. On average, the coverage is about 36%, meaning that 36% of all issues are part of a link. The coverage ranges from 4.0% in Mindville to 54.9% and 53.7% in Hyperledger and Mojang, respectively. Except in Jira, RedHat, and MongoDB, links rarely cross project boundaries: the majority (95% on average) of links are between issues of the same project. The Jira repository shows a high share of cross-project links (46.7%).

We excluded links to private issues from the analysis because we have no further information about private issues. We also removed multi-links, i.e., cases in which two public issues were part of multiple links with different types. This affected 1.1% of all available links. We also focus our study on the link types and disregard the link direction. This decision removes the potential influence of incorrectly specified link directions. Moreover, the link direction does not change the underlying link type semantics. We did not filter incorrect links (besides multi-links), meaning links that are falsely set between two unrelated issues or links with an incorrect type; for instance, two issues connected by a Block link where the correct type would be Cause. Furthermore, we did not check the dataset for missing links. There are likely cases of incorrect or missing links in the dataset; however, identifying their extent would require a study investigating a large sample with stakeholders of a repository.

2.3 Machine learning models

We focused our research on a minimal model input consisting of the title and description. These are universal issue properties (independent of the tracker and project at hand) and are usually available at issue creation time. We did not study the use of additional features for the prediction, such as the issue type or status, which might influence the prediction performance. Such features might contain information about the label (link types), which could potentially bias the model or overestimate its performance (for example, the resolution of duplicate issues could be “duplicate”). Furthermore, issue reporters might not be knowledgeable about the issue tracking system itself and set some properties incorrectly (for example, issue types [3]). A model based only on the title and description can be further enhanced based on specific project knowledge [14].

The labels are the link types used by the stakeholders in the repository. We also added Non-Links as a label by randomly selecting closed issue pairs that do not have the resolution duplicate. For instance, the model trained on Qt had eight labels: Relate, Subtask, Depend, Epic, Duplicate, Split, Replace, and Non-Links. We kept the number of Non-Links in the experiment equal to the mean count of the other labels; this yields enough data points for the model to learn from without creating an imbalance. In the strategy Only Linked, we created more Non-Links to balance the classes of links and Non-Links. We then randomly split the data for each repository into a training (80%) and a test set (20%), stratifying by link type. We took 20% of the training data for model validation, resulting in 64/16/20 train-validate-test sets.
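A minimal sketch of this splitting procedure using scikit-learn (variable names are placeholders; the full pipeline is in the replication package):

```python
from sklearn.model_selection import train_test_split

# pairs: issue-pair records; labels: link types incl. the Non-Link class
# (placeholder names). Stratification preserves the label distribution.
train_val, test, y_train_val, y_test = train_test_split(
    pairs, labels, test_size=0.20, stratify=labels, random_state=42)
train, val, y_train, y_val = train_test_split(
    train_val, y_train_val, test_size=0.20, stratify=y_train_val,
    random_state=42)  # 20% of 80% = 16%, i.e., a 64/16/20 split
```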

We checked the Software Engineering and the Natural Language Processing (NLP) literature to select the model architecture. In the NLP community, transformers such as BERT [25] and DistilBERT [28] have recently attracted much attention for various NLP tasks. In Software Engineering, different approaches for duplicate detection have recently been suggested, with two main model architectures. Single-Channel architectures [15] take the word representations (Word2Vec, FastText, or GloVe) and encode each issue separately with a siamese network consisting of a CNN, an LSTM, and a feed-forward neural network. Then, the two encoded issues are fed into another feed-forward neural network to determine if one issue is a duplicate of the other. Dual-Channel architectures [16] use the word representations of two issues (which are two matrices of word embeddings) to construct a tensor by stacking the two matrices on top of each other. Thus, in this architecture, the two issues are encoded jointly. We experimented with BERT, DistilBERT, and the Single-Channel and Dual-Channel architectures using FastText and Word2Vec. DistilBERT is a smaller transformer model trained to imitate BERT through knowledge distillation. It retains most of BERT’s general language understanding capabilities while using 40% fewer model weights.

In preliminary experiments, we found that BERT outperformed all other models in all setups: BERT outperformed DistilBERT by an average of 0.05 F1-score, the Single-Channel models by an average of 0.21, and the Dual-Channel models by an average of 0.26. The source code and results of all evaluated models are included in the replication package. In the remainder of the paper, we focus on discussing BERT’s results as the best-performing model.

We concatenated the title and description of each issue and fed the two resulting texts as one input sequence to the BERT model. Then, we tokenized this input with the tokenizer of bert-base-uncased / distilbert-base-uncased, which is a trained WordPiece tokenizer. We truncated the token sequence at 192 tokens with a longest-first strategy. That is, if possible, we kept all tokens and otherwise truncated the longer of the two issues first. The [CLS] token output of the BERT model was then fed into a dense-layer classification head which predicts the label of the link. For the training, we chose AdamW with the default learning rate of \(5e^{-5}\) and a weight decay of 0.1. We ran the training on NVIDIA Tesla K80 GPUs using the largest batch size that fit into the GPU memory, which was 96 for BERT and 128 for DistilBERT. We trained for 30 epochs, evaluated the model on the validation set after every epoch, and report only on the model with the highest validation F1-score.
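The following condensed sketch illustrates this setup with the transformers library; issue_a_texts, issue_b_texts, and label_tensor are placeholders, and the training loop is reduced to a single step:

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=8)  # e.g., Qt's eight labels

# Feed the two issue texts as a pair; "longest_first" removes tokens from
# the longer text first until the pair fits into 192 tokens.
batch = tokenizer(issue_a_texts, issue_b_texts, truncation="longest_first",
                  max_length=192, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
loss = model(**batch, labels=label_tensor).loss  # classification head on [CLS]
loss.backward()
optimizer.step()
```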

We report the F1-score per link type as well as the macro and weighted averages per repository. Detailed tables for recall and precision are included in the replication package. We also present the normalized confusion matrix for each repository. We chose the conservative macro F1-score as our primary metric to evaluate performance: weighted averages tend to overestimate model performance, as models tend to predict instances of majority classes better since there are more data points to learn from [29]. Furthermore, neither class weights nor SMOTE as strategies to counter the class imbalance showed any improvement.
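For illustration, these metrics correspond directly to scikit-learn’s averaging options (y_true and y_pred are placeholder label arrays):

```python
from sklearn.metrics import f1_score

per_type = f1_score(y_true, y_pred, average=None)        # one score per link type
macro    = f1_score(y_true, y_pred, average="macro")     # unweighted mean over types
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by support
```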

3 Issue link usage

In this section, we explore how the studied communities use link types and linking. We examine link type frequencies, how links structure the repositories, and the textual properties of linked issues. We aim to extract commonalities and differences in link types and linking across the repositories in the dataset.

3.1 Popular link types

Table 2 Popular link types and their usage shares in percent across the studied JIRA repositories

Out of the 33 unique link types found in the dataset, most appear in fewer than half of all repositories. Uncommon link types only represent a small share of the links. Table 2 shows the frequencies of link types that are used by at least seven repositories. The highest variance can be observed for Relate, Duplicate, Subtask, and Epic. We focus our study on link types that have a share greater than 1% in the respective repository. For instance, Qt’s Clone accounts for only 0.1% of all of Qt’s links and thus does not appear in the results in Sect. 5. Additionally, to ensure a minimum amount of comparability and generalizability across the repositories, we exclude link types from the analysis if their total share across all repositories is less than 2% (which leads to excluding Split and Bonfire Testing).
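Both filtering rules can be expressed compactly; the sketch below assumes a table of raw link counts per type and repository (a hypothetical layout resembling Table 2):

```python
import pandas as pd

def select_link_types(counts: pd.DataFrame, repo: str) -> list:
    """Rows of `counts` are link types, columns are repositories."""
    repo_share = counts[repo] / counts[repo].sum()
    total_share = counts.sum(axis=1) / counts.values.sum()
    keep = (repo_share > 0.01) & (total_share >= 0.02)
    return counts.index[keep].tolist()
```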

As a result, our analysis focuses on the following common link types: Relate, Duplicate, Subtask, Depend, Epic, Clone, Incorporate, Cause, and Block. A detailed description of the link types can be found in the dataset paper by Montgomery et al. [1]. We particularly note the difference between Duplicate and Clone: Duplicate links represent accidentally created reports of the same issue, whereas Clone links are automatically created when a user uses the “clone” feature of JIRA. The link types Block and Depend as well as Epic and Subtask also denote similar relationships. For instance, in Bugzilla, if issue A blocks issue B, then issue B is denoted as depending on issue A. However, as some repositories such as Apache and JiraEcosystem use both types in parallel, we refrained from merging them. While they might seem to denote the same relation, stakeholders do not always use them interchangeably. Similarly, Epic and Subtask links seem like the same link type from different directions. However, they each have their own independent fields in JIRA and thus denote different link types.

There are patterns across the repositories which might be explainable by the different issue-tracking functionalities [30]. Mojang uses mainly Duplicate links; 90% of all its links have the type Duplicate. Duplicate links are prominent when many individual software users independently create issue reports [31]. Considering that Mojang’s JIRA is used to track the bugs of the popular game Minecraft, it might make sense that they use JIRA as a communication tool with their user base. Hyperledger and JiraEcosystem prefer hierarchical types; they have many Epic and Subtask links, which could indicate a focus on using their issue repository to coordinate and structure their development. Repositories like IntelDAOS, Jira, MariaDB, MongoDB, Sakai, and Spring primarily use Duplicate and Relate links and fewer Subtask and Epic links. Some repositories, like Apache, JFrog, Qt, RedHat, SecondLife, and Sonatype, are more balanced and use the common types more or less equally. Lastly, some repositories define a certain link type and use it consistently, like Split in Qt or Bonfire Testing in MariaDB, RedHat, and Sakai.

Finding 1 Despite the large variety of issue link types used by open source communities, a common set of types covers the various repositories to a large degree (mean coverage of 96.6%), suggesting their universality. However, repositories differ greatly in the prevalence of those link types, which indicates the heterogeneity of linking and issue management in general.

3.2 Linking practices and repository styles

To investigate the structural characteristics of the repositories, we analyzed the links that connect different projects (cross-project links) and the stakeholders that maintain the links. Each repository includes different projects, which may be maintained by different stakeholders. The projects might be more or less connected depending on the internal organization and collaboration. For instance, if two teams are responsible for different sets of projects, their understanding and usage of issue properties and link types might differ. This can affect machine learning, as the usage patterns of a specific link type might differ across these teams, even if they belong to the same repository.

In the first step, we analyzed the number of issues per project. When manually exploring the repositories, we observed two trends. On the one hand, some repositories have a main project containing the majority of issues, for instance, Qt, with 97,172 issues in QTBUG followed by 25,249 issues in the next largest project QTCREATORBUG. On the other hand, some repositories are more scattered, with their largest projects having about the same number of issues, like Apache with 37,443 issues in SPARK, 35,390 issues in FLEX, and 26,421 issues in HBASE. To formally distinguish between such repositories, we propose a heuristic boundary based on the ratio between the largest and second-largest projects: a repository is project-centric if its largest project is at least twice as large as its second-largest, and project-distributed otherwise.
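As a sketch, this heuristic amounts to a simple ratio test over the per-project issue counts (hypothetical input format):

```python
def repository_style(project_sizes: dict) -> str:
    """`project_sizes` maps project keys to their issue counts."""
    largest, second = sorted(project_sizes.values(), reverse=True)[:2]
    return "project-centric" if largest >= 2 * second else "project-distributed"
```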

After that, we analyzed cross-project links and how they structure the projects inside a repository. If we view the projects of a repository as nodes and the cross-project links as edges, the connected components indicate the project structure of the repository. A high number of disconnected components could indicate lower internal collaboration across projects.

In the next step, we analyzed how maintainers structure the repository. We extracted all changes made to any link and the maintainer responsible for the change. We counted how many changes each maintainer made in the repository. We identified the subset of maintainers responsible for 90% of all link changes to distinguish stakeholders whose main task is link maintenance from others. After this, for each link maintainer, we looked up in which projects they made their changes. This way, we know with which projects each maintainer primarily interacts. With this information, we can build a bipartite graph with the projects on one side and the maintainers on the other. We used different thresholds for the share of changes of a maintainer (\(5\%\), \(15\%\), and \(25\%\)) to decide if an edge between a maintainer node and a project node exists. For example, suppose one maintainer did 50% of their link changes in Project A, 20% in Project B, 3% in Project C, and 27% in Project D. For the threshold of \(25\%\), we add edges from the maintainer to Projects A and D. For the threshold of \(15\%\), we additionally add an edge to Project B; Project C remains unconnected even for the \(5\%\) threshold. The number of components of this bipartite graph indicates collaboration from the perspective of maintainers.
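A minimal sketch of this construction with networkx, assuming `changes` maps each maintainer to their per-project link-change counts (a hypothetical input format):

```python
import networkx as nx

def maintainer_components(changes: dict, threshold: float = 0.15) -> int:
    """Count connected components of the maintainer-project bipartite graph."""
    graph = nx.Graph()
    for maintainer, per_project in changes.items():
        total = sum(per_project.values())
        for project, count in per_project.items():
            if count / total >= threshold:  # edge only above the threshold
                graph.add_edge(("maintainer", maintainer), ("project", project))
    return nx.number_connected_components(graph)
```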

Table 3 Projects and (link) maintainers across the repositories: total, active in the last 5 years, and last 2 years

Table 3 shows the results of this analysis: the total number of projects, the projects active in the last five and two years, the number of project components, and the number of maintainer components with varying thresholds. Out of the 1,276 projects, 85.3% were active in the last five years and 75.9% in the last two years, indicating the recency and significance of the dataset. Repositories with fewer projects tend to be more stable regarding activeness. Of the 27,882 link maintainers, 54.1% were active in the last five years and 29.6% in the last two years. The number of active maintainers is drastically lower than the total number, indicating that the understanding of link types may change over the years. Thus, we restricted the project and maintainer component analysis described above to the last two active years. Additionally, the number of maintainer components tends to be higher than the number of project components, so even inside project components, individual stakeholders could be responsible for different projects. We observed that project-distributed repositories tend to have more components than project-centric repositories.

Finding 2 Issue repositories differ in their structures: concerning the distribution of issues across the projects, the cross-project links, and link maintainer behavior. Some repositories, like Qt, have a main project with rather invisible boundaries. Other repositories have rather disconnected projects which can be considered as multiple sub-repositories (possibly representing different communities). When applying machine learning models on issue reports, repository structures should be considered.

3.3 Link type characteristics and similarities

We investigated textual properties corresponding to the link types. We calculated the TF-IDF vectors of the title and description of the involved issues and used cosine similarity to measure the similarity of issue pairs on a text-token basis. We also looked at the length of the textual descriptions of linked issues (word count of title and description) as well as the difference in length between two linked issues. On the one hand, long issues are often truncated by transformers, as these models only take in a limited number of tokens. On the other hand, short issues might not contain enough information to produce a reliable prediction. Therefore, text length is a factor that potentially influences the applicability of transformer models for link prediction.
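A sketch of this computation with scikit-learn, assuming two aligned lists holding the concatenated title and description of each side of the linked pairs (placeholder names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_similarities(texts_a, texts_b):
    vectorizer = TfidfVectorizer()
    vectorizer.fit(texts_a + texts_b)            # shared vocabulary for both sides
    va, vb = vectorizer.transform(texts_a), vectorizer.transform(texts_b)
    return cosine_similarity(va, vb).diagonal()  # one score per issue pair
```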

Table 4 Median cosine similarity scores of linked issues on text-token-level

Table 4 shows the median cosine similarity of two linked issues. We see that Clone-linked issues are the “most similar” with an average of 0.83. This makes sense, as these links are created by a JIRA feature that clones the text, with only small changes made by the issue creators. It might be interesting to investigate Clone links with dissimilar issues, particularly whether the dissimilarity is due to incremental changes over time. Duplicate links connect the second most similar issues, with an average cosine similarity of around 0.36 and a large gap to the similarity score of the Clone links. Mojang, which mostly uses Duplicate, shows a very low similarity for its Duplicate issue pairs (0.21) and its Clone issue pairs (0.39). Clone-linked issues are also comparatively dissimilar in SecondLife (which does not use Duplicate links) and Spring. Relate links connect the third most similar issues with a mean score of 0.29. Jira’s Relate links are used for highly similar issues (0.89). Interestingly, Epic- and Subtask-linked issues are very dissimilar. Since these are hierarchical relationships, one might assume that one of the issue texts is contained in the other.

Fig. 1 Boxplots for the length metrics

Figure 1 shows boxplots for the median text length and median text length difference of all repositories across link types. We observe that Epic and Subtask tend to connect shorter issues than other link types, while issues connected via Cause, Relate, Duplicate, and Block are more verbose. Moreover, in line with the similarity results, we can observe that two issues linked by a Clone link are very similar in length, while other link types tend to have a higher difference in word count. Most link types are similar in text length and text length difference across repositories, except Block and Incorporate. Detailed numbers and tables are in the replication package. Further analysis of the type of text contained in the issue descriptions, such as stack traces or code fragments, could reveal further differences and commonalities that affect text length.

Finding 3 The textual similarity of two linked issues varies strongly across the link types. Clone issues are the most similar in text, followed by a wide gap and then by Duplicate, indicating that these two link types are syntactically different. Relate follows closely behind Duplicate. The structural link types Epic and Subtask tend to connect short and rather token-dissimilar issues (similar to Non-Links). Cause, Relate, and Duplicate are among the link types connecting the longest issues.

4 Inconsistencies in issue linking

In this section, we present possible inconsistencies we observed during the analysis of the dataset. These inconsistencies can potentially affect the applicability of any machine learning model. If the training data has quality issues, the model and its output could be compromised.

4.1 Empty descriptions

When familiarizing ourselves with the dataset, we saw cases of issues containing neither a title nor a description. Almost all issues have a title. However, 4.2% of the issues have no description. We analyzed the description lengths of individual issues and found that MariaDB has the longest descriptions overall (939.60 characters on average) and Sonatype the shortest (392.57 characters on average). Some issue repositories have many issues without descriptions, namely Mindville (25.91%), JFrog (13.16%), and SecondLife (9.80%). However, in the majority (9 out of the 16 repositories), more than 99% of all issues have a description.

4.2 Faulty links and reference comments

Other inconsistencies in the dataset are missing or incorrect links. Fully identifying and analyzing those links would require detailed input from the stakeholders of the corresponding projects. However, a partial analysis is possible based on the data alone. Incorrect links refer to links that should not be there or links with a wrong link type. Multi-links are a subcategory of incorrect links, particularly in the case of contradicting link types; non-multi-links could also potentially be incorrect. One case of missing links are known but undocumented links, which are, e.g., discussed in the comments. Therefore, we identified and analyzed the comments mentioning another issue report. We refer to them as reference comments.

To identify reference comments, we used regular expressions following the naming schema of issue reports in JIRA. An issue key consists of the project name in uppercase letters, followed by a hyphen and a number, making it easy to filter comments that contain the name of another issue report. As we do not have the comment data for MariaDB and Mindville, we omitted these repositories from this analysis.
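A minimal sketch of such a filter; the exact expression may differ, and the pattern below assumes the default JIRA key schema (project keys may also contain digits):

```python
import re

# e.g., "SPARK-1234": an uppercase project key, a hyphen, and an issue number.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def is_reference_comment(comment: str) -> bool:
    return ISSUE_KEY.search(comment) is not None
```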

Table 5 Issue reports with comments mentioning other issues

Table 5 shows the share of reference comments, the share of issues with such comments, and the number of issues with a reference comment that do not have any links. On average, 16.7% of all comments mention another issue report and 28.2% of all commented issue reports have a comment mentioning another issue. Of these issue reports with reference comments, almost half include a link and the other half do not. While we found no link in 95.6% of Sonatype issues with reference comments, for Mojang, only 8% of all issues with a comment mentioning another issue do not include any link.

Finding 4 Missing issue descriptions are rare, but could be considered a quality defect, as a title alone might not include enough information to inform an accurate link prediction. Furthermore, issue referencing is fairly popular in issue discussions. Our preliminary analysis of the discussions suggests that about 14% of all commented issues miss links in the reports.

4.3 Unfitting issue type

JIRA offers several default issue types, and stakeholders may add customized types as well. We read through the most frequent issue types of each repository. The common types were Bug, Subtask, Task, Improvement, New Feature, Story, and Epic. Most repositories mainly consist of Bugs. Exceptions are Jira, Mindville, IntelDAOS, and Sonatype. Jira and Mindville contain more Feature Requests than Bugs. IntelDAOS frequently uses User Stories. Finally, Hyperledger, MariaDB, MongoDB, and RedHat frequently use (Sub)Tasks, while MongoDB, Apache, and Spring frequently use Enhancements.

Some issue types and link types have an apparent relationship. Most prominently, an Epic or Subtask link should involve at least one issue with the corresponding issue type (or a similar one). In the whole dataset, only JiraEcosystem (2 cases) and MongoDB (1 case) had instances of Epic links where neither issue was an epic. Thus, repositories that use Epic as a link type have a high quality for those links. As for Subtask links, most repositories use these links between two issues where one issue is a task, subtask, or another equivalent issue type. Here, we found more exceptions than for Epic. Apache has 320 Subtask links (0.4% of all its Subtask links) with incorrect types, and Jira has one case where both issues have subtask as their issue type. Spring has 66 cases where a Subtask link is set for issues with type Backport.

We found further inconsistencies in the dataset when looking at the reverse, i.e., issues of the type epic or subtask.

Table 6 Issue reports with type Epic or Subtask with no correct link

Table 6 shows the count and share of epic and subtask issues with no links to another issue. However, as described in Sect. 2, parts of an ITS can be private. During a manual investigation, we noticed that this is often the case for epics. So, while our dataset shows apparent inconsistencies, the seemingly missing links might actually point to private issues.

As other link and issue types do not have such a tightly coupled relationship, we cannot conduct a similar analysis without involving the stakeholders.

4.4 Inconsistent status and resolution properties

Status and resolution are joint properties, as resolutions are only set for closed issues. We mapped the issue status to open, closed, and in progress by using the definitions provided by the JIRA repositories. The repositories can decide if a (custom) status counts as open, in progress, or closed. There were cases where the status and resolution of an issue do not match, pointing towards bad issue report quality (for example, an open status with duplicate as resolution). There were also cases with the status set to closed but without a resolution. We did not further investigate these cases, as they do not directly impact issue linking.

Table 7 Links with type Duplicate and incorrect resolution properties

As in the previous section, an apparent causal relationship exists between the link type Duplicate and the resolution duplicate. Table 7 shows inconsistencies between Duplicate links and duplicated issues. The first half of the table shows the percentage of these links where none of the issues has the resolution duplicate and the percentage where both issues have the resolution duplicate. On average, 22.9% of Duplicate links connect issues where neither is resolved with duplicate, which indicates faulty information in either the resolution or the Duplicate links. In 2.96% of Duplicate links, both issues have the resolution duplicate, which is also questionable. In total, almost 26% of all Duplicate links seem flawed, the exception being Mojang. These cases could be automatically filtered, investigated, and fixed by stakeholders.
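This consistency check can be expressed in a few lines; the sketch below assumes a table with one row per Duplicate link and hypothetical columns holding the resolutions of both endpoints:

```python
import pandas as pd

def duplicate_link_inconsistencies(links: pd.DataFrame):
    """`links` has the columns resolution_a and resolution_b (placeholders)."""
    dup_a = links["resolution_a"].eq("Duplicate")
    dup_b = links["resolution_b"].eq("Duplicate")
    neither = ~(dup_a | dup_b)  # suspicious: no endpoint resolved as duplicate
    both = dup_a & dup_b        # also questionable: both endpoints resolved
    return neither.mean(), both.mean()
```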

Fig. 2 Bug MCL-7813 closed as a duplicate but with a comment from a bot mentioning the original issue report in Mojang

The second half of the table shows the percentage of duplicated issues that have no link, the percentage of these issues that have neither a Duplicate nor a Clone link, and the share of duplicate issues without a link that have a reference comment. For this analysis, we used the uncleaned dataset and counted the Clone link type as Duplicate in case the stakeholders use these interchangeably, so as not to underrate the quality of the Duplicate links. We observe that 33.18% of all duplicate issues are not linked at all and 44.72% of all duplicate issues have no Duplicate or Clone link. Of these missing links, 46.6% are known but not documented. Overall, the link data quality of Duplicate links and duplicate issues in the JIRA repositories seems lacking. Figure 2 shows an example of a duplicate issue from Mojang in which the original issue report is not linked but only mentioned in a comment by a bot.

Finding 5 Overall, the link quality of Epic and Subtask links seems high, as almost all those links connect two issues with fitting issue types. However, Epic issues are more difficult to evaluate, as some are private. In contrast, the quality of Duplicate links seems rather lacking: over a fifth of all Duplicate links connect issues where neither has the appropriate resolution. Moreover, 45% of duplicated issue reports are missing a Duplicate or Clone link that points to the original issue report, and 46.6% of all duplicate issues without a link include a comment mentioning another issue.

5 Link prediction

After analyzing the characteristics of link (type) usage and identifying some inconsistencies, we discuss the results of the BERT-based typed link prediction model described in Sect. 2.3. We also examine other prediction strategies that aim at improving the accuracy.

5.1 Baseline prediction

Table 8 F1-scores for predicting issue links and their types across the studied JIRA repositories

Table 8 shows the common link types and their respective F1-scores for each repository. We also present the mean, median, and standard deviation per link type as well as the macro and weighted F1-score per repository. The values partly differ from those reported in our previous work [24] because we reran the training with a different random initialization for the extended evaluation reported in this paper. Overall, the model achieves a macro F1-score of 0.63 on a multi-class problem with a median of 7 classes per repository. The weighted F1-score rises to 0.73.

As expected, the results differ across repositories. JFrog only has a macro F1-score of 0.46, while Mojang reaches 0.88. Some of the variance across the repositories can be explained by the size of the training set. For instance, JFrog, SecondLife, and Sonatype contain less than 5k links each, whereas Mojang contains roughly 200k links. We also observe that the performance of the model differs per link type. The model predicts Subtask and Epic links consistently with top performance, while the predictions for Duplicate, Depend, Incorporate, and Cause seem less accurate. The link types Relate, Clone, and Block show mixed results across the repositories. Non-Links show good performance for all but one repository.

Figure 3 plots the macro F1-score against the standard deviation of the F1-scores across the link types for each repository. A higher standard deviation means that the model performs well for some but not all link types; a low standard deviation means that the model performs similarly for all types. Mojang, with a lot of available training data and only three predicted classes (Duplicate, Relate, and Non-Links), performs best (highest macro F1-score with lowest standard deviation). The next cluster consists of Hyperledger, Jira, MariaDB, and MongoDB with macro F1-scores ranging from 0.72 to 0.75. Then, IntelDAOS, Qt, RedHat, Sakai, and Spring perform slightly worse: their macro F1-scores lie between 0.62 and 0.67. The last cluster consists of Apache, JFrog, JiraEcosystem, SecondLife, and Sonatype. These repositories, all with lower coverage, have a macro F1-score below 0.60. The case of Apache is particularly interesting, as it has the highest number of links and issues and one of the highest numbers of predicted classes. However, from Sect. 3 we know that Apache is the most fractured repository, which might affect prediction performance.

Fig. 3 Performance per repository according to macro F1-score and standard deviation for the studied link types in the baseline model

Finding 6 A general BERT model applied to issue titles and descriptions predicts the typed links with a promising mean macro F1-score of 0.63 across 15 repositories. Some repositories and link types show a top performance, while the model achieved moderate performance for others.

5.2 Individual link types and confusion analysis

Fig. 4 Normalized confusion matrices for each repository. Rows are sorted by the support of the link type, meaning the majority class is always in the first row; the columns are in the same order as the rows

We also looked closely at the various link types. Figure 4 shows the confusion matrices for each repository, ordered by the number of data points per link type in the test set. We identified which link types confused the model and were consequently mislabeled. Overall, Relate often represents the majority class and is commonly predicted by the model for other link types. This sometimes happens for Duplicate, but on a much smaller scale. We also observe that Clone and Duplicate are well distinguished by the model.

5.2.1 Relate links

These links are provided by default in JIRA. Relate links are particularly ambiguous, as they describe any underlying relationship between two issues. They can potentially express any link type in the repository or an undefined link type, or stakeholders might use them as a placeholder because the exact type is yet to be determined. This could explain why the model often mistakes other link types for Relate links. Nevertheless, Relate links are predicted fairly well, with an average F1-score of 0.69 and F1-scores ranging from 0.54 to 0.88.

5.2.2 Duplicate and clone links

We initially assumed that the model would struggle to distinguish Clone from Duplicate links and vice versa. Surprisingly, the class Clone did not confuse the model. Clone and Duplicate were rarely mistaken for each other, as shown in the confusion matrices of Apache, Hyperledger, IntelDAOS, Jira, JiraEcosystem, RedHat, and Sakai. This suggests that they exhibit a difference that the model can recognize. Duplicate links have rather mixed F1-scores, ranging from 0.27 to 0.97. Only in Jira and Mojang were Duplicate links classified more precisely, with F1-scores of 0.70 and 0.97, respectively; the other repositories achieved F1-scores of at most 0.55. In contrast to Duplicate issues, which are usually created independently by different stakeholders, Clone links are usually created intentionally via the “Clone Issue” feature of JIRA. These links seem to be easier for the model to predict, with F1-scores ranging from 0.53 to 0.84.

5.2.3 Subtask and epic links

The model showed top scores for classifying Subtask and Epic links with average F1-scores of 0.89 and 0.97, respectively. Subtask links range from 0.80 to 0.95 and Epic links from 0.96 to 0.99, both with a low standard deviation. Subtask and Epic links have an extra section in the issue report in JIRA and are not grouped under the “Issue Links” section. Surprisingly, the model is able to differentiate Subtask links from Epic links, although the performance for Subtask is slightly lower when the repository also includes the type Epic. Since Epic and Subtask have the highest prediction performance and can be differentiated from each other as well as from Non-Links, we can conclude that the model seems to learn beyond the lexical similarity of issues.

5.2.4 Depend, incorporate, block, and cause links

All of these link types show either varying or relatively low performance. The F1-scores of Depend range from 0.23 to 0.75. This could be explained by the link type share: Qt, with the best performance, has approx. 16% Depend links, followed by MongoDB with approx. 23%. When the performance is low, we observed that either Depend is not used frequently in the repository (absolute count \(<300\)) or it is confused with Relate (as for Spring).

The performance for predicting Incorporate links also varied, with a standard deviation of 0.28 and F1-scores ranging from 0.00 to 0.78. In JFrog and SecondLife, the number of training examples was less than 50 (44 and 14, respectively), explaining the 0.00 in both rows. In cases where the performance is high, there are at least 1,000 training examples in the repository.

Block links had a moderate performance with a standard deviation of 0.09 and F1-scores from 0.41 to 0.66. Even though three repositories frequently use this link type, it only has an average share of 4.8% across all repositories. It seems that the model only struggles to distinguish this link type from Relate, but not from other, semantically rather similar, link types such as Depend or Cause. We can also observe that Block and Depend are generally used separately: either Block is used or Depend, except in Apache and JiraEcosystem. Finally, we observe that the performance for Block is high if the repository uses this type in more than 10% of its links.

Cause links have a very low average share and performance, with F1-scores ranging from 0.11 to 0.45. MariaDB had the best performance, but even there, the model was unable to discern this link type from Relate. The reason might simply be that a causal relation is harder to learn, compounded by a lack of example data. The link type Cause, performing worst, bears the highest difference in text lengths and the longest texts. One possible explanation is that the corresponding descriptions might contain a lot of information and require in-depth human reasoning. Furthermore, the BERT model truncates the input (at 192 tokens in our setup) and will likely not see the full text of both issues, which could be a factor in the limited performance for Cause links.

5.2.5 Non-links

We observed an encouraging performance for predicting Non-Links, except for IntelDAOS. Overall, the model did not struggle to distinguish links from Non-Links. On average, Non-Links have the third best prediction performance after Epic and Subtask links. The case of Mojang, which only has three classes to predict, is an indicator of how well the model can perform as a predictor of “linked issues” vs. “non-linked issues”.

Finding 7 The BERT model is accurate at predicting the mere existence of links. The structural link types Subtask and Epic perform very well across all repositories. Duplicate does not perform well overall, but the model can distinguish Clone links from Duplicate links. Depend, Incorporate, and Block can be distinguished when they are frequently used. Cause corresponds to the lowest prediction performance. Relate seems to be the most confusing for the model as it is likely a jack-of-all-trades type used to label a link when stakeholders are unsure which other type fits.

5.3 Other prediction strategies

To use the prediction model in practice, its performance must be high. Therefore, we defined and evaluated different prediction strategies that would likely improve the performance while still providing practical support to stakeholders during issue management.

We derived four link prediction strategies from the prior analysis. The first strategy, No Relate, omits Relate links, as they are unspecific by nature and might lead to many false predictions, as seen in the previous subsection. The second strategy, Only Link Types, predicts only the specific link types, leaving out Relate and Non-Links. Potentially, this strategy could be used to turn Relate links into specific link types, which could refine the dataset and therefore improve the performance of the original prediction task. Alternatively, a two-step process could first identify links and then specify their type. The third strategy, Categories, changes the prediction task by grouping similar link types [27], leaving the stakeholders to decide on the specific type. The fourth strategy, Only Linked, disregards all link types and predicts whether any link exists, again leaving the stakeholder to choose the specific type. Conceptually, each strategy is a filtering or relabeling of the baseline training data, as sketched below.
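The following sketch illustrates the relabeling per strategy (with hypothetical names; `categories` maps a link type to its group from [27]):

```python
def relabel(label: str, strategy: str, categories: dict):
    """Return the training label for a strategy, or None to drop the example."""
    if strategy == "baseline":
        return label
    if strategy == "no_relate":
        return None if label == "Relate" else label
    if strategy == "only_link_types":
        return None if label in ("Relate", "Non-Link") else label
    if strategy == "categories":
        return categories.get(label, label)  # e.g., Subtask -> Composition
    if strategy == "only_linked":
        return "Non-Link" if label == "Non-Link" else "Link"
    raise ValueError(f"unknown strategy: {strategy}")
```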

Table 9 Comparison of the macro F1-scores of the different prediction improvement strategies and the baseline from Table 8

Table 9 shows the macro F1-score of each strategy per repository and its improvement over the baseline prediction model. The confusion matrices for each strategy are in the replication package. The second- and third-to-last rows show the mean and median of the rows above; therefore, the mean or median F1-score of a strategy may differ from the mean or median F1-score of the baseline.

5.3.1 No relate

First, we observed that Relate is a problematic link type to predict and is often confused. We removed Relate from the training and test sets and retrained our models. The average macro F1-score across all repositories improved to 0.70, which is partly expected, as we removed one class. SecondLife is the only repository where this task setting leads to a performance deterioration. The largest improvements are for the classes Duplicate (+0.22), Block (+0.11), and Cause (+0.14). However, the confusion matrices show some mix-ups; in particular, Duplicate seems to be predicted for other link types. We see the same for Depend and Block, but on a smaller scale.

5.3.2 Only link types

The model for this strategy is trained on all links except Relate links, and without Non-Link examples. This alleviates the prediction problems caused by the likely poor data quality of the Relate link type. We observe patterns similar to the baseline and the No Relate strategy. However, this prediction strategy leads to higher performance for IntelDAOS, MongoDB, and Sonatype in comparison to the No Relate strategy.

5.3.3 Categories

With a mean of 10.19 link types per repository, some types are likely semantically related or partly redundant, such as Epic and Subtask or Block and Depend. Stakeholders in a project are better aware of which link types are similar. As our model is agnostic to the specific label set, no changes are needed other than grouping the labels of certain types into a category. A stakeholder with knowledge about the project can then choose the correct link type from the predicted category. The problem of General Relation links is also present in this case. Composition links are the easiest to predict, while Duplication, Temporal/Causal, and Workflow links are harder. Unfortunately, the uniform grouping of the link types did not improve the average performance overall. Apache and Sonatype improved significantly for this task, while Spring decreased significantly (\(-0.41\)). This performance decrease is likely due to the heterogeneity of the link types even when they belong to the same category.

5.3.4 Only linked

Here, the model is trained on a dataset where all linked issues are in one class and randomly created Non-Links are in the other class. We selected an equal number of Non-Links and linked issue pairs to obtain a perfectly balanced dataset. This model performs extremely well, with an average macro F1-score of 0.95. Only for IntelDAOS and SecondLife was the macro F1-score less than 0.93. The good performance is expected, as the task is easier: there are only two classes to predict and numerous training points.

Finding 8 Our end-to-end model for predicting links (without types) achieved the best overall performance with a mean macro F1-score of 0.95: high enough to be applicable in practice. Removing the general-purpose (and often confused) link type Relate also helps lift the macro F1-score to \(\ge 0.80\), but only for 4 out of the 15 studied repositories. Grouping semantically similar link types for prediction barely makes any difference on average.

6 Performance analysis

In this section, we explore possible causes for the performance differences across the repositories and link types.

6.1 Repository properties

We calculated the Pearson correlation and p-value between different repository characteristics from Sects. 3 and 4 and the macro F1-scores of the various strategies. Table 10 shows the results. In the following, we discuss the significant correlations for each model. Overall, the significant correlations differ between the prediction strategies, particularly between typed and non-typed link prediction.
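A minimal sketch of the analysis behind Table 10, assuming each repository characteristic and the macro F1-scores are aligned per repository (placeholder names):

```python
from scipy.stats import pearsonr

def significant_correlations(characteristics: dict, f1_scores, alpha=0.05):
    """`characteristics` maps a property name to its per-repository values."""
    results = {}
    for name, values in characteristics.items():
        r, p = pearsonr(values, f1_scores)
        if p < alpha:            # keep only significant correlations
            results[name] = r
    return results
```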

Table 10 Pearson correlation coefficients between macro F1-scores of the link prediction strategies and properties of the repositories with p values below 0.05

Baseline. The baseline model’s performance seems independent of the number of issues in the issue tracker. As expected, we observe a strong positive correlation with the coverage. Moreover, we observe positive correlations with the number of links per project and the number of issues and links per maintainer. We also observe negative correlations with the quality characteristics of Duplicate links.

No relate. Looking at the correlations, we observe that the performance of this strategy depends on similar properties as the baseline approach. The difference is that the share of multi-links in a repository influences the performance negatively (\(-0.69\)).

Only Link Types. Predicting only link types seems to correlate negatively only with the share of Subtask links in the repository.

Categories. The performance of this strategy correlates positively with the coverage (0.61), something we observed only for this strategy.

Only linked. This model’s performance depends positively on the number of issues (0.59) and the number of links (0.69). Its performance depends negatively on the number of multi-links (\(-0.54\)) and the share of issues with empty descriptions (\(-0.64\)).

Table 11 shows the correlations between the analyzed link type properties, averaged across repositories, and the corresponding macro F1-scores.

Table 11 Correlation coefficients and p values between the macro F1-scores of the classifier and the properties of the link types

We observe that the only significant correlation is with the text length. Cosine similarity only has a small, insignificant correlation, confirming that the model is not simply making predictions based on text similarity. Additionally, we observe that links mistaken by the model for Relate links often connect lengthier issues than correctly classified links and other mislabeled links (aside from Relate).

Finding 9 Depending on whether to predict the mere existence of links or to predict the link types, different properties of the repositories have significant correlations with the prediction performance. While the prediction of links depends on the number of issues and links in the repository, predicting the link types correlates with the coverage. A higher coverage can indicate a more “careful/thorough” linking. Moreover, the more issues and links a maintainer is responsible for, the more homogeneous a repository (and the linking) is likely to become. More homogeneous repositories have higher accuracy for predicting link types.

6.2 Link type properties

We explored individual link types in depth. Table 12 shows the correlations of each link type and its properties in the repositories. For this, we excluded the outlier Mojang as it has only three types and very good performance.

Table 12 The correlation coefficients between the F1-scores and properties of the link types

Unsurprisingly, we observe that the share of a link type in the training data correlates positively with the achieved performance, except for Epic and Block. This observation is significant for Relate, Duplicate, Depend, and Incorporate. It aligns with our previous observation that Depend and Incorporate require more data to be classified precisely.

Finally, we calculated the Pearson correlations between the performances of different link types; that is, the presence of one link type might impact the prediction performance of another, as with Depend and Block. We found that Subtask is harder to predict if Epic is present in the repository and vice versa, with a correlation of \(-0.85\). Similarly, Block and Duplicate influence each other negatively (\(-0.85\)). Furthermore, the link type pairs Block and Clone, Block and Cause, and Cause and Incorporate strongly correlate positively (\(\ge 0.9\)) with each other. Clone and Incorporate (0.84) as well as Epic and Relate (0.76) also positively depend on each other.

Finding 10 Generally, the text length of linked issues correlates strongly negatively (\(-\)0.725) with the model performance, so shorter issues seem less confusing for the model. Links mislabeled as Relate tend to connect lengthier issues than a) correctly classified links and b) other mislabeled links. Unsurprisingly, the share of a link type correlates positively with its performance for the link types Relate, Duplicate, Depend, and Incorporate.

7 Discussion

We studied how practitioners use links and link types in Sects. 3 and 4. Despite the rather large variety of link types originally identified in the dataset and despite the heterogeneity of the repositories, we found a set of common link types that covers most links (\(\ge 95\%\)) across repositories. This suggests that most linking needs of stakeholders can be satisfied with these nine link types. However, we saw that the prevalence of specific link types differs greatly across repositories, indicating different issue linking styles (and likely different issue management styles in general). Furthermore, when looking at the projects and maintainers in the repositories, we found that some repositories seem to be rather a collection of different sub-repositories, while others are more interconnected. Additionally, we observed multiple logical inconsistencies in the issue links and other issue properties, which could suggest careless linking, possibly compromising the quality of the data used to train a prediction model and thus the prediction reliability in general.

In the following, we discuss the implications of our results for issue management practice, particularly the applicability of the link prediction models. We then discuss the importance of data quality as a prerequisite for correct prediction. Finally, we outline further research directions and summarize the main threats to validity.

7.1 Applicability of typed link prediction in practice

While previous work has intensively studied duplication links (details in Sect. 8), this work is among the first to study the reliable prediction of issue links of different types (i.e., typed links). One immediate use case is to recommend missing issue links to stakeholders or to highlight incorrect link types. Our state-of-the-art BERT model achieved an average macro F1-score of 0.63 and a weighted F1-score of 0.73 across the 15 studied repositories. These are promising results considering that the median number of predicted classes is 7 (as opposed to, for example, the simpler binary classification for duplication). When changing the task to only predicting the existence or absence of links between issues, we reach an average macro F1-score of 0.95, good enough to be applicable in practice. In repositories that contain many incorrect links, the model might perform worse. Furthermore, missing links in the training data will likely reduce the model’s ability to predict missing links in new data.
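The gap between the two scores reflects the class imbalance: macro F1 weights every link type equally, while weighted F1 weights each type by its frequency. A minimal scikit-learn sketch with hypothetical labels (not our actual predictions) illustrates the difference:

from sklearn.metrics import f1_score

# Invented gold and predicted link types
y_true = ["Relate", "Relate", "Relate", "Subtask", "Duplicate", "Epic"]
y_pred = ["Relate", "Relate", "Subtask", "Subtask", "Relate", "Epic"]

macro = f1_score(y_true, y_pred, average="macro")        # every class counts equally
weighted = f1_score(y_true, y_pred, average="weighted")  # frequent classes count more
print(f"macro = {macro:.2f}, weighted = {weighted:.2f}")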

Our baseline model is fairly general, as it is end-to-end (without feature engineering) and only uses the titles and descriptions of the issues as input features. Additionally, even smaller repositories like IntelDAOS (2599 links) showed fair performance. Indeed, we found no significant correlation with the number of links (the training data) when predicting various typed links. Our model works on mostly original data with as little pre-processing as possible. The model can be further optimized to consider individual repositories, projects, and internal workflows.
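For illustration, the following sketch shows one way such an issue pair can be encoded for a BERT classifier, assuming the Hugging Face transformers library and a standard bert-base-uncased tokenizer; the issue texts are invented:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Invented issue texts: title and description simply concatenated
issue_a = "Login fails after update. Steps to reproduce: open the app and sign in."
issue_b = "Add OAuth support. As a user, I want to sign in via OAuth."

# Sentence-pair encoding: [CLS] issue_a [SEP] issue_b [SEP];
# longer pairs are truncated to BERT's limit of 512 tokens
encoding = tokenizer(issue_a, issue_b, truncation=True,
                     max_length=512, padding="max_length",
                     return_tensors="pt")
print(encoding["input_ids"].shape)  # torch.Size([1, 512])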

The model was quite precise at predicting the hierarchical links Epic and Subtask. It also showed similarly high accuracy for non-linked issues. However, the model was not as good at identifying Duplicate links. Counter-intuitively, it did not struggle to distinguish Clone from Duplicate links. This is likely due to the way Clone links are created: JIRA allows stakeholders to clone an issue, which creates a new issue with all the properties of the cloned issue as defaults. This explains the high cosine similarity of Clones. In contrast, the issues of a Duplicate link are usually created by two different contributors, who are often unaware of each other’s reports. As only Clone links are created automatically, a link prediction tool could also ignore them.

Other link types need a critical mass of examples to achieve good results. The link type Cause has the lowest accuracy, which might be due to the fact that it is barely used or due to the length of the connected issue pairs. We truncated the texts, meaning that longer texts were cut. This affected link types that connect issues with longer texts. One reason for long texts is non-natural language fragments such as stack traces or code. It is also possible that stakeholders incorrectly link these issues with the Relate link type. While manually reviewing the data, we observed issues linked through Relate that contain stack traces or other pieces of non-natural language.

7.2 Data quality as prerequisite for correct prediction

Since we only use the textual descriptions for training the link prediction models, we assume that the quality of the issue text (i.e., missing, redundant, or ambiguous information) directly impacts prediction quality. In Sect. 4, we observed inconsistencies that affect the quality of the data, which we analyzed in conjunction with the performance of the model to find significant correlations in Sect. 6. We saw that the Only Linked prediction model correlates negatively with the absence of description texts. Other quality indicators also showed a negative dependency with the model performance. For instance, the issues BE-213Footnote 9 and BE-94Footnote 10 in Hyperledger are linked as Clone and mislabeled as Epic by our model. Both have only a title (“Footer Components” and “Header Components”) and an empty description. Similarly, the issues CB-12181 and CB-12176 in Apache are linked as Depend and mislabeled as Subtask. While the first issue has “See subtasks” as its description, the second issue is missing a description. Such low data quality hurts prediction quality, as the model receives less information.

In Sect. 4, we saw that duplicated issues are in general of lower quality, with many missing links. We also observed that the performances of the baseline and the No Relate task correlate with duplicate quality. Interestingly, the number of known but undocumented duplicate links correlates positively with the performance of the task. This needs further investigation.

The link types Subtask and Epic can be distinguished quite well from the text alone, likely due to the unique way they are treated in JIRA. The main difference to other link types is that they have their own sections in the issue view. Thus, stakeholders might treat them more carefully. Furthermore, Epic and Subtask are often used to structure the ITS. It is likely that Epic and Subtask links are created with intent and care, usually by stakeholders deeply involved with the repository in a planning role. We saw in Sect. 4 that the average quality of these links is fairly high in the dataset. These links are likely created at issue creation time during analysis and planning tasks. This likely impacts the quality of the issue text as well as the correctness of the links.

In general, the quality of the issues and the linking (i.e., the correctness of the link and its type) is only as good as the carefulness of the people who create them. This is apparent in the seemingly rather average to low quality of Relate links, the most popular link type and at the same time the one most confused by the model. One interpretation is that typed link prediction suffers from the low quality of Relate links, which stakeholders misuse when they are uncertain whether another, more specific type applies. Thus, Relate might contain many data points that should have another label. For instance, the issue ZOOKEEPER-3920Footnote 11 in Apache has a Relate link to ZOOKEEPER-3466 and ZOOKEEPER-3828, while the comments discuss that this should be a Duplicate. When Relate links are removed from the prediction task, we achieve less confusion, and the macro F1-score improves to 0.70. However, not all repositories benefit equally from this task setting.

Finally, issue data quality might be affected by the role of the stakeholders creating the issues or the links (user, developer, analyst, product manager, etc.) [32]. Zimmermann et al. [33] found that most issues reported by users miss important information for the developers, such as steps to reproduce or stack traces. As users are often unaware of what makes a good issue report, they are likely to create lower-quality issues. In contrast, analysts or developers are more aware that certain details are important to implement a requirement, fix a bug, or complete a task. Additional investigation of the impact of project roles on data quality in issue trackers is needed to draw better conclusions.

7.3 Further implications for research

Overall, we think that the diversified investigation of issue link types opens up completely new perspectives to better understand issue tracking practice, how it should change, or how it can be supported by better tools. In the first part of this paper, we investigated issue link usage. This analysis is a first step to capturing and understanding how practitioners manage links. It should be examined further with qualitative and experimental methods to explore commonalities and differences across repositories.

We examined several inconsistencies for Duplicate, Epic, and Subtask links and issues. The effect of these link inconsistencies on issue tracking and issue management (as with issue-tracking smells [34] and community smells [35] in general) needs to be studied in more detail. For a complete picture, this analysis should be extended to the other link types. For instance, causal or temporal link types (such as Block, Depend, or Cause) pose a difficult challenge. The priority, severity, and fix version need to be consistent with the direction of these links. For example, blocking issues should be fixed in an earlier or the same version as the blocked issues [36]. However, parsing and ordering fix versions is not trivial, as it depends on the individual release planning of the community. For this, we would need to interview stakeholders to specify and understand the individual rules for each issue repository. Similarly, for composition link types such as Incorporate and Split, the individual parts need to be fixed before the parent issue can be resolved. Workflow link types, such as Bonfire Testing, also need an in-depth investigation to correctly understand the context and workflow of each repository.

We found repository and issue properties that correlate with the prediction performance, particularly properties that describe the heterogeneity and quality of the dataset. Coverage (the share of issues that are part of at least one link) correlates positively with the performance of the baseline model and could be an indicator of higher data quality, making it easier for any model to learn. A higher coverage indicates that stakeholders try to link as many issues as possible or place a higher value on linking as a “best practice” to structure and manage the project knowledge [37]. Furthermore, the performance of individual link types differs across repositories. We think that two important factors requiring more extensive research are the link quality and the heterogeneity of a repository.
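As a reference, coverage can be computed directly from the raw link table. A minimal sketch with invented identifiers:

def coverage(issue_ids, links):
    """links: iterable of (source_id, target_id, link_type) triples."""
    linked = {issue for src, tgt, _ in links for issue in (src, tgt)}
    return len(linked & set(issue_ids)) / len(issue_ids)

issues = ["A-1", "A-2", "A-3", "A-4"]
links = [("A-1", "A-2", "Relate"), ("A-2", "A-3", "Subtask")]
print(coverage(issues, links))  # 0.75: three of the four issues are linked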

The link quality attributes (such as the share of multi-links, empty descriptions, or duplicated issues without links) correlate negatively with the performance of some of the models, as shown in Table 10.

The heterogeneity of issue trackers is an interesting factor for typed link prediction. The number of issues or links per maintainer correlates positively with the performance of the baseline: the more issues or links maintained by one person, the better the model performs. In other words, the fewer people involved, the less heterogeneity and the more standardization can likely be observed in the repository. Another indicator of heterogeneity is the number of projects, maintainers, and their components in a repository, which we examined in Sect. 3. We did not find a significant correlation between the performance of the model and the number of cross-project links, the number of projects, the number of maintainers, or the number of project- or maintainer-components. However, it might make sense to evaluate the prediction model per project or per cluster of projects in a repository instead, as links and their types inside a single project are likely to be more homogeneous than in the whole repository. Additionally, the usage of semantically overlapping link types (such as Depend and Block or Epic and Subtask) in a repository might point towards a certain heterogeneity. This could be an indicator of “similarly managed” projects. For instance, one group might mainly use Block for their issues, while another group uses Depend. The rationale of such practices should be investigated further.

Finally, this paper investigated horizontal linking (issues to issues), which interplays with vertical linking (issues to other artifacts such as commits). Schermann et al. [38] created a heuristic model to detect missing links between issues and commits. Commit conventions and missing commit links might be another interesting quality factor for the applicability of (typed) link prediction in practice. As high coverage seems to be a predictor of good typed link prediction, issues without any links to other issues or commits should be investigated.

7.4 Threats to validity

As we did not restrict the data used for training and testing, it is largely representative of the way stakeholders use issue links and their types in practice, at least in open-source settings. The only restriction we placed was the 1%-share lower bound for link types to be included in the training and testing data. As this threshold is very conservative, it should not introduce a bias that overestimates the quality of the model.
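A minimal sketch of this filtering step, assuming the links are held in a pandas DataFrame with a linktype column (layout and counts are illustrative):

import pandas as pd

links = pd.DataFrame({"linktype": ["Relate"] * 120 + ["Duplicate"] * 79
                                  + ["Bonfire Testing"]})  # 1 of 200 links (0.5%)
shares = links["linktype"].value_counts(normalize=True)
rare = shares[shares < 0.01].index               # link types below the 1% bound
links = links[~links["linktype"].isin(rare)]     # drops Bonfire Testing here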

We studied issue linking and evaluated link prediction strategies on 15 different popular JIRA repositories. Thus, our results and models are fairly generalizable to repositories that use JIRA. We did not study Bugzilla and GitHub data; the results and prediction models presented in this work might thus differ for other issue trackers. As the Bugzilla default link types (Relate, Duplicate, and Block) are a subset of the JIRA default types, we think that our models would achieve similar performances for Bugzilla. GitHub does not offer a dedicated link functionality; the model might therefore be harder to apply there, as a different labeling would be needed.

We discussed in detail potential quality issues and differences between the repositories that influence the prediction performance. The labels, i.e., the link types, are created by humans. They can contain incorrect links [39] and missing links. We cannot fully analyze these defects without help from the stakeholders who create and use these issue data. However, we evaluated simple indicators such as multi-links and missing Duplicate links. We saw that missing Duplicate links often influenced the prediction performance negatively. Additionally, we removed 1.1% of the data because it contained multiple links with different types between two issues, which is an indicator of incorrectly set links. After reviewing them manually, we noticed that such links were often conflicting. Moreover, as the percentage of multi-links was small, we chose to remove them, since they do not warrant the use of a multi-class multi-label classifier.
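The following sketch shows one way to drop such multi-links, assuming a pandas DataFrame of links; the data is invented and link direction is normalized so that reversed pairs count as the same pair:

import pandas as pd

links = pd.DataFrame({
    "source":   ["A-1", "A-2", "A-3"],
    "target":   ["A-2", "A-1", "A-4"],
    "linktype": ["Depend", "Block", "Relate"],
})
# Normalize direction: (A-1, A-2) and (A-2, A-1) become the same pair
links["pair"] = [tuple(sorted(p)) for p in zip(links["source"], links["target"])]
types_per_pair = links.groupby("pair")["linktype"].transform("nunique")
links = links[types_per_pair == 1]  # keeps only the unambiguous A-3/A-4 link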

We added randomly created Non-Links to the dataset. We chose to add as many as the mean count of the other link types to avoid a majority class that might bias the results. For the Only Linked prediction strategy, we added more Non-Links to create a balanced training set, which might overestimate performance compared to the other approaches. However, the other prediction strategies did not struggle to differentiate Non-Links from typed links. The sampling of the Non-Links from closed issues can have an effect on the performance. Additionally, the project or other aspects we did not study might have an effect.
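A sketch of this sampling step; the identifiers, the set of already-linked pairs, and the sample size are invented for illustration:

import random

def sample_non_links(closed_issues, linked_pairs, n):
    """Draw n random issue pairs that are not linked in the tracker."""
    non_links = set()
    while len(non_links) < n:
        a, b = random.sample(closed_issues, 2)
        pair = tuple(sorted((a, b)))
        if pair not in linked_pairs:  # must not already be linked
            non_links.add(pair)
    return [(a, b, "Non-Link") for a, b in non_links]

closed = [f"A-{i}" for i in range(1, 101)]   # invented closed issues
existing = {("A-1", "A-2"), ("A-3", "A-7")}  # invented linked pairs
print(len(sample_non_links(closed, existing, n=50)))  # 50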

Finally, another possible limitation is the evolution of the data over time in issue trackers. If a repository has been in use for a long time, the implicit definition of link types can change too. Issues themselves change over time, the links might not be updated accordingly, and certain link types might fall out of favor. We also did not split our training and test sets according to the creation time of the issues or links, which can potentially decrease the performance [40].

8 Related work

With the rise of agile, requirements knowledge [14] is often collected and tracked in issue trackers. This has led to large amounts of data in issue trackers, often too much to understand and handle manually. In a case study, Fucci et al. [22] interviewed ITS users and found that information overload is one of their biggest challenges. Interviewees of the study expressed the need for a requirements dependency identification functionality to reduce the overhead of discovering and documenting dependencies manually. Our work paves the way to tackle this need by studying typed link prediction in issue trackers in detail. Franch et al. [41] also aim at supporting agile practices in requirements elicitation and management with situational method engineering.

Requirements engineering research has largely studied the dependency between issues/requirements and software artifacts, a topic known as traceability [42, 43]. Our work focuses on horizontal traceability between issue reports. Similar to our work, Lin et al. [44] found that traceability links between issue descriptions and source code can be predicted with BERT, which outperforms traditional information retrieval methods, achieving F1-scores of 0.612 and 0.729. Typed link prediction faces similar application problems as traceability, such as the poor data quality in issue trackers found by Merten et al. [45]. Additionally, Seiler et al. [46] conducted an interview study about the problems of feature requests in issue trackers and found that unclear feature descriptions and insufficient traceability are among the major issues in practice. Our findings about the heterogeneity of linking practices among the repositories point in the same direction.

Concerning issue link prediction, Duplicate is the most widely researched type, as duplicate detection is a tedious task [21] when curating issue trackers. Deshmukh et al. [15] proposed a single-channel siamese network approach with triplet loss, which achieves an accuracy close to 90% and a recall close to 80%. He et al. [16] proposed a dual-channel approach and achieved an accuracy of up to 97%. Rocha et al. [47] created a model that treats all “Duplicate” issues as different descriptions of the same issue and report a Recall@25 of 85% for retrieval and an AUROC of 84% for classification. All three works [15, 16, 47] use the dataset provided by Lazar et al. [18], containing data mined from four open-source Bugzilla systems: Eclipse, Mozilla, NetBeans, and OpenOffice. Our work focuses on heterogeneous link types and linking practices as observed in one of the most popular issue trackers, JIRA.

Other studies have researched further link types between issues as well as link usage in general. Thompson et al. [20] studied three open-source systems and analyzed how software developers use work breakdown relationships between issues in JIRA. They observed little consistency in the supported relationships. Li et al. [13] examined the issue linking practices in GitHub and extracted emerging linking patterns. They categorized link types into six categories: “Dependent”, “Duplicate”, “Relevant”, “Referenced”, “Fixed”, and “Enhanced”. All rare link types were assigned the category “Other”. They discovered patterns useful for automatic classification: “Referenced” links usually refer to historic comments with important knowledge, and “Duplicate” links are usually marked within the same day. Tomova et al. [48] studied seven open-source systems and reported that the rationale behind the choice of a specific link type is not always obvious. The authors also found that Clone links are indicative of textual similarity, while issues linked through a Relate link present varying degrees of textual similarity and thus require further contextual information to be accurately identified. We observed similar trends and hypothesized that Relate links might be a jack-of-all-trades type. Furthermore, while we found that some link types have distinct textual similarities, they are not unequivocally identifiable based on textual similarity alone.

Unlike our approach, most previous works view link types in isolation. “Requires” and “Refines” links were examined by Deshpande et al. [49], who extracted dependencies on two industrial datasets, achieving an F1-score of at least 75% on both. Block links were examined by Cheng et al. [50] based on the dataset of Lazar et al. [18]; they predicted the Block link type with an F1-score of 81% and an AUC of 97.5%. Deshpande et al. [51] also compared a BERT model with a Random Forest model for requirement dependency classification. They went beyond the F1-score and also examined the return on investment (ROI) of both models. In our previous experiment, traditional models only achieved an average macro F1-score of 0.27 [24] on the dataset, compared with the 0.63 of the BERT model. However, for the prediction strategy Only Linked, it might be worthwhile to compare the ROI with traditional machine learning models.

We recently evaluated state-of-the-art duplicate detection models on the same dataset used in this work and found that these models struggle to distinguish Duplicate from other link types [27]. In this work, we focus on predicting multiple user-defined link types. We present and evaluate a BERT prediction model for typed links as used in practice. Nicholson et al. [52, 53] also researched typed link prediction between issues in Apache. They analyzed the link types and tried to find patterns to predict missing links [52]. They evaluated several traditional machine learning approaches and achieved weighted F1-scores between 0.563 and 0.692 across the three projects HIVE, FLEX, and AMBARI [53].

We found that data quality affects typed link prediction as well: links are harder to classify if issue descriptions are vague or contain other defects. Dalpiaz et al. [54] created an approach to improve the quality of user stories, a type of issue, by removing linguistic defects. Recently, Qamar et al. [34] investigated bug-tracking smells and Tamburri et al. [35] community smells; these smells potentially reflect underlying quality issues. Link quality is directly affected by requirements quality.

9 Conclusion

We investigated the characteristics of issue linking in open-source issue repositories and found nine popular link types. Some repositories tend to prefer specific link types, like Epic and Subtask, while others prefer Relate or Duplicate links. Furthermore, we explored the project, maintainer, and link structures. We found that some repositories are more fragmented than others. Lastly, we investigated inconsistencies in linking and found that Duplicate links are often missing and that many links are discussed in comments without being set explicitly in the report. Our findings highlight the diversity and heterogeneity of issue tracking and issue linking in practice, something that should be considered when designing scientific studies and building tool support.

Training a BERT model on the titles and descriptions of issue pairs, we achieved good performance predicting typed issue links on most studied repositories, and consistently excellent performance for predicting Epic and Subtask links. We evaluated four other task settings: removing Relate links, as they are often confused by the baseline; predicting only link types; grouping link types and predicting categories; and predicting the absence or existence of a link. Predicting only whether issues are linked achieves F1-scores from 0.89 up to 0.99.

Our detailed analysis revealed that better understanding the data and improving the issue quality (for training and prediction) and the link quality (for training) will likely further improve the prediction performance. Future work should carefully consider data quality when building issue prediction models and focus on further understanding and exploring the underlying data as well as the heterogeneous issue management practices. For this, qualitative and experimental research is needed to understand how and why stakeholders use the links and link types.