Automated issue assignment: results and insights from an industrial case

Abstract

We automate the process of assigning issue reports to development teams by using data mining approaches and share our experience gained by deploying the resulting system, called IssueTAG, at Softtech. Being a subsidiary of the largest private bank in Turkey, Softtech on average receives 350 issue reports daily from the field, which need to be handled with utmost importance and urgency. IssueTAG has been making all the issue assignments at Softtech since its deployment on Jan 12, 2018. Deploying IssueTAG presented us not only with an unprecedented opportunity to observe the practical effects of automated issue assignment, but also with an opportunity to carry out user studies, both of which (to the best of our knowledge) have not been done before in this context. We first empirically determine the data mining approach to be used in IssueTAG. We then deploy IssueTAG and make a number of valuable observations. First, it is not just about deploying a system for automated issue assignment, but also about designing/changing the assignment process around the system. Second, the accuracy of the assignments does not have to be higher than that of manual assignments in order for the system to be useful. Third, deploying such a system requires the development of additional functionalities, such as creating human-readable explanations for the assignments and detecting deteriorations in assignment accuracies, for both of which we have developed and empirically evaluated different approaches. Last but not least, stakeholders do not necessarily resist change and gradual transition helps build confidence.

Notes

  1. https://softtech.com.tr

  2. https://www.isbank.com.tr

  3. https://www.ibm.com/products/maximo

  4. https://www.atlassian.com/software/jira

References

  • Ahsan SN, Ferzund J, Wotawa F (2009) Automatic software bug triage system (bts) based on latent semantic indexing and support vector machine. In: 2009 fourth international conference on software engineering advances. IEEE, pp 216–221

  • Alenezi M, Magel K, Banitaan S (2013) Efficient bug triaging using text mining. JSW 8(9):2185–2190

  • Antoniol G, Ayari K, Di Penta M, Khomh F, Guéhéneuc YG (2008) Is it a bug or an enhancement?: A text-based approach to classify change requests. In: CASCON, vol 8, pp 304–318

  • Anvik J (2007) Assisting bug report triage through recommendation. PhD thesis, University of British Columbia

  • Anvik J, Murphy GC (2011) Reducing the effort of bug report triage: recommenders for development-oriented decisions. ACM Transactions on Software Engineering and Methodology (TOSEM) 20(3):10

  • Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug?. In: Proceedings of the 28th international conference on software engineering. ACM, pp 361–370

  • Baysal O, Godfrey MW, Cohen R (2009) A bug you like: a framework for automated assignment of bugs. In: 2009 IEEE 17th international conference on program comprehension. IEEE, pp 297–298

  • Bettenburg N, Just S, Schröter A, Weiss C, Premraj R, Zimmermann T (2008a) What makes a good bug report?. In: Proceedings of the 16th ACM SIGSOFT international symposium on foundations of software engineering. ACM, pp 308–318

  • Bettenburg N, Premraj R, Zimmermann T, Kim S (2008b) Duplicate bug reports considered harmful... really?. In: 2008 IEEE international conference on software maintenance. IEEE, pp 337–345

  • Bhattacharya P, Neamtiu I, Shelton CR (2012) Automated, highly-accurate, bug assignment using machine learning and tossing graphs. J Syst Softw 85(10):2275–2292

  • Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin

  • Breiman L (2001) Random forests. Machine Learning 45(1):5–32

  • Breiman L (2017) Classification and regression trees. Routledge, Evanston

  • Canfora G, Cerulo L (2006) Supporting change request assignment in open source development. In: Proceedings of the 2006 ACM symposium on Applied computing. ACM, pp 1767–1772

  • Chen L, Wang X, Liu C (2011) An approach to improving bug assignment with bug tossing graphs and bug similarities. JSW 6(3):421–427

  • Dedík V, Rossi B (2016) Automated bug triaging in an industrial context. In: 2016 42th euromicro conference on software engineering and advanced applications (SEAA). IEEE

  • Giger E, Pinzger M, Gall H (2010) Predicting the fix time of bugs. In: Proceedings of the 2nd international workshop on recommendation systems for software engineering. ACM, pp 52–56

  • Helming J, Arndt H, Hodaie Z, Koegel M, Narayan N (2010) Automatic assignment of work items. In: International conference on evaluation of novel approaches to software engineering. Springer, pp 236–250

  • Hocking TD, Schleiermacher G, Janoueix-Lerosey I, Delattre O, Bach F, Vert JP (2013) Learning smoothing models of copy number profiles using breakpoint annotations. BMC Bioinformatics 14:164

  • Jalbert N, Weimer W (2008) Automated duplicate detection for bug tracking systems. In: 2008 IEEE international conference on dependable systems and networks with FTCS and DCC (DSN). IEEE, pp 52–61

  • Jeong G, Kim S, Zimmermann T (2009) Improving bug triage with bug tossing graphs. In: Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering. ACM, pp 111–120

  • Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning (ECML’98). Springer, pp 137–142

  • Jonsson L, Borg M, Broman D, Sandahl K, Eldh S, Runeson P (2016) Automated bug assignment: ensemble-based machine learning in large scale industrial contexts. Empir Softw Eng 21(4):1533–1578

  • Kagdi H, Gethers M, Poshyvanyk D, Hammad M (2012) Assigning change requests to software developers. Journal of Software: Evolution and Process 24(1):3–33

  • Killick R, Fearnhead P, Eckley IA (2012) Optimal detection of changepoints with a linear computational cost. J Am Stat Assoc 107(500):1590–1598

  • Lamkanfi A, Demeyer S, Giger E, Goethals B (2010) Predicting the severity of a reported bug. In: 2010 7th IEEE working conference on mining software repositories (MSR 2010). IEEE, pp 1–10

  • Lavielle M, Ere G (2007) Adaptive detection of multiple change-points in asset price volatility. Long Memory in Economics

  • Lin Z, Shu F, Yang Y, Hu C, Wang Q (2009) An empirical study on bug assignment automation using chinese bug data. In: 2009 3rd international symposium on empirical software engineering and measurement. IEEE, pp 451–455

  • Linares-Vásquez M, Hossen K, Dang H, Kagdi H, Gethers M, Poshyvanyk D (2012) Triaging incoming change requests: bug or commit history, or code authorship?. In: 2012 28th IEEE international conference on software maintenance (ICSM). IEEE, pp 451–460

  • Manning C, Raghavan P, Schütze H (2010) Introduction to information retrieval. Nat Lang Eng 16(1):100–103

  • Matter D, Kuhn A, Nierstrasz O (2009) Assigning bug reports using a vocabulary-based expertise model of developers. In: 2009 6th IEEE international working conference on mining software repositories. IEEE, pp 131–140

  • Menzies T, Marcus A (2008) Automated severity assessment of software defect reports. In: 2008 IEEE international conference on software maintenance. IEEE, pp 346–355

  • Murphy G, Cubranic D (2004) Automatic bug triage using text categorization. In: Proceedings of the sixteenth international conference on software engineering & knowledge engineering. Citeseer

  • Nagwani NK, Verma S (2012) Predicting expert developers for newly reported bugs using frequent terms similarities of bug attributes. In: 2011 ninth international conference on ICT and knowledge engineering. IEEE, pp 113–117

  • Pandey N, Sanyal DK, Hudait A, Sen A (2017) Automated classification of software issue reports using machine learning techniques: an empirical study. Innov Syst Softw Eng 13(4):279–297

  • Park JW, Lee MW, Kim J, Sw Hwang, Kim S (2011) Costriage: a cost-aware triage algorithm for bug reporting systems. In: Twenty-Fifth AAAI conference on artificial intelligence

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. (2011) Scikit-learn: machine learning in python. Journal of Machine Learning Research 12 (Oct):2825–2830

  • Podgurski A, Leon D, Francis P, Masri W, Minch M, Sun J, Wang B (2003) Automated support for classifying software failure reports. In: 25th international conference on software engineering. 2003, Proceedings. https://doi.org/10.1109/ICSE.2003.1201224, pp 465–475

  • Pressman RS (2005) Software engineering: a practitioner’s approach. Palgrave Macmillan

  • Raschka S (2018) Mlxtend: providing machine learning and data science utilities and extensions to python’s scientific computing stack. J Open Source Software 3(24):638

  • Ribeiro MT, Singh S, Guestrin C (2016) Why should i trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1135–1144

  • Shokripour R, Kasirun ZM, Zamani S, Anvik J (2012) Automatic bug assignment using information extraction methods. In: 2012 international conference on advanced computer science applications and technologies (ACSAT). IEEE, pp 144–149

  • Tamrawi A, Nguyen TT, Al-Kofahi JM, Nguyen TN (2011) Fuzzy set and cache-based approach for bug triaging. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on foundations of software engineering. ACM, pp 365–375

  • Ting KM, Witten IH (1999) Issues in stacked generalization. Journal of Artificial Intelligence Research 10:271–289

  • Truong C, Oudre L, Vayatis N (2018a) Ruptures: change point detection in python. arXiv:180100826

  • Truong C, Oudre L, Vayatis N (2018b) Selective review of offline change point detection methods. arXiv:180100718

  • Wang X, Zhang L, Xie T, Anvik J, Sun J (2008) An approach to detecting duplicate bug reports using natural language and execution information. In: Proceedings of the 30th international conference on software engineering. ACM, pp 461–470

  • Weiss C, Premraj R, Zimmermann T, Zeller A (2007) How long will it take to fix this bug?. In: Fourth international workshop on mining software repositories (MSR’07: ICSE Workshops 2007). IEEE, pp 1–1

  • Wolpert DH (1992) Stacked generalization. Neural Networks 5 (2):241–259

  • Wu W, Zhang W, Yang Y, Wang Q (2011) Drex: developer recommendation with k-nearest-neighbor search and expertise ranking. In: 2011 18th Asia-Pacific software engineering conference. IEEE, pp 389–396

  • Xia X, Lo D, Wang X, Zhou B (2013) Accurate developer recommendation for bug resolution. In: 2013 20th working conference on reverse engineering (WCRE). IEEE, pp 72–81

  • Xie X, Zhang W, Yang Y, Wang Q (2012) Dretom: developer recommendation based on topic models for bug resolution. In: Proceedings of the 8th international conference on predictive models in software engineering. ACM, pp 19–28

  • Zhang H, Gong L, Versteeg S (2013) Predicting bug-fixing time: an empirical study of commercial software projects. In: Proceedings of the 2013 international conference on software engineering. IEEE Press, pp 1042–1051

Author information

Corresponding author

Correspondence to Ethem Utku Aktas.

Communicated by: Per Runeson

Appendices

Appendix A: Evaluating Existing Issue Assignment Approaches

In this section, we discuss the details of the studies we have carried out to determine the issue assignment approach to be used by IssueTAG.

A.1 Approach

We have evaluated a number of classification-based approaches, each of which had been shown to be effective for automated issue assignment (Murphy and Cubranic 2004; Anvik et al. 2006; Bhattacharya et al. 2012; Anvik and Murphy 2011; Jonsson et al. 2016), by using the issue database maintained by Softtech since December 2016.

A.1.1 Representing Issue Reports

Given an issue report, we first combine the “description” and “summary” parts of the report and tokenize the combined text into terms. We then remove the non-letter characters, such as punctuation marks, as well as the stop words, such as “the” and “a,” which are extremely common words of little value in classifying issue reports (Manning et al. 2010). We opt not to apply stemming in this work, as an earlier work suggests that stemming has little effect (if any at all) on issue assignments (Murphy and Cubranic 2004), which is also consistent with the results of our initial studies, where stemming slightly reduced the assignment accuracies.
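
As a rough sketch of these preprocessing steps (not the production IssueTAG code; the stop-word list and report texts below are made up for illustration):

```python
import re

# A small, hypothetical stop-word list; a real system would use a
# standard English list (e.g., as described in Manning et al. 2010).
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}

def preprocess(summary: str, description: str) -> list:
    """Combine summary and description, tokenize, and clean."""
    text = f"{summary} {description}".lower()
    # Drop non-letter characters (punctuation, digits) before tokenizing.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Tokenize on whitespace and remove stop words; no stemming is
    # applied, mirroring the choice made in the study.
    return [t for t in text.split() if t not in STOP_WORDS]

tokens = preprocess("Login fails", "The login page returns an error 500.")
# → ['login', 'fails', 'login', 'page', 'returns', 'error']
```

The surviving terms then feed into the tf-idf vectorization described next.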

We then represent an issue report as an n-dimensional vector. Each element in this vector corresponds to a term and the value of the element depicts the weight (i.e., “importance”) of the term for the report. The weights are computed by using the well-known tf-idf method (Manning et al. 2010).

The tf-idf method combines two scores: term frequency (tf) and inverse document frequency (idf). For a given term t and an issue report r, the term frequency tf_{t,r} is the number of times t appears in r. The more t appears in r, the larger tf_{t,r} is. The inverse document frequency of t, idf_t, on the other hand, is:

$$ idf_{t}=\log(\frac{N}{df_{t}}), $$
(1)

where N is the total number of issue reports and df_t is the number of issue reports in which t appears. The fewer the issue reports t appears in, the larger idf_t is.

Given tf_{t,r} and idf_t, the tf-idf score of the term t for the issue report r is computed as follows:

$$ \text{tf-idf}_{t, r}= tf_{t, r} * idf_{t}. $$
(2)

Consequently, the more a term t appears in an issue report r and the less it appears in other issue reports, the more important t becomes for r, i.e., the larger tf-idf_{t,r} is.
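
Equations (1) and (2) can be implemented directly. The sketch below (with made-up issue report tokens) computes the raw-count tf and the log-scaled idf exactly as defined above; note that library implementations such as scikit-learn’s TfidfVectorizer use slightly smoothed variants of these formulas:

```python
import math
from collections import Counter

def tf_idf_vectors(reports):
    """Compute tf-idf weights per Eqs. (1)-(2): tf_{t,r} is the raw
    count of t in r, and idf_t = log(N / df_t)."""
    N = len(reports)
    # df_t: number of reports in which term t appears at least once.
    df = Counter()
    for tokens in reports:
        df.update(set(tokens))
    vectors = []
    for tokens in reports:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors

docs = [["login", "error", "login"], ["timeout", "error"], ["login", "page"]]
vecs = tf_idf_vectors(docs)
# "error" appears in 2 of the 3 reports, so its idf is log(3/2);
# "login" appears twice in the first report, doubling its weight there.
```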

A.1.2 Issue Assignments

Once an issue report is represented as an ordered vector of tf-idf scores, the problem of issue assignment is cast as a classification problem. In particular, the development team to which the issue report should be assigned becomes the class to be predicted, and the tf-idf scores of the report become the attributes on which the classification is based.

We train two types of classifiers: level-0 and level-1 classifiers. A level-0 classifier is an individual classifier. A level-1 classifier, on the other hand, is obtained by combining multiple level-0 classifiers using stacked generalization – an ensemble technique for combining multiple individual classifiers (Wolpert 1992). All the classifiers we experiment with in this study have been shown to be effective for automated issue assignment (Murphy and Cubranic 2004; Anvik et al. 2006; Bhattacharya et al. 2012; Anvik and Murphy 2011; Jonsson et al. 2016).

For the level-0 classifiers, we use multinomial naive bayesian (Manning et al. 2010), decision tree (Breiman 2017), k-nearest neighbor (Manning et al. 2010), logistic regression (Bishop 2006), random forest (Breiman 2001), and linear support vector classifiers (SVCs) (Joachims 1998).

For the level-1 classifiers, we first train and evaluate our level-0 classifiers by using the same training and test sets for each classifier. We then use the prediction results obtained from these level-0 classifiers to train a level-1 classifier, which combines the probabilistic predictions of the level-0 classifiers using linear logistic regression (Wolpert 1992).

Inspired by Jonsson et al. (2016), we train two types of level-1 classifiers: BEST and SELECTED. The BEST ensemble is comprised of the k (in our case, k = {3, 5}) level-0 classifiers with the highest assignment accuracies. The SELECTED ensemble, on the other hand, is comprised of a diversified set of k (in our case, k = {3, 5}) level-0 classifiers. More specifically, the SELECTED ensemble includes level-0 classifiers that are selected regardless of their classification accuracies, so that the errors of the individual classifiers can be averaged out by better spanning the learning space (Wolpert 1992). Note that the BEST and SELECTED ensembles are not necessarily the same because the best-performing level-0 classifiers may not form the most diversified set of classifiers. The exact compositions of these ensembles are given in Appendix A.2.1.
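
A stacked ensemble of this kind can be sketched with scikit-learn’s StackingClassifier (the study itself used the mlxtend package for the level-1 models; the synthetic data below merely stands in for the tf-idf vectors and team labels):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for tf-idf vectors: 3 classes play the role of teams.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Level-0 classifiers; the linear SVC is calibrated so that it can emit
# the class probabilities that stacking requires.
level0 = [
    ("svc", CalibratedClassifierCV(LinearSVC(max_iter=5000))),
    ("knn", KNeighborsClassifier()),
]

# Level-1 combiner: linear logistic regression over the level-0
# probabilistic predictions (stacked generalization, Wolpert 1992).
ensemble = StackingClassifier(estimators=level0,
                              final_estimator=LogisticRegression(max_iter=1000),
                              stack_method="predict_proba")
ensemble.fit(X_tr, y_tr)
accuracy = ensemble.score(X_te, y_te)
```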

Furthermore, for the baseline classifier, which we use to estimate the baseline classification accuracy for our classifiers, we assign all issue reports to the team that has been assigned the highest number of issue reports. That is, our baseline classifier always returns the class with the highest number of instances as its prediction.
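
The baseline is straightforward to express in code. The sketch below (with hypothetical team names) always predicts the majority class; scikit-learn’s DummyClassifier(strategy="most_frequent") behaves the same way:

```python
from collections import Counter

class MajorityBaseline:
    """Always predicts the class (team) with the most training instances."""
    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_] * len(X)

# Toy issue-to-team history: "payments" closed the most reports.
teams = ["payments", "payments", "payments", "cards", "loans"]
clf = MajorityBaseline().fit([None] * 5, teams)
clf.predict([None, None])  # → ['payments', 'payments']
```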

A.2 Evaluation

We have conducted a series of experiments to evaluate the assignment accuracies of the level-0 and level-1 classifiers.

A.2.1 Experimental Setup

In these experiments, we used the issue reports submitted to Softtech between June 1, 2017 and November 30, 2017 as the training set and the issue reports submitted in December 2017 as the test set. We picked this time frame because it was a representative period in terms of the number of issue reports submitted, the number of teams present, and the distribution of the reported issues to these teams. Furthermore, the beginning of this time frame coincides with the time when significant changes in the organization of the development teams were internalized by the stakeholders (i.e., the second vertical line in Fig. 3). As discussed in Section 4.2.3, the reorganization was caused by migrating an integral part of the core banking system from mainframes to state-of-the-art hardware and software platforms. More specifically, the training data started from June 1, 2017, whereas the aforementioned event occurred on June 16, 2017.

For the aforementioned time frame, we had a total number of 51041 issue reports submitted to 65 different teams. Among all the issue reports of interest in this section as well as in the remainder of the paper, we only used the ones that were marked as “closed,” indicating that the reported issues had been validated and resolved. Furthermore, as the correct assignment for an issue report, we used the development team that had closed the report. The remainder of the issue reports were ignored as it was not yet certain whether these reports were valid or whether the development teams, to which they were currently assigned, were correct. After this filtering, a total of 47123 issue reports submitted to 64 different development teams remained for analysis in this study.

To create the level-1 classifiers, we combined 3 or 5 individual classifiers, i.e., k = 3 or k = 5. We used the latter setting as it was also the setting used in a recent work (Jonsson et al. 2016). We used the former setting as it was the best setting we could empirically determine for ensemble learning, i.e., the one that produced the best assignment accuracies. In the remainder of the paper, these models are referred to as BEST-3, SELECTED-3, BEST-5, and SELECTED-5.

The BEST-3 and BEST-5 models were obtained by combining Linear SVC-Calibrated, Logistic Regression, and K-Neighbours; and Linear SVC-Calibrated, Logistic Regression, K-Neighbours, Random Forest, and Decision Tree classifiers, respectively, as these were the classifiers providing the best assignment accuracies. The SELECTED-3 and SELECTED-5 models, on the other hand, were created with the goal of increasing the diversity of the ensembled classification algorithms. In particular, the SELECTED-3 model was obtained by combining Linear SVC-Calibrated, K-Neighbours, and Multinomial Naive Bayesian classifiers, and the SELECTED-5 model was obtained by combining Linear SVC-Calibrated, Logistic Regression, K-Neighbours, Random Forest, and Multinomial Naive Bayesian classifiers. Note further that to include SVCs in the level-1 classifiers, we used calibrated linear SVCs instead of plain linear SVCs, as ensembling individual classifiers requires class probabilities (Ting and Witten 1999), which plain linear SVCs do not provide.
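
This calibration step can be sketched with scikit-learn (used in the study for the level-0 classifiers): a plain LinearSVC exposes only decision values, so CalibratedClassifierCV wraps it to produce the class probabilities that stacking needs. The data below is synthetic, for illustration only:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

svc = LinearSVC(max_iter=5000).fit(X, y)
assert not hasattr(svc, "predict_proba")  # plain linear SVC: no probabilities

# Sigmoid calibration (Platt scaling), fitted with internal
# cross-validation, turns the SVC's decision values into probabilities.
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=5000)).fit(X, y)
proba = calibrated.predict_proba(X[:3])
# each row of `proba` sums to 1 across the classes
```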

The classifiers were trained and evaluated by using the scikit-learn (for level-0 classifiers) (Pedregosa et al. 2011) and mlxtend (for level-1 classifiers) (Raschka 2018) packages. All of the classifiers (unless otherwise stated) were configured with the default settings and the experiments were carried out on a dual-core Intel(R) Xeon(R) E5-2695 v4 2.10 GHz machine with 32 GB of RAM running Windows Server 2012 R2 as the operating system.

A.2.2 Evaluation Framework

To evaluate the quality of the assignments obtained from different classifiers, we used well-known metrics, namely accuracy and weighted precision, recall, and F-measure (Manning et al. 2010). Accuracy, which is also referred to as assignment accuracy in the remainder of the paper, is computed as the ratio of correct issue assignments. Precision for a particular development team (i.e., class) is the ratio of the issue reports that are correctly assigned to the team to the total number of issue reports assigned to the team. Recall for a team is the ratio of the issue reports that are correctly assigned to the team to the total number of issue reports that should have been assigned to the team. F-measure is then computed as the harmonic mean of precision and recall, giving equal importance to both metrics. Note that each of these metrics takes on a value between 0 and 1 inclusive. The larger the value, the better the assignments are. Furthermore, we report the results obtained by both carrying out 10-fold cross validation on the training data and carrying out the analysis on the test set.
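
These metrics are all available off the shelf. The sketch below (with made-up team labels standing in for real assignments) computes the accuracy and the weighted precision, recall, and F-measure using scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical team labels: ground truth vs. classifier assignments.
y_true = ["payments", "cards", "payments", "loans", "cards", "payments"]
y_pred = ["payments", "cards", "cards",    "loans", "cards", "payments"]

accuracy = accuracy_score(y_true, y_pred)  # ratio of correct assignments

# 'weighted' averages the per-team precision/recall/F-measure, weighting
# each team by its number of true instances (its support).
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
```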

To evaluate the cost of creating the classification models, we measured the time it took to train the models. The smaller the training time, the better the approach is.

A.2.3 Data and Analysis

Table 10 summarizes the results we obtained. We first observed that all the classifiers we trained performed better than the baseline classifier. While the baseline classifier provided an accuracy of 0.10 on the training set and 0.12 on the test set, those of the worst-performing classifier were 0.47 and 0.52, respectively.

We then observed that the SELECTED ensembles generally performed similarly to or better than the BEST ensembles, supporting the conjecture that using a diversified set of classifiers in an ensemble can help improve accuracy by better spanning the learning space. For example, while the accuracy of the BEST-5 ensemble was 0.67 on the training set and 0.64 on the test set, those of the SELECTED-5 ensemble were 0.80 and 0.78, respectively. Furthermore, the ensembles created by using 3 level-0 classifiers, rather than 5, performed slightly better on our data set. For example, while the accuracy of the SELECTED-5 ensemble was 0.80 on the training set and 0.78 on the test set, those of the SELECTED-3 ensemble were 0.81 and 0.79, respectively.

Last but not least, among all the classifiers, the one that provided the best assignment accuracy (as well as the best F-measure) and did so at a fraction of the cost, was the linear SVC classifier (Table 10). While the linear SVC classifier provided an accuracy of 0.82 on the training data set and 0.80 on the test set with a training time of about three minutes, the runner-up classifiers, namely the SELECTED-3 and BEST-3 ensembles, provided the accuracies of 0.81 and 0.79, respectively, with a training time of about half an hour or more.

Based on both the assignment accuracies and the costs of training obtained from various classifiers using our data set, we have decided to employ linear SVC in IssueTAG.

Appendix B: Time Locality and Amount of Training Data

In this section, we discuss the details of the studies we carried out to determine the time locality of the issue reports required for preparing the training data every time the underlying classification model needs to be trained.

B.1 Approach

To carry out the study, we use the sliding window and cumulative window approaches introduced in Jonsson et al. (2016). More specifically, we conjecture that using issue reports from the “recent past” to train the prediction models, as opposed to using those from the “distant past,” can provide better assignment accuracies, since organizations, products, teams, and issues may change over time.

To evaluate this hypothesis, we take a long period of time T (in our case, 13 months) and divide it into a consecutive list of calendar months \({T=[m_{1}, m_{2}, \dots ]}\). For every month \(m_i \in T\), we train and evaluate a linear SVC model. To this end, we use all the issue reports submitted in the month \(m_i\) as the test set and all the issue reports submitted in the month \(m_j\) as the training set, where \(i - j = {\varDelta}\), i.e., the sliding window approach in Jonsson et al. (2016). Note that given \(m_i\) and \({\varDelta}\), \(m_j\) is the month that is \({\varDelta}\) months before \(m_i\). For every month \(m_i \in T\), we repeat this process for each possible value of \({\varDelta}\) (in our case, \({\varDelta } \in \{1, \dots , 12\}\)). By fixing the test set and varying the training sets, such that they come from different historical periods, we aim to measure the effect of the time locality of the training data on the assignment accuracies.

Figure 10 illustrates the sliding window approach using the period of time from Jan 1, 2017 to Jan 31, 2018. For example, for the month of Jan 2018, we train a total of 12 classification models, each of which was trained by using all the issue reports submitted in a distinct month of 2017 (marked as Train1-1, Train1-2, \(\dots \), Train1-12) and separately test these models using all the issue reports submitted in the month of Jan, 2018 as the test set (marked as Test1). We then repeat this process for every month in the time period of interest, except for Jan 2017 as it does not have any preceding months. That is, for Dec 2017 (marked as Test2), we train and evaluate 11 models (marked as Train2-1, Train2-2, \(\dots \)), for Nov 2017, we train and evaluate 10 models, etc.

Fig. 10 Overview of the sliding window approach to study the effect of the time locality of training data on assignment accuracies

To evaluate the effect of the amount of training data on the assignment accuracies, we use a related approach, called the cumulative window approach (Jonsson et al. 2016). This approach, as is the case with the sliding window approach, divides a period of interest T into a consecutive list of months \({T=[m_{1}, m_{2}, \dots ]}\). Then, for every possible pair of \(m_i \in T\) and \({\varDelta}\), we train and evaluate a classification model, where all the issue reports submitted in the month \(m_i\) are used as the test set and all the issue reports submitted in the preceding \({\varDelta}\) months, i.e., \(\{m_j \in T \mid 1 \leq i - j \leq {\varDelta}\}\), are used as the training set.

Figure 11 illustrates the approach. For example, for the month of Jan 2018, we train a total of 12 classification models. The first model is created by using the previous month’s data (marked as Train1-1), the second model is created by using the previous two months’ data (marked as Train1-2), and the last model is created by using the previous year’s data (marked as Train1-12). The same process is repeated for every possible month in the period of interest.

Fig. 11 Overview of the cumulative window approach to study the effect of the amount of training data on assignment accuracies

B.2 Evaluation

We conducted a series of experiments to evaluate the effect of the amount and time locality of training data on assignment accuracies.

B.2.1 Experimental Setup

In these experiments, we used all the issue reports that were submitted during the period from Jan 1, 2017 to Jan 31, 2018. The summary statistics for this data set can be found in Table 11. All told, we trained and evaluated a total of 144 linear SVC models for this study. All the experiments were carried out on the same platform as in the previous study (Appendix A).

Table 11 Number of issue reports submitted

B.2.2 Evaluation Framework

We used the assignment accuracies (Appendix A) for evaluations.

B.2.3 Data and Analysis

Figures 12 and 13 present the results we obtained from the sliding window and cumulative window approaches, respectively. In these figures, the vertical and horizontal axes depict the assignment accuracies obtained and the Δ values used in the experiments, respectively. The accuracies associated with a Δ value were obtained from the classification models, each of which was created for a distinct month in the period of interest by using that same Δ value. Furthermore, the polynomials in the figures are second-degree polynomials fitted to the data.

Fig. 12 Assignment accuracies obtained from the sliding window approach

Fig. 13 Assignment accuracies obtained from the cumulative window approach

Looking at Fig. 12, we first observed that using issue reports from recent past to train classification models, rather than the ones from distant past, provided better assignment accuracies; the accuracies tended to decrease as Δ increased. For example, while the average assignment accuracy obtained when Δ = 1, i.e., when the issue reports submitted in the immediate preceding months were used as the training sets, was 0.73, that obtained when Δ = 12, i.e., when the issue reports submitted in Jan 2017 were used as the training set for the issue reports submitted in Jan 2018, was 0.52.

Looking at Fig. 13, we then observed that as we went back in time to collect the training data starting from the immediate preceding months (i.e., as Δ increased in the cumulative window approach), the assignment accuracies tended to increase first and then stabilized around a year of training data. For example, while the average accuracy obtained when Δ = 1, i.e., when the issue reports submitted only in the immediate preceding months were used as the training sets, was 0.73, that obtained when Δ = 12, i.e., when all the issue reports submitted in the preceding 12 months were used as the training data set, was 0.82.

Note that in this study, we were solely concerned with the assignment accuracy when choosing the time locality of the training data. This was mainly because training the linear SVC models in our case was not costly at all; the differences between the training times for various amounts of training data were practically negligible. More specifically, the minimum, average, and maximum training times we observed across all the experiments carried out in this section were 0.3, 3.4, and 10.8 minutes, respectively, where the minimum, average, and maximum numbers of issue reports used in these experiments were 11366, 37149, and 86348, respectively. However, if training times are not negligible, then the cost of training may vary greatly depending on the amount of training data used (e.g., the window size chosen). In such cases, assignment accuracies and training times should be balanced according to the requirements of the project when choosing the time locality of the training data.

Based on the results of these studies, to train a prediction model at a given point in time, we decided to use all the issue reports that have been submitted in the last 12 months as the training set. Clearly, among all the issue reports of interest, we filter out the ones that have not yet been closed (Appendix A).

Cite this article

Aktas, E.U., Yilmaz, C. Automated issue assignment: results and insights from an industrial case. Empir Software Eng 25, 3544–3589 (2020). https://doi.org/10.1007/s10664-020-09846-3

Keywords

  • Bug triaging
  • Issue report assignment
  • Text classification
  • Accountable machine learning
  • Change point detection