Abstract
Severity levels (e.g., critical and minor) of bug reports are often used to prioritize development efforts. Prior research has proposed approaches to automatically assign a severity label to a bug report, and has verified the accuracy of these approaches using the human-assigned severity labels stored in software repositories. However, all prior efforts assume that such human-assigned data is reliable, so a perfect automated approach should assign the same severity label as the one in the repository, achieving 100% accuracy. Looking at duplicate bug reports (i.e., reports referring to the same problem) from three open-source software systems (OpenOffice, Mozilla, and Eclipse), we find that around 51% of the duplicate bug reports have inconsistent human-assigned severity labels even though they refer to the same software problem. While our results only show that duplicate bug reports have unreliable severity labels, we believe that they send warning signals about the reliability of the full bug severity data (i.e., including non-duplicate reports). Future research efforts should explore whether our findings generalize to the full dataset, and should factor in the unreliable nature of the bug severity data. Given this unreliability, classical metrics for assessing the accuracy of models/learners should not be used to assess approaches for automatically assigning severity labels. Hence, we propose a new approach to assess the performance of such models. Our new assessment approach shows that current automated approaches perform well, reaching 77–86% agreement with human-assigned severity labels.
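For illustration, the following is a minimal sketch of how the inconsistency rate among duplicate reports could be computed. The group structure and labels below are hypothetical, not the paper's dataset; the exact extraction procedure is described in the body of the paper.

    # Hypothetical illustration: measure how often reports that were marked
    # as duplicates of the same underlying problem disagree on their
    # human-assigned severity label.
    duplicate_groups = [
        ["critical", "critical"],          # consistent group
        ["major", "minor"],                # inconsistent group
        ["normal", "normal", "critical"],  # inconsistent group
    ]

    # A group is inconsistent if it contains more than one distinct label.
    inconsistent = sum(1 for labels in duplicate_groups if len(set(labels)) > 1)
    rate = inconsistent / len(duplicate_groups)
    print(f"{rate:.0%} of duplicate groups have inconsistent severity labels")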
Notes
JIRA bug reports do not contain a severity field but only a priority field (which has the same meaning as the severity field in Bugzilla bug reports). Thus, when we sent emails to Apache developers who use JIRA, we replaced the term “severity” with “priority”.
Stop words are words (like “a” and “the”) that do not carry much specific information.
Tf-idf is commonly used to reflect the importance of a word to a document in a collection of documents. The tf-idf value increases in proportion to the number of times a word appears in the document, but is offset by the frequency of the word across the collection of documents.
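As a minimal sketch, tf-idf features can be extracted from bug report summaries as shown below. The use of scikit-learn and the example summaries are our illustrative choices, not necessarily the tooling used in the paper.

    # Minimal tf-idf sketch over toy bug report summaries.
    from sklearn.feature_extraction.text import TfidfVectorizer

    summaries = [
        "crash when opening large spreadsheet",
        "crash on startup after update",
        "typo in preferences dialog label",
    ]
    # stop_words="english" also removes stop words such as "a" and "the".
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(summaries)

    # Words appearing in many summaries (e.g., "crash") receive lower
    # weights than words that are distinctive to a single summary.
    print(dict(zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0])))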
Additional information
Communicated by: Andreas Zeller
Cite this article
Tian, Y., Ali, N., Lo, D. et al. On the unreliability of bug severity data. Empir Software Eng 21, 2298–2323 (2016). https://doi.org/10.1007/s10664-015-9409-1