Improving Quality of Code Review Datasets – Token-Based Feature Extraction Method


Part of the Lecture Notes in Business Information Processing book series (LNBIP,volume 404)

Abstract

Machine learning is used increasingly often in software engineering to automate tasks and to improve the speed and quality of software products. One area where machine learning is beginning to be applied is the analysis of software code. The goal of this paper is to evaluate a new method for creating machine learning feature vectors based on the content of a line of code. We designed a new feature extraction algorithm and evaluated it in an industrial case study. Our results show that the new feature extraction technique improves overall performance in terms of MCC (Matthews Correlation Coefficient) by 0.39, from 0.31 to 0.70, while reducing precision by 0.05. The implication is that we can significantly improve overall prediction accuracy for both true positives and true negatives. This increases practitioners' trust in the predictions and contributes to deeper adoption of the technique in practice.
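As a rough illustration of the idea (a minimal sketch, not the authors' actual algorithm; the full featurizer is linked in the Notes), a token-based feature vector for a single line of code can be built by counting occurrences of tokens from a fixed vocabulary. The vocabulary and tokenizer below are illustrative assumptions:

```python
from collections import Counter

# Hypothetical fixed vocabulary of code tokens (an illustrative assumption,
# not the vocabulary used in the paper).
VOCAB = ["if", "for", "while", "return", "=", "==", "(", ")", "{", "}", ";"]

def tokenize(line):
    # Naive tokenizer: pad punctuation with spaces, then split on whitespace.
    for ch in "(){};":
        line = line.replace(ch, f" {ch} ")
    return line.split()

def featurize(line):
    # Feature vector: the count of each vocabulary token in the line.
    counts = Counter(tokenize(line))
    return [counts.get(tok, 0) for tok in VOCAB]

print(featurize("if (x == 0) { return y; }"))
# -> [1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
```

Each line of code thus maps to a fixed-length numeric vector that a standard classifier can consume, which is the general shape of technique the abstract describes.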


Notes

  1. The full code of the featurizer can be found at: https://github.com/miroslawstaron/code_featurizer.
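The evaluation in the abstract is reported in terms of MCC. As a reminder of how that metric is computed, here is a minimal sketch from confusion-matrix counts (the example counts are illustrative, not taken from the paper):

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient from confusion-matrix counts.
    # Returns 0.0 when the denominator vanishes (a common convention).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return ((tp * tn) - (fp * fn)) / denom if denom else 0.0

# Illustrative counts (not results from the paper):
print(round(mcc(tp=40, tn=45, fp=5, fn=10), 2))
```

Unlike precision alone, MCC rewards correct predictions of both classes, which is why the paper's reported jump from 0.31 to 0.70 indicates better overall prediction of both true positives and true negatives.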


Author information


Correspondence to Miroslaw Staron.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Staron, M., Meding, W., Söder, O., Ochodek, M. (2021). Improving Quality of Code Review Datasets – Token-Based Feature Extraction Method. In: Winkler, D., Biffl, S., Mendez, D., Wimmer, M., Bergsmann, J. (eds) Software Quality: Future Perspectives on Software Engineering Quality. SWQD 2021. Lecture Notes in Business Information Processing, vol 404. Springer, Cham. https://doi.org/10.1007/978-3-030-65854-0_7


  • DOI: https://doi.org/10.1007/978-3-030-65854-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-65853-3

  • Online ISBN: 978-3-030-65854-0

  • eBook Packages: Computer Science, Computer Science (R0)