cregit: Token-level blame information in git version control repositories

Abstract

The blame feature of version control systems is widely used, by both practitioners and researchers, to determine who last modified a given line of code and the commit in which that change was made. The main disadvantage of blame is that, when a line is modified several times, it shows only the last commit that modified it, occluding previous changes to other parts of the same line. In this paper, we develop a method to increase the granularity of blame in git: instead of tracking lines of code, the method tracks tokens in source code. We evaluate its effectiveness with an empirical study in which we compare the accuracy of blame in git (per line) with our proposed blame-per-token method. We demonstrate that, in 5 large open source systems, blame-per-token properly identifies the commit that introduced a token with an accuracy between 94.5% and 99.2%, while blame-per-line achieves an accuracy of only between 75% and 91% (with a margin of error of ±5% at a confidence level of 95%). We also classify the reasons why either blame method fails, highlighting each method’s weaknesses. The blame-per-token method has been implemented in an open source tool called cregit, which is currently in use by the Linux Foundation to identify the persons who have contributed to the source code of the Linux kernel.
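The difference between the two granularities can be illustrated with a small simulation. The sketch below is not cregit's implementation and does not call git: it assumes a naive regular-expression tokenizer, uses Python's `difflib.SequenceMatcher` as a stand-in for git's diff, and replays a hypothetical three-commit history (`c1` to `c3`) of a single line of C code. The names `tokenize`, `blame`, and `history` are made up for this illustration.

```python
import re
from difflib import SequenceMatcher

# Assumption: a naive tokenizer (identifiers, integers, or single symbols).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(text):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(text)


def blame(versions, split):
    """Attribute each unit of the final version to the commit that introduced it.

    versions: list of (commit_id, text) pairs in chronological order.
    split:    function that turns text into a list of units (lines or tokens).
    """
    commit, text = versions[0]
    units = split(text)
    origin = [commit] * len(units)
    for commit, text in versions[1:]:
        new_units = split(text)
        # Every unit is new by default; unchanged units inherit their origin.
        new_origin = [commit] * len(new_units)
        matcher = SequenceMatcher(a=units, b=new_units, autojunk=False)
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                new_origin[block.b + k] = origin[block.a + k]
        units, origin = new_units, new_origin
    return list(zip(units, origin))


# Hypothetical history: the same line is modified in two later commits.
history = [
    ("c1", "int total = count;"),
    ("c2", "int total = count + 1;"),
    ("c3", "long total = count + 1;"),
]

per_line = blame(history, str.splitlines)
per_token = blame(history, tokenize)

print(per_line)   # the whole line is attributed to the last commit, c3
print(per_token)  # each token keeps the commit that introduced it
```

In this toy history, per-line attribution credits the entire line to `c3`, while per-token attribution still credits `total`, `=`, `count`, and `;` to `c1` and `+ 1` to `c2`. Only the changed type keyword is credited to `c3`.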

Notes

  1. https://rtyley.github.io/bfg-repo-cleaner/

  2. cloc is an OSS tool to count lines of code and comments: https://github.com/AlDanial/cloc

  3. For this reason, other “diff” algorithms have been proposed, such as “patience diff” (originally implemented in the version control system Bazaar, and also implemented in git). Patience diff tries to maximize the number of unique unchanged lines by repeatedly running Myers’ diff on sections of the input. For a discussion of its benefits, we refer the reader to Schindelin (2009).

  4. https://github.com/git/git/commit/c9018b0305a56436c85b292edbeacff04b0ebb5d

  5. http://turingmachine.org/2018/cregit

  6. http://github.com/cregit/evaluation

  7. https://github.com/GumTreeDiff/gumtree

  8. https://blogs.s-osg.org/made-change-using-cregit-debugging/

  9. http://github.com/cregit

  10. http://cregit.linuxsources.org

References

  1. Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: Proceedings of the 28th international conference on software engineering, ICSE ’06. ACM, New York, pp 361–370

  2. Asaduzzaman M, Roy CK, Schneider KA, Di Penta M (2013) Lhdiff: a language-independent hybrid approach for tracking source code lines. In: ICSM. IEEE Computer Society, pp 230–239

  3. Asenov D, Guenat B, Müller P, Otth M (2017) Precise version control of trees with line-based version control systems. In: Huisman M, Rubin J (eds) Fundamental approaches to software engineering. Springer, Berlin, pp 152–169

  4. Ayuso PN (2017) Frequently asked questions regarding gpl compliance and netfilter http://www.netfilter.org/licensing.html#faq

  5. Bhattacharya P, Neamtiu I, Faloutsos M (2014) Determining developers’ expertise and role: a graph hierarchy-based approach. In: 2014 IEEE international conference on software maintenance and evolution, pp 11–20

  6. Bille P (2005) A survey on tree edit distance and related problems. Theor Comput Sci 337(1-3):217–239

  7. Bird C, Nagappan N, Murphy B, Gall H, Devanbu P (2011) Don’t touch my code!: examining the effects of ownership on software quality. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th european conference on foundations of software engineering, ESEC/FSE ’11. ACM, New York, pp 4–14

  8. Canfora G, Cerulo L, Di Penta M (2009) Tracking your changes: a language-independent approach. IEEE Softw 26(1):50–57

  10. Chacon S, Straub B (2014) Pro git, 2nd edn. Apress, Berkeley

  11. Chawathe SS, Rajaraman A, Garcia-Molina H, Widom J (1996) Change detection in hierarchically structured information. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data, SIGMOD ’96. ACM, New York, pp 493–504

  12. Cochran WG (1963) Sampling techniques, 2nd edn. Wiley, New York

  13. Collard ML, Decker MJ, Maletic JI (2011) Lightweight transformation and fact extraction with the srcml toolkit. In: 2011 IEEE 11th international working conference on source code analysis and manipulation, pp 173–184

  14. Collard ML, Decker MJ, Maletic JI (2013) srcml: an infrastructure for the exploration, analysis, and manipulation of source code: a tool demonstration. In: 2013 IEEE international conference on software maintenance, pp 516–519

  15. Davies J, German DM, Godfrey MW, Hindle A (2011) Software bertillonage: finding the provenance of an entity. In: Proceedings of the 8th working conference on mining software repositories, MSR ’11. ACM, New York, pp 183–192

  16. Dotzler G, Philippsen M (2016) Move-optimized source code tree differencing. In: Proceedings of the 31st IEEE/ACM international conference on automated software engineering, ASE 2016. ACM, New York, pp 660–671

  17. Falleri JR, Morandat F, Blanc X, Martinez M, Monperrus M (2014) Fine-grained and accurate source code differencing. In: ACM/IEEE international conference on automated software engineering, ASE’14, Vasteras, Sweden - September 15 - 19, 2014, pp 313–324

  18. Feist MD, Santos EA, Watts I, Hindle A (2016) Visualizing project evolution through abstract syntax tree analysis. In: 2016 IEEE working conference on software visualization, VISSOFT 2016, Raleigh, NC, USA, October 3-4, 2016, pp 11–20

  19. Fluri B, Wuersch M, Pinzger M, Gall H (2007) Change distilling: tree differencing for fine-grained source code change extraction. IEEE Trans Softw Eng 33(11):725–743

  20. Fritz T, Murphy GC, Hill E (2007) Does a programmer’s activity indicate knowledge of code? In: Proceedings of the 6th joint meeting of the european software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC-FSE ’07. ACM, New York, pp 341–350

  21. Fritz T, Ou J, Murphy GC, Murphy-Hill E (2010) A degree-of-knowledge model to capture source code familiarity. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering - volume 1, ICSE ’10. ACM, New York, pp 385–394

  22. German DM (2006) A study of the contributors of postgresql. In: Proceedings of the 2006 international workshop on mining software repositories, MSR ’06, pp 163–164

  23. German DM, Hassan AE, Robles G (2009) Change impact graphs: determining the impact of prior codechanges. Inf Softw Technol 51(10):1394–1408

  24. Girba T, Kuhn A, Seeberger M, Ducasse S (2005) How developers drive software evolution. In: Eighth international workshop on principles of software evolution (IWPSE’05), pp 113–122

  25. Godfrey MW, Zou L (2005) Using origin analysis to detect merging and splitting of source code entities. IEEE Trans Softw Eng 31(2):166–181

  26. Hashimoto M, Mori A (2008) Diff/ts: a tool for fine-grained structural change analysis. In: Proceedings of the 2008 15th working conference on reverse engineering, WCRE ’08. IEEE Computer Society, Washington, pp 279–288

  27. Hassan AE (2009) Predicting faults using the complexity of code changes. In: 2009 IEEE 31st international conference on software engineering, pp 78–88

  28. Hassan AE, Holt RC (2004) C-REX: an evolutionary code extractor for C. Technical report, University of Waterloo. http://plg.uwaterloo.ca/~aeehassa/home/pubs/crex.pdf

  30. Hata H, Mizuno O, Kikuno T (2012) Bug prediction based on fine-grained module histories. In: Proceedings of the 34th international conference on software engineering, ICSE ’12. IEEE Press, Piscataway, pp 200–210

  31. Hattori LP, Lanza M, Robbes R (2012) Refining code ownership with synchronous changes. Empir Softw Eng 17(4-5):467–499

  32. Higo Y, Ohtani A, Kusumoto S (2017) Generating simpler ast edit scripts by considering copy-and-paste. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering, ASE 2017. IEEE Press, Piscataway, pp 532–542

  33. Ihara A, Kamei Y, Ohira M, Hassan AE, Ubayashi N, Matsumoto K (2014) Early identification of future committers in open source software projects. In: Proceedings of the 2014 14th international conference on quality software, QSIC ’14. IEEE Computer Society, Washington, pp 47–56

  34. Kawrykow D, Robillard MP (2011) Non-essential changes in version histories. In: Proceedings of the 33rd international conference on software engineering, ICSE ’11. ACM, New York, pp 351–360

  35. Khan S (2018) Who made that change and when: using cregit for debugging http://www.gonehiking.org/ShuahLinuxBlogs/blog/2018/10/18/who-made-that-change-and-when-using-cregit-for-debugging/

  36. Kim M, Notkin D (2009) Discovering and representing systematic code changes. In: Proceedings of the 31st international conference on software engineering, ICSE ’09. IEEE Computer Society, Washington, pp 309–319

  37. Kim S, Zimmermann T, Pan K, Whitehead Jr EJ (2006) Automatic identification of bug-introducing changes. In: Proceedings of the 21st IEEE/ACM international conference on automated software engineering, ASE ’06. IEEE Computer Society, Washington, pp 81–90

  38. Ma D, Schuler D, Zimmermann T, Sillito J (2009) Expert recommendation with usage expertise. In: 2009 IEEE international conference on software maintenance, pp 535–538

  39. Macho C, Mcintosh S, Pinzger M (2017) Extracting build changes with builddiff. In: Proceedings of the 14th international conference on mining software repositories, MSR ’17. IEEE Press, Piscataway, pp 368–378

  40. McDonald DW, Ackerman MS (2000) Expertise recommender: a flexible recommendation system and architecture. In: Proceedings of the 2000 ACM conference on computer supported cooperative work, CSCW ’00. ACM, New York, pp 231–240

  41. Meeker H (2017) Patrick mchardy and copyright profiteering. Open source https://opensource.com/article/17/8/patrick-mchardy-and-copyright-profiteering

  42. Meng X, Miller BP, Williams WR, Bernat AR (2013) Mining software repositories for accurate authorship. In: Proceedings of the 2013 IEEE international conference on software maintenance, ICSM ’13. IEEE Computer Society, Washington, pp 250–259

  44. Miller W, Myers EW (1985) A file comparison program. Softw Pract Exp 15(11):1025–1040

  45. Minto S, Murphy GC (2007) Recommending emergent teams. In: Proceedings of the fourth international workshop on mining software repositories, MSR ’07. IEEE Computer Society, Washington, pp 5–

  46. Miraldo VC, Dagand P-É, Swierstra W (2017) Type-directed diffing of structured data. In: Proceedings of the 2nd ACM SIGPLAN international workshop on type-driven development, TyDe 2017. ACM, New York, pp 2–15

  47. Mockus A, Herbsleb JD (2002) Expertise browser: a quantitative approach to identifying expertise. In: Proceedings of the 24th international conference on software engineering, ICSE ’02. ACM, New York, pp 503–512

  48. Myers EW (1986) An O(ND) difference algorithm and its variations. Algorithmica 1(1):251–266

  49. Palix N, Falleri J-R, Lawall J (2015) Improving pattern tracking with a language-aware tree differencing algorithm. In: 22nd IEEE international conference on software analysis, evolution, and reengineering, SANER 2015 Montreal, QC, Canada, March 2-6, 2015, pp 43–52

  50. Panciera K, Halfaker A, Terveen L (2009) Wikipedians are born, not made: a study of power editors on wikipedia. In: Proceedings of the ACM 2009 international conference on supporting group work, GROUP ’09. ACM, New York, pp 51–60

  51. Raghavan S, Rohana R, Leon D, Podgurski A, Augustine V (2004) Dex: a semantic-graph differencing tool for studying changes in large code bases. In: Proceedings of the 20th IEEE international conference on software maintenance, ICSM ’04. IEEE Computer Society, Washington, pp 188–197

  52. Rahman F, Devanbu P (2011) Ownership, experience and defects: a fine-grained study of authorship. In: Proceedings of the 33rd international conference on software engineering, ICSE ’11. ACM, New York, pp 491–500

  53. Reiss SP (2008) Tracking source locations. In: Proceedings of the 30th international conference on software engineering, ICSE ’08. ACM, New York, pp 11–20

  54. Schindelin J (2009) [patch 0/3] teach git about the patience diff algorithm. https://marc.info/?l=git&m=123082787502576&w=2

  55. Schuler D, Zimmermann T (2008) Mining usage expertise from version archives. In: Proceedings of the 2008 international working conference on mining software repositories, MSR ’08. ACM, New York, pp 121–124

  56. Servant F, Jones JA (2012) History slicing: assisting code-evolution tasks. In: Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering, FSE ’12. ACM, New York, pp 43:1–43:11

  57. Servant F, Jones JA (2017) Fuzzy fine-grained code-history analysis. In: Proceedings of the 39th international conference on software engineering, ICSE ’17. IEEE Press, Piscataway, pp 746–757

  58. Sharwood S (2017) Linux kernel community tries to castrate GPL copyright troll. The register https://www.theregister.co.uk/2017/10/18/linux_kernel_community_enforcement_statement/

  59. Shihab E, Mockus A, Kamei Y, Adams B, Hassan AE (2011) High-impact defects: a study of breakage and surprise defects. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on foundations of software engineering, ESEC/FSE ’11. ACM, New York, pp 300–310

  60. Spacco J, Williams C (2009) Lightweight techniques for tracking unique program statements. In: 2009 Ninth IEEE international working conference on source code analysis and manipulation, pp 99–108

  61. Tantithamthavorn C, McIntosh S, Hassan AE, Ihara A, Matsumoto K (2015) The impact of mislabelling on the performance and interpretation of defect prediction models. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, vol 1, pp 812–823

  62. The Linux Foundation (2017) Linux foundation and free software foundation europe introduce resources to support open source software license identification and compliance https://www.linuxfoundation.org/press-release/2017/04/linux-foundation-and-free-software-foundation-europe-introduce-resources-to-support-open-source-software-license-identification-and-compliance/

  63. Thongtanunam P, McIntosh S, Hassan AE, Iida H (2016) Revisiting code ownership and its relationship with software quality in the scope of modern code review. In: Proceedings of the 38th international conference on software engineering, ICSE ’16. ACM, New York, pp 1039–1050

  64. Tsantalis N, Mansouri M, Eshkevari L, Mazinanian D, Dig D (2018) Accurate and efficient refactoring detection in commit history. In: Proceedings of the 40th international conference on software engineering, ICSE 2018

  65. Tsikerdekis M (2018) Persistent code contribution: a ranking algorithm for code contribution in crowdsourced software. Empir Softw Eng 23(4):1871–1894

  66. Ukkonen E (1985) Algorithms for approximate string matching. Inf Control 64(1):100–118. International Conference on Foundations of Computation Theory

  67. Weissgerber P, Diehl S (2006) Identifying refactorings from source-code changes. In: 21st IEEE/ACM international conference on automated software engineering (ASE’06), pp 231–240

  68. Welte H (2018) Report from the Geniatech vs. mchardy GPL violation court hearing http://laforge.gnumonks.org/blog/20180307-mchardy-gpl/

  69. Xing Z, Stroulia E (2005) Umldiff: an algorithm for object-oriented design differencing. In: Proceedings of the 20th IEEE/ACM international conference on automated software engineering, ASE ’05. ACM, New York, pp 54–65

  70. Ye Y, Kishida K (2003) Toward an understanding of the motivation of open source software developers. In: Proceedings of the 25th international conference on software engineering, ICSE ’03. IEEE Computer Society, Washington, pp 419–429

  71. Zhou M, Chen Q, Mockus A, Wu F (2017) On the scalability of linux kernel maintainers’ work. In: Proceedings of the 2017 11th joint meeting on foundations of software engineering, ESEC/FSE 2017. ACM, New York, pp 27–37

Author information

Correspondence to Daniel M. German.

Additional information

Communicated by: Romain Robbes

About this article

Cite this article

German, D.M., Adams, B. & Stewart, K. cregit: Token-level blame information in git version control repositories. Empir Software Eng 24, 2725–2763 (2019). https://doi.org/10.1007/s10664-019-09704-x
