A large-scale empirical study of commit message generation: models, datasets and evaluation

Tao, Wei; Wang, Yanlin; Shi, Ensheng; Du, Lun; Han, Shi; Zhang, Hongyu; Zhang, Dongmei; Zhang, Wenqiang

doi:10.1007/s10664-022-10219-1

Published: 26 October 2022

Empirical Software Engineering, volume 27, article number 198 (2022)

Wei Tao¹,
Yanlin Wang² (ORCID: orcid.org/0000-0001-7761-7269),
Ensheng Shi³,
Lun Du⁴,
Shi Han⁴,
Hongyu Zhang⁵,
Dongmei Zhang⁴ &
Wenqiang Zhang¹

Abstract

Commit messages are natural language descriptions of code changes, which are important for program understanding and maintenance. However, writing commit messages manually is time-consuming and laborious, especially when the code is updated frequently. Various approaches utilizing generation or retrieval techniques have been proposed to automatically generate commit messages. To achieve a better understanding of how the existing approaches perform in solving this problem, this paper conducts a systematic and in-depth analysis of the state-of-the-art models and datasets. We find that: (1) Different variants of the BLEU metric used in previous works affect the evaluation. (2) Most datasets are crawled only from Java repositories while repositories in other programming languages are not sufficiently explored. (3) Dataset splitting strategies can influence the performance of existing models by a large margin. (4) For pre-trained models, fine-tuning with different multi-programming-language combinations can influence their performance. Based on these findings, we collect a large-scale, information-rich, Multi-language Commit Message Dataset (MCMD). Using MCMD, we conduct extensive experiments under different experimental settings, including splitting strategies and multi-programming-language combinations. Furthermore, we provide suggestions for comprehensively evaluating commit message generation models and discuss possible future research directions. We believe our work can help practitioners and researchers better evaluate and select models for automatic commit message generation. Our source code and data are available at https://anonymous.4open.science/r/CommitMessageEmpirical.

Notes

https://doi.org/10.5281/zenodo.5025758
https://ieeexplore.ieee.org
https://dl.acm.org
https://www.engineeringvillage.com
https://www.scopus.com
https://sjiang1.github.io/commitgen
https://github.com/SoftWiser-group/CoDiSum
https://github.com/epochx/commitgen
https://zenodo.org/record/2542706
https://github.com/Tbabm/nngen
https://zenodo.org/record/3828107
https://github.com/microsoft/CodeBERT
https://github.com/CC2Vec/CC2Vec
https://github.com/tech-srl/code2seq/tree/master/JavaExtractor
https://osf.io/67kyc/?view_only=ad588fe5d1a14dd795553fb4951b5bf9
https://pypl.github.io/PYPL.html
https://doi.org/10.5281/zenodo.5025758
Details of the calculation are available in our repository: https://anonymous.4open.science/r/CommitMessageEmpirical
All the generated commit messages are available in our repository.
Although the ATOM_data can be used for evaluating ATOM, it is not publicly available as described in Section 3.2.1.
Note that although Table 14 is the result under the B-Norm metric, our findings still hold on other metrics (with results shown in Appendix B) as well.

Author information

Authors and Affiliations

Fudan University, Shanghai, China
Wei Tao & Wenqiang Zhang
School of Software Engineering, Sun Yat-sen University, Zhuhai, China
Yanlin Wang
Xi’an Jiaotong University, Xi’an, China
Ensheng Shi
Microsoft Research Asia, Beijing, China
Lun Du, Shi Han & Dongmei Zhang
The University of Newcastle, Callaghan, Australia
Hongyu Zhang

Corresponding author

Correspondence to Yanlin Wang.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Denys Poshyvanyk

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yanlin Wang: work done during the author’s employment at Microsoft Research Asia.

Appendices

Appendix A: Searched Results

All of the search results are available in our repository at https://anonymous.4open.science/r/CommitMessageEmpirical/survey. “Searched_Results.csv” contains the titles of 583 papers retrieved from four databases. The other “.html” files record the result pages returned by the different databases. For example, “IEEE.html” is the result page for the search on https://ieeexplore.ieee.org.

Appendix B: More Experimental Results

B.1 Model Performance on ROUGE and METEOR

As shown in Table 16, the rankings of the models are mostly consistent under ROUGE-1, ROUGE-2, ROUGE-L and METEOR. We also calculated the correlation between each pair of these metrics. We find that 44 of the 48 Spearman’s rank correlation coefficients are significantly higher than 0.78, which means these metrics are generally consistent across evaluations.
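To make the consistency check concrete, the following is a minimal sketch of Spearman’s rank correlation between two metrics. It is not the script used in this study: the function names and the example scores are hypothetical, and tied scores are assumed to share their average rank.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1 (smallest) upward; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every entry tied with the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the window
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical scores of five models under two metrics (not taken from Table 16).
    rouge_l = [21.0, 18.5, 25.3, 16.2, 23.1]
    meteor = [14.2, 12.9, 17.8, 13.0, 15.5]
    print(round(spearman(rouge_l, meteor), 3))
```

A coefficient close to 1 means the two metrics order the models almost identically, which is the sense in which the metrics are described as consistent.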

Table 16 Models performance on ROUGE and METEOR

B.2 Model Performance on MCMD Split by Timestamp on ROUGE and METEOR

Table 17 shows the experimental results on MCMD split by timestamp. Compared to Table 10, the performance of all models on all PLs of MCMD drops consistently. This shows that it is more difficult to predict future commit messages with a model trained on past data.
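The split-by-timestamp setting can be illustrated with a short sketch. This is not the MCMD preprocessing code: the `Commit` record, its field names, and the 80/10/10 ratio are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Commit:
    """Hypothetical commit record; the field names are illustrative only."""
    project: str
    timestamp: int  # e.g. seconds since the Unix epoch
    message: str


def split_by_timestamp(
    commits: Sequence[Commit],
    train_frac: float = 0.8,  # assumed ratio, not taken from the paper
    valid_frac: float = 0.1,  # assumed ratio, not taken from the paper
) -> tuple[list[Commit], list[Commit], list[Commit]]:
    """Order commits chronologically so that training data precedes test data."""
    ordered = sorted(commits, key=lambda c: c.timestamp)
    n_train = int(len(ordered) * train_frac)
    n_valid = int(len(ordered) * valid_frac)
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]
    return train, valid, test


if __name__ == "__main__":
    # Ten toy commits with shuffled timestamps.
    toy = [Commit("demo", t, f"msg {t}") for t in (5, 3, 9, 1, 7, 2, 8, 4, 10, 6)]
    train, valid, test = split_by_timestamp(toy)
    print(len(train), len(valid), len(test))
```

Under this scheme every test commit is at least as recent as every training commit, so the model never sees messages written after the ones it is asked to predict.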

Table 17 The performance on MCMD split by timestamp

B.3 Model Performance on MCMD Split by Project on ROUGE and METEOR

Table 18 shows the experimental results on MCMD split by project. Compared to Table 10, the performance of all models on all PLs of MCMD generally drops, which indicates that the split-by-project scenario is much more difficult than the split-by-commit scenario.
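For comparison, a split by project can be sketched as follows. Again, this is an illustrative sketch rather than the MCMD preprocessing code: the `Commit` record, the 80/10/10 ratio over projects, and the fixed shuffling seed are all assumptions.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Commit:
    """Hypothetical commit record; the field names are illustrative only."""
    project: str
    timestamp: int
    message: str


def split_by_project(
    commits: Sequence[Commit],
    train_frac: float = 0.8,  # assumed ratio over projects, not from the paper
    valid_frac: float = 0.1,  # assumed ratio over projects, not from the paper
    seed: int = 0,
) -> tuple[list[Commit], list[Commit], list[Commit]]:
    """Assign whole projects to partitions so that no project is shared."""
    by_project: dict[str, list[Commit]] = defaultdict(list)
    for commit in commits:
        by_project[commit.project].append(commit)

    projects = sorted(by_project)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(projects)

    n_train = int(len(projects) * train_frac)
    n_valid = int(len(projects) * valid_frac)
    groups = (
        projects[:n_train],
        projects[n_train:n_train + n_valid],
        projects[n_train + n_valid:],
    )
    train, valid, test = ([c for p in group for c in by_project[p]] for group in groups)
    return train, valid, test


if __name__ == "__main__":
    # Ten toy projects with three commits each.
    toy = [Commit(f"proj{p}", t, f"msg {p}-{t}") for p in range(10) for t in range(3)]
    train, valid, test = split_by_project(toy)
    print(len(train), len(valid), len(test))
```

Because the partitions share no project, the test set contains only repositories unseen during training, which is one plausible reason this setting is harder than splitting by commit.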

Table 18 The performance on MCMD split by project

B.4 CodeBERT Performance on MCMD (Fine-Tuning with Different Multi-PL Combinations) on ROUGE, METEOR, B-Moses, and B-CC

Tables 19, 20, 21, 22, 23, and 24 show the experimental results on ROUGE-1, ROUGE-2, ROUGE-L, METEOR, B-Moses and B-CC, respectively. These scores reflect the performance of CodeBERT fine-tuned with different combinations of the five PLs. The findings described in Section 4.5 also hold in these tables.
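To give a sense of the size of this search space, the sketch below enumerates every non-empty combination of five languages. The names `PL1` to `PL5` are placeholders, and whether every combination is actually evaluated in the tables is an assumption of this sketch.

```python
from itertools import combinations
from typing import Sequence

# Placeholder names standing in for the five PLs of MCMD.
LANGUAGES = ["PL1", "PL2", "PL3", "PL4", "PL5"]


def language_combinations(languages: Sequence[str]) -> list[tuple[str, ...]]:
    """Return every non-empty subset of the languages, smallest subsets first."""
    return [
        combo
        for size in range(1, len(languages) + 1)
        for combo in combinations(languages, size)
    ]


if __name__ == "__main__":
    combos = language_combinations(LANGUAGES)
    # Each tuple names one candidate fine-tuning corpus.
    for combo in combos:
        print("+".join(combo))
```

Five languages yield 2^5 - 1 = 31 non-empty combinations, so an exhaustive comparison requires one fine-tuning run per combination.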

Table 19 CodeBERT on MCMD (fine-tuning with different multi-PL combinations)

Table 20 CodeBERT on MCMD (fine-tuning with different multi-PL combinations)

Table 21 CodeBERT on MCMD (fine-tuning with different multi-PL combinations)

Table 22 CodeBERT on MCMD (fine-tuning with different multi-PL combinations)

Table 23 CodeBERT on MCMD (fine-tuning with different multi-PL combinations)

Table 24 CodeBERT on MCMD (fine-tuning with different multi-PL combinations)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tao, W., Wang, Y., Shi, E. et al. A large-scale empirical study of commit message generation: models, datasets and evaluation. Empir Software Eng 27, 198 (2022). https://doi.org/10.1007/s10664-022-10219-1

Accepted: 29 July 2022
Published: 26 October 2022
DOI: https://doi.org/10.1007/s10664-022-10219-1
