Learning Human-Written Commit Messages to Document Code Changes

Huang, Yuan; Jia, Nan; Zhou, Hao-Jie; Chen, Xiang-Ping; Zheng, Zi-Bin; Tang, Ming-Dong

doi:10.1007/s11390-020-0496-0

Regular Paper
Published: 30 November 2020

Volume 35, pages 1258–1277, (2020)
Journal of Computer Science and Technology

Yuan Huang¹,
Nan Jia²,
Hao-Jie Zhou¹,
Xiang-Ping Chen³,
Zi-Bin Zheng¹ &
…
Ming-Dong Tang⁴,⁵

Abstract

Commit messages are important complementary information for understanding code changes. To address the scarcity of such messages, several approaches have been proposed for generating commit messages automatically. However, most of these approaches generate a superficial summary of the changed software entities without considering the intent behind the code changes (e.g., they cannot generate a message such as “fixing null pointer exception”). Because developers often describe the intent behind a code change when writing messages, we propose ChangeDoc, an approach that reuses existing messages in version control systems for automatic commit message generation. Our approach considers four kinds of similarity: syntax, semantic, pre-syntax, and pre-semantic. For a given commit without a message, it discovers the most similar past commit in a large commit repository and recommends that commit's message as the message of the given commit. Our repository contains half a million commits collected from SourceForge. We evaluate our approach on commits from 10 projects. The results show that 21.5% of the messages recommended by ChangeDoc can be used directly without modification, and 62.8% require minor modifications. To evaluate the quality of the commit messages recommended by ChangeDoc, we performed two empirical studies involving a total of 40 participants (10 professional developers and 30 students). The results indicate that the recommended messages are very good approximations of the ones written by developers and often include important intent information that is not included in the messages generated by other tools.
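
The retrieval idea described in the abstract (find the most similar past commit and reuse its message) can be illustrated with a minimal sketch. This is not ChangeDoc itself: the abstract does not define the four similarities, so a plain token-based cosine similarity stands in for them here as an assumption. The function names (`tokenize`, `cosine`, `recommend_message`) and the two-commit repository are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(diff_text):
    """Split a diff into lowercase identifier-like tokens."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", diff_text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend_message(new_diff, past_commits):
    """Return (message, score) of the most similar past commit.

    Assumption: similarity is a single cosine score over tokens,
    a stand-in for the similarities named in the abstract.
    """
    query = Counter(tokenize(new_diff))
    best_message, best_score = None, -1.0
    for diff, message in past_commits:
        score = cosine(query, Counter(tokenize(diff)))
        if score > best_score:
            best_message, best_score = message, score
    return best_message, best_score


# Hypothetical two-commit repository of (diff, message) pairs.
past = [
    ("+ if (user == null) return; - user.getName();",
     "fixing null pointer exception"),
    ("+ import java.util.List; + List<String> names;",
     "add list of names"),
]
message, score = recommend_message(
    "+ if (order == null) return; - order.getId();", past
)
print(message, round(score, 3))
```

In this toy setup the new diff shares the null-check tokens with the first past commit, so that commit's intent-bearing message is the one recommended. A realistic system would need a far larger repository and a richer similarity measure than token overlap.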

Author information

Authors and Affiliations

National Engineering Research Center of Digital Life, School of Data and Computer Science, Sun Yat-sen University, Guangzhou, 510006, China
Yuan Huang, Hao-Jie Zhou & Zi-Bin Zheng
School of Information Engineering, Hebei GEO University, Shijiazhuang, 050031, China
Nan Jia
Guangdong Key Laboratory for Big Data Analysis and Simulation of Public Opinion, School of Communication and Design, Sun Yat-sen University, Guangzhou, 510006, China
Xiang-Ping Chen
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, 510006, China
Ming-Dong Tang
Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, 510006, China
Ming-Dong Tang

Corresponding author

Correspondence to Xiang-Ping Chen.

Supplementary Information

ESM 1

(PDF 108 kb)

About this article

Cite this article

Huang, Y., Jia, N., Zhou, HJ. et al. Learning Human-Written Commit Messages to Document Code Changes. J. Comput. Sci. Technol. 35, 1258–1277 (2020). https://doi.org/10.1007/s11390-020-0496-0

Received: 05 April 2020
Revised: 15 October 2020
Published: 30 November 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s11390-020-0496-0
