Abstract
Code summarization aims to generate concise natural language descriptions for a piece of code, helping developers comprehend source code. Analysis of existing work shows that extracting the syntactic and semantic features of source code is crucial for generating high-quality summaries. To represent source code more comprehensively from different perspectives, we propose an approach named EnCoSum, which enhances semantic features for the multi-scale multi-modal code summarization method. This method builds on our previously proposed M2TS approach (a multi-scale multi-modal Transformer-based approach for source code summarization), which uses a multi-scale method to capture the structural information of Abstract Syntax Trees (ASTs) more completely and accurately at multiple local and global levels. In addition, we devise a new cross-modal fusion method to fuse source code and AST features, which highlights the key features in each modality that help generate summaries. To obtain richer semantic information, we improve M2TS in two ways. First, we add data-flow and control-flow edges to ASTs, yielding edge-augmented ASTs that we call Enhanced-ASTs (E-ASTs). Second, we introduce method name sequences extracted from the source code, which carry more knowledge about critical tokens in the corresponding summaries and can help the model generate higher-quality summaries. We conduct extensive experiments on processed Java and Python datasets and evaluate our approach with the four most commonly used machine translation metrics. The experimental results demonstrate that EnCoSum is effective and outperforms current state-of-the-art methods. Further, we perform ablation experiments on each of the model's key components, and the results show that they all contribute to the performance of EnCoSum.
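To make the two additions concrete, the sketch below illustrates (not the authors' implementation, which is not shown here) the general idea of an edge-augmented AST and of method-name subtokenization, using Python's standard `ast` module. It labels structural parent-child edges `'ast'`, adds naive last-write-to-read `'dataflow'` edges for local variables, and splits an identifier such as `addTotal` into the subtokens `add` and `total`. The function names `build_east` and `subtokenize` are illustrative; a real E-AST would also encode control flow and handle Java via a separate parser.

```python
import ast
import re

def subtokenize(name):
    """Split an identifier into lowercase subtokens (snake_case / camelCase)."""
    parts = re.split(r'_|(?<=[a-z0-9])(?=[A-Z])', name)
    return [p.lower() for p in parts if p]

def build_east(source):
    """Parse `source` and return (nodes, edges).

    Each edge is (src_index, dst_index, label): 'ast' for parent-child
    edges, 'dataflow' for a variable's most recent write -> read.
    """
    tree = ast.parse(source)
    nodes, edges, index = [], [], {}

    # number every AST node and record its type as the node feature
    for node in ast.walk(tree):
        index[id(node)] = len(nodes)
        nodes.append(type(node).__name__)

    # structural (parent-child) edges
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append((index[id(node)], index[id(child)], 'ast'))

    # naive data-flow edges: link each variable read to its latest write
    last_write = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                last_write[node.id] = index[id(node)]
            elif isinstance(node.ctx, ast.Load) and node.id in last_write:
                edges.append((last_write[node.id], index[id(node)], 'dataflow'))
    return nodes, edges

nodes, edges = build_east("def add_total(x):\n    total = x + 1\n    return total\n")
```

The method name `add_total` would contribute the sequence `['add', 'total']` as an extra input modality, while the `'dataflow'` edge from the assignment to the `return` statement is exactly the kind of semantic information plain ASTs lack.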
Code Availability
The data and code are publicly available in the repository referenced in the paper.
Funding
This work was supported by the Natural Science Foundation of Shandong Province, China (ZR2021MF059, ZR2019MF071), the National Natural Science Foundation of China (61602286, 61976127), and the Special Project on Innovative Methods (2020IM020100).
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Author information
Contributions
Yuexiu Gao conceived and designed the study. Yuexiu Gao and Chen Lyu performed the experiments and wrote the paper. Hongyu Zhang and Chen Lyu reviewed and edited the manuscript. All authors discussed the results and contributed to the final manuscript.
Additional information
Communicated by: Xin Xia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gao, Y., Zhang, H. & Lyu, C. EnCoSum: enhanced semantic features for multi-scale multi-modal source code summarization. Empir Software Eng 28, 126 (2023). https://doi.org/10.1007/s10664-023-10384-x