EnCoSum: enhanced semantic features for multi-scale multi-modal source code summarization

Abstract

Code summarization aims to generate concise natural language descriptions for a piece of code, helping developers comprehend the source code. Analysis of current work shows that extracting the syntactic and semantic features of source code is crucial for generating high-quality summaries. To represent source code features more comprehensively from different perspectives, we propose an approach named EnCoSum, which enhances semantic features for multi-scale multi-modal code summarization. EnCoSum extends our previously proposed M2TS approach (a multi-scale multi-modal approach based on the Transformer for source code summarization), which uses a multi-scale method to capture the structural information of Abstract Syntax Trees (ASTs) more completely and accurately at multiple local and global levels. In addition, we devise a new cross-modal fusion method to fuse source code and AST features, highlighting the key features in each modality that help generate summaries. To obtain richer semantic information, we improve M2TS in two ways. First, we add data-flow and control-flow edges to ASTs; we call the resulting edge-augmented ASTs Enhanced-ASTs (E-ASTs). Second, we introduce method name sequences extracted from the source code, which carry more knowledge about the critical tokens in the corresponding summaries and can help the model generate higher-quality summaries. We conduct extensive experiments on processed Java and Python datasets and evaluate our approach with the four most commonly used machine translation metrics. The experimental results demonstrate that EnCoSum is effective and outperforms current state-of-the-art methods. Furthermore, we perform ablation experiments on each of the model's key components, and the results show that they all contribute to the performance of EnCoSum.
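To make the E-AST and method-name inputs concrete, the sketch below parses a small Java method with javalang (the parser listed in the Notes) and augments its plain AST with extra edges. This is a minimal illustration under our own assumptions: the Demo class, the data-flow heuristic (an edge between successive mentions of the same identifier), and the camelCase split are simplified stand-ins, not the paper's actual flow analysis, which is more precise.

```python
# Illustrative sketch only: builds a toy E-AST from a Java snippet using
# javalang (footnote 1). The "data-flow" heuristic below is a deliberate
# simplification of the paper's analysis; control-flow edges would be
# added analogously from statement order and branch structure.
import re

import javalang

code = """
class Demo {
    int sum(int[] xs) {
        int total = 0;
        for (int x : xs) { total += x; }
        return total;
    }
}
"""

tree = javalang.parse.parse(code)

# Plain AST: number nodes in traversal order and record parent-child edges.
nodes, ast_edges, index = [], [], {}
for path, node in tree:
    index[id(node)] = len(nodes)
    nodes.append(type(node).__name__)
    parent = next((p for p in reversed(path) if isinstance(p, javalang.tree.Node)), None)
    if parent is not None:
        ast_edges.append((index[id(parent)], index[id(node)]))

# E-AST augmentation (simplified): add an extra edge between successive
# AST nodes that mention the same identifier, a crude stand-in for data flow.
last_seen, flow_edges = {}, []
for path, node in tree:
    name = getattr(node, "member", None) or getattr(node, "name", None)
    if isinstance(name, str):
        if name in last_seen:
            flow_edges.append((last_seen[name], index[id(node)]))
        last_seen[name] = index[id(node)]

# Method name sequence: split the camelCase name into subtokens
# (e.g. "getMaxValue" -> ['get', 'max', 'value']).
method = next(node for _, node in tree.filter(javalang.tree.MethodDeclaration))
name_seq = [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", method.name)]

print(len(nodes), "AST nodes,", len(ast_edges), "tree edges,", len(flow_edges), "flow edges")
print("method name sequence:", name_seq)
```

In the full model, the E-AST edges feed the multi-scale AST encoder, while the method name subtokens are supplied as an additional input alongside the source code tokens.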

Code Availability

The data and code are publicly available; the repository link is given in the paper.

Notes

  1. https://pypi.org/project/javalang/

  2. https://treelib.readthedocs.io/en/latest/

  3. https://github.com/xing-hu/TL-CodeSum

  4. https://github.com/EdinburghNLP

  5. http://www.eclipse.org/jdt/

  6. https://pytorch.org/

References

  • Ahmad WU, Chakraborty S, Ray B, Chang K-W (2020) A transformer-based approach for source code summarization. In: ACL

  • Ahmad WU, Chakraborty S, Ray B, Chang K-W (2021) Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333

  • Allamanis M (2022) Graph neural networks in program analysis. In: Graph neural networks: foundations, frontiers, and applications. Springer, pp 483–497

  • Allamanis M, Barr ET, Bird C, Sutton C (2015) Suggesting accurate method and class names. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering, pp 38–49

  • Allamanis M, Brockschmidt M, Khademi M (2018) Learning to represent programs with graphs. In: International conference on learning representations

  • Allamanis M, Peng H, Sutton C (2016) A convolutional attention network for extreme summarization of source code. In: International conference on machine learning, pp 2091–2100. PMLR

  • Allamanis M, Tarlow D, Gordon A, Wei Y (2015) Bimodal modelling of source code and natural language. In: International conference on machine learning, pp 2123–2132. PMLR

  • Alon U, Brody S, Levy O, Yahav E (2018) code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400

  • Banerjee S, Lavie A (2005) Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp 65–72

  • Barone AVM, Sennrich R (2017) A parallel corpus of python functions and documentation strings for automated code documentation and code generation. arXiv preprint arXiv:1707.02275

  • Cho K, Merriënboer BV, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078

  • Eddy BP, Robinson JA, Kraft NA, Carver JC (2013) Evaluating source code summarization techniques: Replication and expansion. In: 2013 21st International Conference on Program Comprehension (ICPC), pp 13–22. IEEE

  • Feng Z, Guo D, Tang D, Duan N, Feng X, Gong M, Shou L, Qin B, Liu T, Jiang D et al (2020) Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155

  • Gao S, Gao C, He Y, Zeng J, Nie LY, Xia X (2021) Code structure guided transformer for source code summarization. arXiv preprint arXiv:2104.09340

  • Gao Y, Lyu C (2022) M2ts: Multi-scale multi-modal approach based on transformer for source code summarization. arXiv preprint arXiv:2203.09707

  • Guo D, Ren S, Lu S, Feng Z, Tang D, Liu S, Zhou L, Duan N, Svyatkovskiy A, Fu S et al (2021) Graphcodebert: Pre-training code representations with data flow. In: ICLR

  • Haiduc S, Aponte J, Marcus A (2010a) Supporting program comprehension with source code summarization. In: 2010 ACM/IEEE 32nd international conference on software engineering, vol 2, pp 223–226. IEEE

  • Haiduc S, Aponte J, Moreno L, Marcus A (2010b) On the use of automated text summarization techniques for summarizing source code. In: 2010 17th Working conference on reverse engineering, pp 35–44. IEEE

  • Haije T, Gavves E, Heuer H (2016) Automatic comment generation using a neural translation model. Bachelor thesis, Bachelor Opleiding Kunstmatige Intelligentie, University of Amsterdam

  • Hasan M, Muttaqueen T, Ishtiaq AA, Mehrab KS, Haque MMA, Hasan T, Ahmad WU, Iqbal A, Shahriyar R (2021) Codesc: A large code-description parallel dataset. arXiv preprint arXiv:2105.14220

  • Hindle A, Barr ET, Gabel M, Su Z, Devanbu P (2016) On the naturalness of software. Commun ACM 59(5):122–131

  • Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

  • Hu X, Li G, Xia X, Lo D, Jin Z (2020) Deep code comment generation with hybrid lexical and syntactical information. Empir Softw Eng 25(3):2179–2217

  • Hu X, Li G, Xia X, Lo D, Jin Z (2018b) Deep code comment generation. In: 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), pp 200–210. IEEE

  • Hu X, Li G, Xia X, Lo D, Lu S, Jin Z (2018a) Summarizing source code with transferred api knowledge. In: Proceedings of the 27th international joint conference on artificial intelligence (IJCAI), pp 2269–2275

  • Iyer S, Konstas I, Cheung A, Zettlemoyer L (2016) Summarizing source code using a neural attention model. In: Proceedings of the 54th annual meeting of the association for computational linguistics (Volume 1: Long Papers), pp 2073–2083

  • Jiang X, Zheng Z, Lyu C, Li L, Lyu L (2021) Treebert: A tree-based pre-trained model for programming language. In: Uncertainty in artificial intelligence, pp 54–63. PMLR

  • Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907

  • Ko AJ, Myers BA, Aung HH (2004) Six learning barriers in end-user programming systems. In: 2004 IEEE Symposium on visual languages-human centric computing, pp 199–206. IEEE

  • Ko AJ, Myers BA, Coblenz MJ, Aung HH (2006) An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Trans Softw Eng 32(12):971–987

  • LeClair A, Haque S, Wu L, McMillan C (2020) Improved code summarization via a graph neural network. In: Proceedings of the 28th international conference on program comprehension, pp 184–195

  • LeClair A, Jiang S, McMillan C (2019) A neural model for generating natural language summaries of program subroutines. In: 2019 IEEE/ACM 41st international conference on software engineering (ICSE), pp 795–806. IEEE

  • LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

  • Lin C-Y (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81

  • Lin C, Ouyang Z, Zhuang J, Chen J, Li H, Wu R (2021) Improving code summarization with block-wise abstract syntax tree splitting. In: 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC), pp 184–195. IEEE

  • Lu S, Guo D, Ren S, Huang J, Svyatkovskiy A, Blanco A, Clement C, Drain D, Jiang D, Tang D et al (2021) Codexglue: A machine learning benchmark dataset for code understanding and generation. In: Thirty-fifth conference on neural information processing systems datasets and benchmarks track (Round 1)

  • McBurney PW, McMillan C (2015) Automatic source code summarization of context for java methods. IEEE Trans Softw Eng 42(2):103–119

  • Mehrotra N, Agarwal N, Gupta P, Anand S, Lo D, Purandare R (2021) Modeling functional similarity in source code with graph-based siamese networks. IEEE Trans Softw Eng

  • Mou L, Li G, Zhang L, Wang T, Jin Z (2016) Convolutional neural networks over tree structures for programming language processing. In: Thirtieth AAAI conference on artificial intelligence

  • Niu C, Li C, Ng V, Ge J, Huang L, Luo B (2022) Spt-code: sequence-to-sequence pre-training for learning source code representations. In: Proceedings of the 44th international conference on software engineering, pp 2006–2018

  • Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318

  • Phan L, Tran H, Le D, Nguyen H, Annibal J, Peltekian A, Ye Y (2021) Cotext: Multi-task learning with code-text transformer. In: Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), pp 40–47

  • Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22(3):400–407

  • Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536

  • Rush AM, Chopra S, Weston J (2015) A neural attention model for abstractive sentence summarization. In: EMNLP

  • Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2008) The graph neural network model. IEEE Trans Neural Netw 20(1):61–80

  • Shido Y, Kobayashi Y, Yamamoto A, Miyamoto A, Matsumura T (2019) Automatic source code summarization with extended tree-lstm. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp 1–8. IEEE

  • Shuai J, Xu L, Liu C, Yan M, Xia X, Lei Y (2020) Improving code search with co-attentive representation learning. In: Proceedings of the 28th international conference on program comprehension, pp 196–207

  • Singer J, Lethbridge T, Vinson N, Anquetil N (2010) An examination of software engineering work practices. In: CASCON First decade high impact papers, pp 174–188

  • Sridhara G, Hill E, Muppaneni D, Pollock L, Vijay-Shanker K (2010) Towards automatically generating summary comments for java methods. In: Proceedings of the IEEE/ACM international conference on Automated software engineering, pp 43–52

  • Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958

  • Sun Z, Zhu Q, Xiong Y, Sun Y, Mou L, Zhang L (2020) Treegen: A tree-based transformer architecture for code generation. Proceedings of the AAAI Conference on Artificial Intelligence 34:8984–8991

  • Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27

  • Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Advances in Neural Information Processing Systems 30

  • Vedantam R, Zitnick CL, Parikh D (2015) Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575

  • Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164

  • Wang W, Li G, Shen S, Xia X, Jin Z (2020) Modular tree network for source code representation learning. ACM Trans Softw Eng Methodol (TOSEM) 29(4):1–23

  • Wang W, Li G, Ma B, Xia X, Jin Z (2020) Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp 261–271. IEEE

  • Wang Y, Wang W, Joty S, Hoi SCH (2021) Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 8696–8708

  • Wang W, Zhang Y, Sui Y, Wan Y, Zhao Z, Wu J, Yu P, Xu G (2020) Reinforcement-learning-guided source code summarization via hierarchical attention. IEEE Trans Softw Eng

  • Wan Y, Zhao Z, Yang M, Xu G, Ying H, Wu J, Yu PS (2018) Improving automatic source code summarization via deep reinforcement learning. In: Proceedings of the 33rd ACM/IEEE international conference on automated software engineering, pp 397–407

  • Wei B, Li G, Xia X, Fu Z, Jin Z (2019) Code generation as a dual task of code summarization. Advances in Neural Information Processing Systems 32

  • Wong E, Yang J, Tan L (2013) Autocomment: Mining question and answer sites for automatic comment generation. In: 2013 28th IEEE/ACM International conference on automated software engineering (ASE), pp 562–567. IEEE

  • Xia X, Bao L, Lo D, Xing Z, Hassan AE, Li S (2017) Measuring program comprehension: A large-scale field study with professionals. IEEE Trans Softw Eng 44(10):951–976

  • Xu K, Wu L, Wang Z, Feng Y, Witbrock M, Sheinin V (2018) Graph2seq: Graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823

  • Yamaguchi F, Golde N, Arp D, Rieck K (2014) Modeling and discovering vulnerabilities with code property graphs. In: 2014 IEEE Symposium on security and privacy, pp 590–604. IEEE

  • Yang Z, Keung J, Yu X, Gu X, Wei Z, Ma X, Zhang M (2021) A multi-modal transformer-based code summarization approach for smart contracts. In: 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC), pp 1–12. IEEE

  • Zhang J, Wang X, Zhang H, Sun H, Liu X (2020) Retrieval-based neural source code summarization. In: 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), pp 1385–1397. IEEE

  • Zhang J, Wang X, Zhang H, Sun H, Wang K, Liu X (2019) A novel neural source code representation based on abstract syntax tree. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp 783–794. IEEE

  • Zhao G, Huang J (2018) Deepsim: deep learning code functional similarity. In: Proceedings of the 2018 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, pp 141–151

  • Zhou Y, Liu S, Siow J, Du X, Liu Y (2019) Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in Neural Information Processing Systems 32

Download references

Funding

This work was supported by the Natural Science Foundation of Shandong Province, China (ZR2021MF059, ZR2019MF071), the National Natural Science Foundation of China (61602286, 61976127), and the Special Project on Innovative Methods (2020IM020100).

Competing Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Author information

Contributions

Yuexiu Gao conceived and designed the study. Yuexiu Gao and Chen Lyu performed the experiments and wrote the paper. Hongyu Zhang and Chen Lyu reviewed and edited the manuscript. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Chen Lyu.

Additional information

Communicated by: Xin Xia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gao, Y., Zhang, H. & Lyu, C. EnCoSum: enhanced semantic features for multi-scale multi-modal source code summarization. Empir Software Eng 28, 126 (2023). https://doi.org/10.1007/s10664-023-10384-x
