
CoCoAST: Representing Source Code via Hierarchical Splitting and Reconstruction of Abstract Syntax Trees


Abstract

Recently, machine learning techniques, especially deep learning, have made substantial progress on code intelligence tasks such as code summarization, code search, and clone detection. A key challenge is how to represent source code so as to effectively capture its syntactic, structural, and semantic information. Recent studies show that the information extracted from abstract syntax trees (ASTs) is conducive to code representation learning. However, existing approaches fail to fully capture the rich information in ASTs because of their large size and depth. In this paper, we propose CoCoAST, a novel model that hierarchically splits and reconstructs ASTs to comprehensively capture the syntactic and semantic information of code without losing AST structural information. First, we hierarchically split a large AST into a set of subtrees and utilize a recursive neural network to encode the subtrees. Then, we aggregate the embeddings of the subtrees by reconstructing the split ASTs to obtain the representation of the complete AST. Finally, we combine the AST representation, which carries the syntactic and structural information, with the source code embedding, which represents the lexical information, to obtain the final neural code representation. We have applied our source code representation to two common program comprehension tasks, code summarization and code search. Extensive experiments demonstrate the superiority of CoCoAST. To facilitate reproducibility, our data and code are available at https://github.com/s1530129650/CoCoAST.
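To make the pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of the three stages described above (hierarchical splitting, subtree encoding, and reconstruction-based aggregation, followed by fusion with the lexical representation). The class and function names, the GRU stand-in for the RvNN, and the mean aggregation are simplifications for illustration only, not the released implementation.

```python
# Hypothetical sketch of the CoCoAST pipeline described in the abstract.
# All names and module choices are illustrative placeholders.
from typing import List

import torch
import torch.nn as nn


class CoCoASTSketch(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.subtree_encoder = nn.GRU(dim, dim, batch_first=True)  # stand-in for the RvNN
        self.token_encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, subtrees: List[torch.Tensor], code_tokens: torch.Tensor) -> torch.Tensor:
        # 1) encode each subtree (here given as node-id sequences from a traversal)
        subtree_vecs = [self.subtree_encoder(self.embed(st).unsqueeze(0))[1][-1]
                        for st in subtrees]
        # 2) aggregate subtree embeddings; the paper reconstructs the split hierarchy,
        #    a plain mean is used here only to keep the sketch short
        ast_repr = torch.stack(subtree_vecs).mean(dim=0)
        # 3) fuse with the lexical (token-level) representation of the source code
        code_repr = self.token_encoder(self.embed(code_tokens)).mean(dim=1)
        return torch.cat([ast_repr, code_repr], dim=-1)
```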


Notes

  1. The full AST is omitted due to the space limit; it can be found in Appendix C.

  2. We only present the top-down skeleton and partial rules due to the space limitation. The full set of rules and the tool implementation are provided in Appendix D.

  3. A "rare" token refers to a token that occurs infrequently in the training dataset.

  4. See training time details in Appendix Tables 14 and 15.

  5. See Appendix Table 17

References

  • Ahmad WU, Chakraborty S, Ray B, Chang K (2020) A transformer-based approach for source code summarization. In: ACL

  • Ahmad WU, Chakraborty S, Ray B, Chang K (2021) Unified pre-training for program understanding and generation. In: NAACL-HLT, pp. 2655–2668. Association for Computational Linguistics

  • Allamanis M, Barr ET, Bird C, Sutton CA (2015) Suggesting accurate method and class names. In: FSE

  • Allamanis M, Brockschmidt M, Khademi M (2018) Learning to represent programs with graphs. In: ICLR. OpenReview.net

  • Allamanis M, Peng H, Sutton C (2016) A convolutional attention network for extreme summarization of source code. In: ICML, JMLR Workshop and Conference Proceedings, JMLR.org vol. 48, pp 2091–2100

  • Alon U, Brody S, Levy O, Yahav E (2019a) code2seq: Generating sequences from structured representations of code. In: ICLR (Poster). OpenReview.net

  • Alon U, Yahav E (2021) On the bottleneck of graph neural networks and its practical implications. In: ICLR. OpenReview.net

  • Alon U, Zilberstein M, Levy O, Yahav E (2019b) code2vec: Learning distributed representations of code. In: POPL

  • Banerjee S, Lavie A (2005) METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: IEEvaluation@ACL

  • Bansal A, Haque S, McMillan C (2021) Project-level encoding for neural source code summarization of subroutines. In: ICPC, IEEE pp 253–264

  • Bengio Y, Frasconi P, Simard PY (1993) The problem of learning long-term dependencies in recurrent networks. In: ICNN

  • Cho K, van Merrienboer B, Gülçehre Ç, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP, ACL pp 1724–1734

  • Du L, Shi X, Wang Y, Shi E, Han S, Zhang D (2021) Is a single model enough? mucos: A multi-model ensemble learning approach for semantic code search. In: CIKM, ACM pp 2994–2998

  • Eddy BP, Robinson JA, Kraft NA, Carver JC (2013) Evaluating source code summarization techniques: Replication and expansion. In: ICPC, IEEE Computer Society pp 13–22

  • Feng Z, Guo D, Tang D, Duan N, Feng X, Gong M, Shou L, Qin B, Liu T, Jiang D, Zhou M (2020) Codebert: A pre-trained model for programming and natural languages. In: EMNLP (Findings)

  • Fernandes P, Allamanis M, Brockschmidt M (2019) Structured neural summarization. In: ICLR

  • Fout A, Byrd J, Shariat B, Ben-Hur A (2017) Protein interface prediction using graph convolutional networks. In: NIPS, pp 6530–6539

  • Franks C, Tu Z, Devanbu PT, Hellendoorn V (2015) CACHECA: A cache language model based code suggestion tool. In: ICSE, IEEE Computer Society (2), pp 705–708

  • Gao S, Gao C, He Y, Zeng J, Nie LY, Xia X (2021) Code structure guided transformer for source code summarization. arXiv:2104.09340

  • Garg VK, Jegelka S, Jaakkola TS (2020) Generalization and representational limits of graph neural networks. ICML, Proceedings of Machine Learning Research, PMLR 119:3419–3430


  • Gros D, Sezhiyan H, Devanbu P, Yu Z (2020) Code to comment “translation”: Data, metrics, baselining & evaluation. In: ASE

  • Gu J, Chen Z, Monperrus M (2021) Multimodal representation for neural code search. In: IEEE International Conference on Software Maintenance and Evolution, ICSME 2021, Luxembourg, 2021, pp 483–494. IEEE. September 27 - October 1 https://doi.org/10.1109/ICSME52107.2021.00049

  • Gu W, Li Z, Gao C, Wang C, Zhang H, Xu Z, Lyu MR (2021) Cradle: Deep code retrieval based on semantic dependency learning. Neural Networks 141:385–394


  • Gu X, Zhang H, Kim S (2018) Deep code search. In: ICSE, ACM pp 933–944

  • Guo D, Ren S, Lu S, Feng Z, Tang D, Liu S, Zhou L, Duan N, Svyatkovskiy A, Fu S, Tufano M, Deng SK, Clement CB, Drain D, Sundaresan N, Yin J, Jiang D, Zhou M (2021) Graphcodebert: Pre-training code representations with data flow. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, OpenReview.net. 3-7 May 2021. https://openreview.net/forum?id=jLoC4ez43PZ

  • Haiduc S, Aponte J, Moreno L, Marcus A (2010) On the use of automated text summarization techniques for summarizing source code. In: WCRE, IEEE Computer Society pp 35–44

  • Haije T (2016) Automatic comment generation using a neural translation model. Bachelor’s thesis, University of Amsterdam

  • Haldar R, Wu L, Xiong J, Hockenmaier J (2020) A multi-perspective architecture for semantic code search. In: Jurafsky D, Chai J, Schluter N, Tetreault JR (Eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, Association for Computational Linguistics pp 8563–8568 5-10 July 2020. https://doi.org/10.18653/v1/2020.acl-main.758

  • Haque S, LeClair A, Wu L, McMillan C (2020) Improved automatic summarization of subroutines via attention to file context. In: MSR

  • He K, Fan H, Wu Y, Xie S, Girshick RB (2020) Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, Computer Vision Foundation / IEEE. pp 9726–9735. 13-19 June 2020 https://doi.org/10.1109/CVPR42600.2020.00975

  • Hellendoorn VJ, Sutton C, Singh R, Maniatis P, Bieber D (2020) Global relational models of source code. In: ICLR. OpenReview.net

  • Hu X, Li G, Xia X, Lo D, Jin Z (2018) Deep code comment generation. In: ICPC

  • Hu X, Li G, Xia X, Lo D, Jin Z (2019) Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering

  • Hu X, Li G, Xia X, Lo D, Jin Z (2018) Summarizing source code with transferred api knowledge. In: IJCAI

  • Huang J, Tang D, Shou L, Gong M, Xu K, Jiang D, Zhou M, Duan N (2021) Cosqa: 20, 000+ web queries for code search and question answering. In: ACL

  • Husain H, Wu H, Gazit T, Allamanis M, Brockschmidt M (2019) Codesearchnet challenge: Evaluating the state of semantic code search. CoRR abs/1909.09436. arXiv:1909.09436

  • Iyer S, Konstas I, Cheung A, Zettlemoyer L (2016) Summarizing source code using a neural attention model. In: ACL

  • Iyyer M, Manjunatha V, Boyd-Graber JL, Daumé III H (2015) Deep unordered composition rivals syntactic methods for text classification. In: ACL (1), The Association for Computer Linguistics pp 1681–1691

  • Jain P, Jain A, Zhang T, Abbeel P, Gonzalez J, Stoica I (2021) Contrastive code representation learning. In: EMNLP, Association for Computational Linguistics (1), pp 5954–5971

  • Jiang X, Zheng Z, Lyu C, Li L, Lyu L (2021) Treebert: A tree-based pre-trained model for programming language. UAI, Proceedings of Machine Learning Research, AUAI Press 161:54–63


  • Kanade A, Maniatis P, Balakrishnan G, Shi K (2020) Pre-trained contextual embedding of source code. arXiv:2001.00059

  • Kim Y (2014) Convolutional neural networks for sentence classification. In: EMNLP, ACL pp 1746–1751

  • LeClair A, Bansal A, McMillan C (2021) Ensemble models for neural source code summarization of subroutines. In: ICSME, IEEE pp 286–297

  • LeClair A, Haque S, Wu L, McMillan C (2020) Improved code summarization via a graph neural network. In: ICPC, ACM pp 184–195

  • LeClair A, Jiang S, McMillan C (2019) A neural model for generating natural language summaries of program subroutines. In: ICSE

  • Li W, Qin H, Yan S, Shen B, Chen Y (2020) Learning code-query interaction for enhancing code searches. In: ICSME, IEEE pp 115–126

  • Libovický J, Helcl J, Mareček D (2018) Input combination strategies for multi-source transformer decoder. In: WMT

  • Lin C (2004) ROUGE: A package for automatic evaluation of summaries. In: ACL

  • Lin C, Och FJ (2004) Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: ACL pp 605–612

  • Ling C, Lin Z, Zou Y, Xie B (2020) Adaptive deep code search. In: ICPC ’20: 28th International Conference on Program Comprehension, Seoul, Republic of Korea, ACM pp 48–59 13-15 July 2020. https://doi.org/10.1145/3387904.3389278

  • Ling X, Wu L, Wang S, Pan G, Ma T, Xu F, Liu AX, Wu C, Ji S (2021) Deep graph matching and searching for semantic code retrieval. ACM Trans Knowl Discov Data 15(5): 88:1–88:21. https://doi.org/10.1145/3447571

  • Linstead E, Bajracharya SK, Ngo TC, Rigor P, Lopes CV, Baldi P (2009) Sourcerer: mining and searching internet-scale software repositories. Data Min Knowl Discov 18(2):300–336


  • Liu F, Li G, Zhao Y, Jin Z (2020) Multi-task learning based pre-trained language model for code completion. In: ASE, IEEE pp 473–485

  • Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: A robustly optimized BERT pretraining approach. arXiv:1907.11692

  • Loshchilov I, Hutter F (2019) Decoupled weight decay regularization. In: ICLR

  • Lu M, Sun X, Wang S, Lo D, Duan Y (2015) Query expansion via wordnet for effective code search. In: SANER, IEEE Computer Society pp 545–549

  • Lv F, Zhang H, Lou J, Wang S, Zhang D, Zhao J (2015) Codehow: Effective code search based on API understanding and extended boolean model (E). In: ASE, IEEE Computer Society pp 260–270

  • McMillan C, Grechanik M, Poshyvanyk D, Xie Q, Fu C (2011) Portfolio: finding relevant functions and their usage. In: ICSE, ACM pp 111–120

  • Mou L, Li G, Zhang L, Wang T, Jin Z (2016) Convolutional neural networks over tree structures for programming language processing. In: AAAI

  • van den Oord A, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv:1807.03748

  • Papineni K, Roukos S, Ward T, Zhu W (2002) Bleu: A method for automatic evaluation of machine translation. In: ACL, pp 311–318

  • Parr T (2013) The definitive ANTLR 4 reference (2 ed.). Pragmatic Bookshelf

  • Rodeghero P, McMillan C, McBurney PW, Bosch N, D’Mello SK (2014) Improving automated source code summarization via an eye-tracking study of programmers. In: ICSE, ACM pp 390–401

  • Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) Sourcerercc: Scaling code clone detection to big-code. In: ICSE, ACM pp 1157–1168

  • See A, Liu PJ, Manning CD (2017) Get to the point: Summarization with pointer-generator networks. In: ACL

  • Shi E, Gu W, Wang Y, Du L, Zhang H, Han S, Zhang D, Sun H (2022) Enhancing semantic code search with multimodal contrastive learning and soft data augmentation. https://doi.org/10.48550/arXiv.2204.03293

  • Shi E, Wang Y, Du L, Chen J, Han S, Zhang H, Zhang D, Sun H (2022) On the evaluation of neural code summarization

  • Shi E, Wang Y, Du L, Zhang H, Han S, Zhang D, Sun H (2021) CAST: Enhancing code summarization with hierarchical splitting and reconstruction of abstract syntax trees. In: EMNLP (1), Association for Computational Linguistics pp 4053–4062

  • Shi L, Mu F, Chen X, Wang S, Wang J, Yang Y, Li G, Xia X, Wang Q (2022) Are we building on the rock? on the importance of data preprocessing for code summarization. In: ESEC/SIGSOFT FSE, ACM pp 107–119

  • Shin ECR, Allamanis M, Brockschmidt M, Polozov A (2019) Program synthesis and semantic parsing with learned code idioms. In: NeurIPS, pp 10824–10834

  • Shuai J, Xu L, Liu C, Yan M, Xia X, Lei Y (2020) Improving code search with coattentive representation learning. In: ICPC ’20: 28th International Conference on Program Comprehension, Seoul, Republic of Korea, ACM pp 196–207 July 13-15, 2020. https://doi.org/10.1145/3387904.3389269

  • Sridhara G, Hill E, Muppaneni D, Pollock LL, Vijay-Shanker K (2010) Towards automatically generating summary comments for java methods. In: ASE, pp 43–52

  • Sun Z, Li L, Liu Y, Du X, Li L (2022) On the importance of building high-quality training datasets for neural code search. In: ICSE, ACM pp 1609–1620

  • Svyatkovskiy A, Deng SK, Fu S, Sundaresan N (2020) Intellicode compose: code generation using transformer. In: ESEC/SIGSOFT FSE, ACM pp 1433–1443

  • Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks. In: ACL (1), The Association for Computer Linguistics pp 1556–1566

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: NIPS, pp 5998–6008

  • Vedantam R, Zitnick CL, Parikh D (2015) Cider: Consensus-based image description evaluation. In: CVPR

  • Wan Y, Shu J, Sui Y, Xu G, Zhao Z, Wu J, Yu PS (2019) Multi-modal attention network learning for semantic source code retrieval. In: ASE, IEEE pp 13–25

  • Wan Y, Zhao Z, Yang M, Xu G, Ying H, Wu J, Yu PS (2018) Improving automatic source code summarization via deep reinforcement learning. In: ASE

  • Wang X, Wang Y, Mi F, Zhou P, Wan Y, Liu X, Li L, Wu H, Liu J, Jiang X (2021) Syncobert: Syntax-guided multi-modal contrastive pre-training for code representation. arXiv:2108.04556

  • Wang Y, Du L, Shi E, Hu Y, Han S, Zhang D (2020) Cocogum: Contextual code summarization with multi-relational gnn on umls. Tech rep, Microsoft, MSR-TR-2020-16

  • Wang Y, Li H (2021) Code completion by modeling flattened abstract syntax trees as graphs. In: AAAI

  • Wang Y, Wang W, Joty SR, Hoi SCH (2021) Codet5: Identifier-aware unified pretrained encoder-decoder models for code understanding and generation. In: EMNLP (1), Association for Computational Linguistics pp 8696–8708

  • Wei B, Li G, Xia X, Fu Z, Jin Z (2019) Code generation as a dual task of code summarization. In: NeurIPS, pp 6559–6569

  • Wei B, Li Y, Li G, Xia X, Jin Z (2020) Retrieve and refine: Exemplar-based neural comment generation. In: ASE, IEEE pp 349–360

  • White M, Tufano M, Vendome C, Poshyvanyk D (2016) Deep learning code fragments for code clone detection. In: ASE, ACM pp 87–98

  • Wilcoxon F, Katti S, Wilcox RA (1970) Critical values and probability levels for the wilcoxon rank sum test and the wilcoxon signed rank test. Selected tables in mathematical statistics 1:171–259


  • Wu H, Zhao H, Zhang M (2021) Code summarization with structure-induced transformer. In: ACL/IJCNLP (Findings), Findings of ACL, vol. ACL/IJCNLP 2021, Association for Computational Linguistics pp 1078–1090

  • Wu Y, Lian D, Xu Y, Wu L, Chen E (2020) Graph convolutional networks with markov random field reasoning for social spammer detection. In: AAAI, AAAI Press pp 1054–1061

  • Wu Z, Xiong Y, Yu SX, Lin D (2018) Unsupervised feature learning via non-parametric instance discrimination. In: CVPR, Computer Vision Foundation /IEEE Computer Society pp 3733–3742

  • Yang M, Zhou M, Li Z, Liu J, Pan L, Xiong H, King I (2022) Hyperbolic graph neural networks: A review of methods and applications. arXiv:2202.13852

  • Ye W, Xie R, Zhang J, Hu T, Wang X, Zhang S (2020) Leveraging code generation to improve code retrieval and summarization via dual learning. In: Huang Y, King I, Liu T, van Steen M (Eds) WWW '20: The Web Conference 2020, Taipei, Taiwan, ACM / IW3C2 pp 2309–2319. 20-24 April 2020. https://doi.org/10.1145/3366423.3380295

  • Yu X, Huang Q, Wang Z, Feng Y, Zhao D (2020) Towards context-aware code comment generation. In: EMNLP (Findings), Association for Computational Linguistics pp 3938–3947

  • Zhang J, Panthaplackel S, Nie P, Mooney RJ, Li JJ, Gligoric M (2021) Learning to generate code comments from class hierarchies. arXiv:2103.13426

  • Zhang J, Wang X, Zhang H, Sun H, Liu X (2020) Retrieval-based neural source code summarization. In: ICSE

  • Zhang J, Wang X, Zhang H, Sun H, Wang K, Liu X (2019) A novel neural source code representation based on abstract syntax tree. In: ICSE

  • Zhu Q, Sun Z, Liang X, Xiong Y, Zhang L (2020) Ocor: An overlapping-aware code retriever. In: 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Melbourne, Australia, IEEE pp 883–894 21-25 Sept 2020. https://doi.org/10.1145/3324884.3416530


Acknowledgements

We thank the reviewers for their time on this work. This research was supported by the National Key R&D Program of China (No. 2017YFA0700800) and the Fundamental Research Funds for the Central Universities under Grant xtr072022001. We also thank the participants of our human evaluation for their time.

Author information

Corresponding authors

Correspondence to Yanlin Wang or Hongbin Sun.

Additional information

Communicated by: Foutse Khomh.


Appendices


A Code Summarization

1.1 A.1 Hyperparameters

Table 13 summarizes the hyperparameters used in our experiments. \(len_{code}\) and \(len_{sum}\) are the sequence lengths of the code and the summary, respectively; each covers at least 90% of the training set. \(vocab_{code}\), \(vocab_{sum}\), and \(vocab_{ast}\) are the vocabulary sizes of the code, the summary, and the AST. \(len_{pos}\) refers to the clipping distance of the relative position. \(d_{Emb}\) is the dimension of the embedding layer. heads and layers indicate the number of attention heads and layers in the Transformer, respectively. \(d_{ff}\) is the hidden dimension of the feed-forward layers in the Transformer. \(d_{RvNN}\) and \(activate_f\) are the hidden state dimension and the activation function of the RvNN, respectively. \(layer_{ff}\) is the number of feed-forward layers in the copy component.

The values of these hyperparameters follow the related work (Ahmad et al. 2020; Zhang et al. 2019). We tune the sizes of \(d_{Emb}\), \(d_{ff}\), and \(d_{RvNN}\) empirically, and the batch size is set according to the available memory.

Table 13 Hyperparameters in our experiments on code summarization
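As noted above, \(len_{code}\) and \(len_{sum}\) are chosen so that each covers at least 90% of the training samples. One way to derive such a cutoff is a simple percentile computation; the sketch below is illustrative and the variable names are assumptions.

```python
import math


def length_covering(token_sequences, ratio=0.9):
    """Smallest sequence length that covers at least `ratio` of the samples."""
    lengths = sorted(len(seq) for seq in token_sequences)
    return lengths[math.ceil(ratio * len(lengths)) - 1]


# Hypothetical usage on tokenized training data:
# len_code = length_covering(train_code_tokens)    # candidate value for len_code
# len_sum = length_covering(train_summary_tokens)  # candidate value for len_sum
```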

1.2 A.2 Runtime of Different AST Representation Models

Tables 14 and 15 show the time cost of the different approaches on the two datasets. The second column reports the training time per epoch, the third column the total training time, and the last column the inference time per function. On the Funcom dataset, the baselines have different training times: most range from 7 to 69 hours, except Hybrid-DRL, which takes 633 hours. The inference time per function is 0.0073s for our approach and 0.001s to 0.0139s for the baselines. Our approach thus has a time cost comparable to the baselines. Similar observations hold on TL-CodeSum. During training, Hybrid-DRL first learns a hybrid code representation by applying an LSTM and a Tree-LSTM to the code tokens and the AST, and then generates summaries with actor-critic reinforcement learning, which is why it takes so long to train.

Table 14 Time cost of different AST representation models on Funcom for code summarization
Table 15 Time cost of different AST representation models on TL-CodeSum for code summarization

1.3 A.3 Experimental Results on Deduplicated Dataset

In our experiments, we find code duplication in TL-CodeSum: around \(20\%\) of the code snippets in the test set also appear in the training set. Thus, we remove the duplicated samples from the test set and re-evaluate all approaches. Table 16 shows the results of the different models on the remaining test set without duplicated samples. Our model still outperforms all baselines.

Table 16 Performance of different models on deduplicated TL-CodeSum dataset
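For reference, deduplication of this kind can be implemented by dropping test samples whose code also occurs in the training set. The sketch below assumes exact matching on whitespace-joined code tokens and a hypothetical "code_tokens" field; the paper does not prescribe a specific matching criterion.

```python
def deduplicate_test_set(train_samples, test_samples):
    """Remove test samples whose code also appears in the training set (exact match)."""
    key = lambda sample: " ".join(sample["code_tokens"])  # hypothetical field name
    train_keys = {key(s) for s in train_samples}
    return [s for s in test_samples if key(s) not in train_keys]
```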

1.4 A.4 Evaluation Metrics

We provide the details of the evaluation metrics we used in the experiments.

1.4.1 A.4.1 BLEU

BLEU measures the average n-gram precision between the generated sentences and the reference sentences, with a brevity penalty for short sentences. BLEU-1/2/3/4 is computed as:

$$\begin{aligned} {\text {BLEU-N}}= BP \cdot \exp \left( \sum _{n=1}^{N} \omega _{n} \log p_{n}\right) , \end{aligned}$$
(14)

where \(p_n\) (the n-gram precision) is the fraction of n-grams in the generated sentences that are present in the reference sentences, and \(\omega _{n}\) is the uniform weight 1/N. Since the generated summaries are short, high-order n-grams may not overlap, so we use the +1 smoothing function (Lin and Och 2004). BP is the brevity penalty, given as:

$$\begin{aligned} BP=\left\{ \begin{array}{cl} 1 &{} \text{ if } c>r \\ e^{(1-r / c)} &{} \text{ if } c \le r \end{array}\right. \end{aligned}$$
(15)

Here, c is the length of the generated summary, and r is the length of the reference sentence.
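As a self-contained illustration of Eqs. (14) and (15), the sketch below computes a smoothed sentence-level BLEU with clipped n-gram precision, +1 smoothing, and the brevity penalty; it is not the exact evaluation script used in our experiments.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU-N with +1 smoothing, following Eqs. (14)-(15)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())  # clipped matches
        p_n = (overlap + 1) / (sum(cand.values()) + 1)  # +1 smoothing (Lin and Och 2004)
        log_precisions.append(math.log(p_n))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))  # brevity penalty, Eq. (15)
    return bp * math.exp(sum(log_precisions) / max_n)   # uniform weights w_n = 1/N


# Example with hypothetical token lists:
# bleu("return the maximum value".split(), "returns the maximum value".split())
```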

1.4.2 A.4.2 ROUGE-L

Based on the longest common subsequence (LCS), ROUGE-L is widely used in text summarization. Instead of using only recall, it uses the F-score, which is the harmonic mean of precision and recall. Suppose A and B are the generated and reference summaries of lengths c and r, respectively; then:

$$\begin{aligned} \left\{ \begin{array}{cl} P_{ROUGE-L}=\frac{LCS(A, B)}{c}\\ R_{ROUGE-L}=\frac{LCS(A, B)}{r}\\ \end{array}\right. \end{aligned}$$
(16)

\(F_{ROUGE-L}\), which indicates the value of \({\text {ROUGE-L}}\), is calculated as the weighted harmonic mean of \(P_{ROUGE-L}\) and \(R_{ROUGE-L}\):

$$\begin{aligned} F_{ROUGE-L}=\frac{\left( 1+\beta ^{2}\right) P_{ROUGE-L} \cdot R_{ROUGE-L}}{R_{ROUGE-L}+\beta ^{2} P_{ROUGE-L}} \end{aligned}$$
(17)

\(\beta \) is set to 1.2, as in Zhang et al. (2020) and Wan et al. (2018).
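The computation in Eqs. (16) and (17) can be written down directly; the sketch below is illustrative, assumes token lists as input, and returns 0 when there is no overlap.

```python
def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score following Eqs. (16)-(17)."""
    c, r = len(candidate), len(reference)
    # dynamic-programming table for the longest common subsequence
    dp = [[0] * (r + 1) for _ in range(c + 1)]
    for i in range(1, c + 1):
        for j in range(1, r + 1):
            if candidate[i - 1] == reference[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[c][r]
    if lcs == 0:
        return 0.0
    p, rec = lcs / c, lcs / r
    return (1 + beta ** 2) * p * rec / (rec + beta ** 2 * p)
```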

1.4.3 A.4.3 METEOR

METEOR is a recall-oriented metric that measures how well the model captures the content of the references in the generated sentences, and it correlates better with human judgment. Suppose m is the number of mapped unigrams between the generated sentence of length c and the reference sentence of length r. Then precision, recall, and the F-score are given as:

$$\begin{aligned} P=\frac{m}{c},\,\, R=\frac{m}{r},\,\, F=\frac{P R}{\alpha P+ (1-\alpha )R} \end{aligned}$$
(18)

The sequence of mapped unigrams between the two sentences is divided into the fewest possible number of “chunks”, such that the matched unigrams in each chunk are adjacent in both sentences and appear in the same order. The penalty is then computed as:

$$\begin{aligned} \text{ Pen }=\gamma \cdot \text{ frag } ^{\beta } \end{aligned}$$
(19)

where \(\text {frag}\) is the fragmentation fraction \(\text {frag}=ch/m\), ch is the number of matched chunks, and m is the total number of matches. The final METEOR score is \(F \cdot (1-\text{ Pen })\). The default values of \(\alpha \), \(\beta \), and \(\gamma \) are 0.9, 3.0, and 0.5, respectively.
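As a rough illustration of Eqs. (18) and (19), the sketch below computes a simplified METEOR using exact unigram matches only; the full metric additionally aligns stems and synonyms, so this is not a replacement for the official scorer.

```python
def meteor_exact(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5):
    """Simplified METEOR (exact unigram matches only), following Eqs. (18)-(19)."""
    # greedy exact alignment of candidate unigrams to unused reference unigrams
    matches, used = [], set()
    for i, token in enumerate(candidate):
        for j, ref_token in enumerate(reference):
            if j not in used and token == ref_token:
                matches.append((i, j))
                used.add(j)
                break
    m = len(matches)
    if m == 0:
        return 0.0
    p, r = m / len(candidate), m / len(reference)
    f = p * r / (alpha * p + (1 - alpha) * r)
    # chunks are maximal runs of matches that are contiguous in both sentences
    chunks = 1
    for (i1, j1), (i2, j2) in zip(matches, matches[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = gamma * (chunks / m) ** beta
    return f * (1 - penalty)
```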

1.4.4 A.4.4 CIDER

CIDER is a consensus-based evaluation metric originally used for image captioning tasks. The notions of importance and accuracy are captured by computing a TF-IDF weight for each n-gram and using cosine similarity to measure sentence similarity. To compute CIDER, we first calculate the TF-IDF weighting \(g_k(s_i)\) for each n-gram \(\omega _k\) in the reference sentence \(s_i\), where \(\omega _k\) ranges over the vocabulary of all n-grams. Then we use the cosine similarity between the generated sentence and the reference sentences to compute the \({\text {CIDER}}_n\) score for n-grams of length n:

$$\begin{aligned} {\text {CIDER}}_{n}\left( c_{i}, s_{i}\right) =\frac{\left\langle \varvec{g}^{\varvec{n}}\left( c_{i}\right) , \varvec{g}^{\varvec{n}}\left( s_{i}\right) \right\rangle }{\left\| \varvec{g}^{n}\left( c_{i}\right) \right\| \left\| \varvec{g}^{n}\left( s_{i}\right) \right\| } \end{aligned}$$
(20)

where \(\varvec{g}^{\varvec{n}}(s_i)\) is the vector formed by the \(g_k(s_i)\) of all n-grams of length n (with n varying from 1 to 4), and \(c_i\) is the \(i^{th}\) generated sentence. Finally, the scores for the various n-gram lengths are combined to compute CIDER:

$$\begin{aligned} {\text {CIDER}}\left( c_{i}, s_{i}\right) =\sum _{n=1}^{N} w_{n} {\text {CIDER}}_{n}\left( c_{i}, s_{i}\right) \end{aligned}$$
(21)
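A simplified, self-contained version of Eqs. (20) and (21) is sketched below. It uses plain TF-IDF weights and omits the length penalty of the CIDEr-D variant; the `corpus_references` argument, which supplies the sentences used for document frequencies, is an assumed interface.

```python
import math
from collections import Counter


def _ngram_counts(tokens, k):
    return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))


def _tfidf(counts, doc_freq, num_docs):
    total = sum(counts.values()) or 1
    return {g: (c / total) * math.log(num_docs / (1 + doc_freq[g])) for g, c in counts.items()}


def _cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cider(candidate, references, corpus_references, max_n=4):
    """Simplified CIDEr following Eqs. (20)-(21): TF-IDF weighted n-gram cosine
    similarity, averaged over n = 1..max_n and over the reference set."""
    score = 0.0
    for k in range(1, max_n + 1):
        # document frequency of each n-gram across the reference corpus
        doc_freq = Counter(g for ref in corpus_references for g in set(_ngram_counts(ref, k)))
        g_cand = _tfidf(_ngram_counts(candidate, k), doc_freq, len(corpus_references))
        sims = [_cosine(g_cand, _tfidf(_ngram_counts(ref, k), doc_freq, len(corpus_references)))
                for ref in references]
        score += (sum(sims) / len(references)) / max_n
    return score
```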

1.5 A.5 Human evaluation

Table 17 Statistical significance (p-values) of CoCoAST over other methods in the human evaluation

We conduct a human evaluation to assess the quality of the summaries generated by our approach CoCoAST and the other three approaches. The results show that CoCoAST outperforms the others in all three aspects: similarity, naturalness, and informativeness. We further confirm the advantage of our approach using Wilcoxon signed-rank tests on the human evaluation scores. The results shown in Table 17 indicate that the improvement of CoCoAST over the other approaches is statistically significant, with all p-values smaller than 0.05 at the 95% confidence level (except CodeAstnn on naturalness).
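For reference, such a paired test can be run with SciPy; the score arrays below are placeholders rather than the actual ratings collected in our study.

```python
from scipy.stats import wilcoxon

# Hypothetical paired ratings (one entry per evaluated code snippet) for two systems.
cocoast_scores = [4, 5, 3, 4, 4, 5, 3, 4]
baseline_scores = [3, 4, 3, 3, 4, 4, 2, 4]

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(cocoast_scores, baseline_scores)
print(f"W = {stat:.2f}, p = {p_value:.4f}")  # p < 0.05 indicates a significant difference
```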

B Code Search

1.1 B.1 Hyperparameters

Table 18 summarizes the hyperparameters used in our experiments on code search. The values of these hyperparameters follow the related work (Zhang et al. 2019). We tune the sizes of \(d_{Emb}\), \(d_{ff}\), and \(d_{RvNN}\) empirically, and the batch size is set according to the available memory.

Table 18 Hyperparameters in our experiments on code search

C Full AST

Fig. 8 Full AST

D Full Rules

Fig. 9 Full splitting rules. The Body and Block are treated the same in the implementation


Cite this article

Shi, E., Wang, Y., Du, L. et al. CoCoAST: Representing Source Code via Hierarchical Splitting and Reconstruction of Abstract Syntax Trees. Empir Software Eng 28, 135 (2023). https://doi.org/10.1007/s10664-023-10378-9
