
Path context augmented statement and network for learning programs

Published in: Empirical Software Engineering

Abstract

Applying machine learning techniques to program analysis has attracted much attention. Recent research on detecting code clones and classifying code has shown that neural models based on abstract syntax trees (ASTs) can represent source code better than other approaches. However, existing AST-based approaches do not take into account the contextual information of a program, such as statement context. To address this issue, we propose path context, a novel approach to capturing the context of statements, and a path context augmented network (PCAN) for learning programs. We evaluate PCAN on code clone detection, source code classification, and method naming. The results show that, compared to state-of-the-art approaches, PCAN performs best on code clone detection and comparably on code classification and method naming.
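As a rough illustration of the idea of a statement's path context (this is a toy sketch, not the authors' implementation, which works on Java/C ASTs with learned embeddings), the snippet below walks a Python AST and records, for each statement, the chain of node types from the module root down to that statement:

```python
import ast

def statement_path_contexts(source: str):
    """For each statement node in the parsed source, collect the chain of
    AST node-type names from the root down to it -- a simple stand-in for
    a 'path context' situating a statement within its syntax tree."""
    tree = ast.parse(source)
    contexts = []

    def walk(node, path):
        for child in ast.iter_child_nodes(node):
            child_path = path + [type(node).__name__]
            if isinstance(child, ast.stmt):
                contexts.append((type(child).__name__, child_path))
            walk(child, child_path)

    walk(tree, [])
    return contexts

code = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
for stmt, path in statement_path_contexts(code):
    print(stmt, "<-", "/".join(path))
# e.g. the first return statement sits on the path Module/FunctionDef/If
```

Two syntactically identical `return` statements get different contexts here (one nested under `If`, one not), which is the kind of distinction a purely local AST encoding misses.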


Figures 1–8 (available in the full article)


Notes

  1. https://docs.oracle.com/javase/tutorial/java/nutsandbolts/expressions.html

  2. pycparser homepage, https://pypi.python.org/pypi/pycparser

  3. javaparser homepage, https://github.com/javaparser/javaparser

  4. We built our dataset by following the data preparation method in CDLH (Wei and Li, 2017) so that we could directly use and compare with their results.

  5. Astminer: https://github.com/JetBrains-Research/astminer

  6. https://github.com/HongliangLiang/pcan/

References

  • Ahmadi M, Farkhani RM, Williams R, Lu L (2021) Finding bugs using your own code: detecting functionally-similar yet inconsistent code. In: 30th USENIX Security Symposium (USENIX Security 21)

  • Allamanis M, Brockschmidt M, Khademi M (2018) Learning to represent programs with graphs. In: 6th International Conference on Learning Representations, ICLR 2018. arXiv:1711.00740

  • Alon U, Zilberstein M, Levy O, Yahav E (2018) A general path-based representation for predicting program properties. SIGPLAN Not 53(4):404–419. https://doi.org/10.1145/3296979.3192412


  • Alon U, Levy O, Brody S, Yahav E (2019a) Code2Seq: generating sequences from structured representations of code. In: 7th International Conference on Learning Representations, ICLR 2019. arXiv:1808.01400

  • Alon U, Zilberstein M, Levy O, Yahav E (2019b) Code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3(POPL):1–29. https://doi.org/10.1145/3290353. arXiv:1803.09473

  • Cai D, Lam W (2020) Graph transformer for graph-to-sequence learning. arXiv:1911.07470

  • Cho K, van Merrienboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. In: Wu D, Carpuat M, Carreras X, Vecchi EM (eds) Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014. Association for Computational Linguistics, pp 103–111. https://doi.org/10.3115/v1/W14-4012. https://www.aclweb.org/anthology/W14-4012/

  • Cummins C, Fisches ZV, Ben-Nun T, Hoefler T, Leather H (2020) ProGraML: graph-based deep learning for program optimization and analysis. arXiv:2003.10536

  • Falke R, Frenzel P, Koschke R (2008) Empirical evaluation of clone detection using syntax suffix trees. Empirical Software Engineering 13(6):601–643. https://doi.org/10.1007/s10664-008-9073-9

  • Fang C, Liu Z, Shi Y, Huang J, Shi Q (2020) Functional code clone detection with syntax and semantics fusion learning. In: ISSTA 2020. https://doi.org/10.1145/3395363.3397362

  • Fernandes P, Allamanis M, Brockschmidt M (2019) Structured neural summarization. In: 7th International Conference on Learning Representations, ICLR 2019. arXiv:1811.01824

  • Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. In: 34th International Conference on Machine Learning, ICML 2017, pp 2053–2070. arXiv:1704.01212

  • Goffi A, Gorla A, Mattavelli A, Pezzè M, Tonella P (2014) Search-based synthesis of equivalent method sequences. In: FSE 2014

  • Hellendoorn VJ, Sutton C, Singh R, Maniatis P, Bieber D (2019) Global relational models of source code. In: International conference on learning representations

  • Hindle A, Barr ET, Gabel M, Su Z, Devanbu PT (2016) On the naturalness of software. Commun ACM 59(5):122–131. https://doi.org/10.1145/2902362


  • Hu X, Li G, Xia X, Lo D, Jin Z (2020) Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering 25(3):2179–2217. https://doi.org/10.1007/s10664-019-09730-9

  • Jiang L, Misherghi G, Su Z, Glondu S (2007) Deckard: scalable and accurate tree-based detection of code clones. In: 29th International Conference on Software Engineering (ICSE'07), pp 96–105

  • Kamiya T, Kusumoto S, Inoue K (2002) Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Software Eng 28:654–670


  • Khandelwal U, He H, Qi P, Jurafsky D (2018) Sharp nearby, fuzzy far away: how neural language models use context. In: Proceedings of the 56th annual meeting of the association for computational linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pp 284–294. https://doi.org/10.18653/v1/P18-1027. https://www.aclweb.org/anthology/P18-1027

  • Kingma DP, Ba JL (2015) Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015. arXiv:1412.6980

  • Li L, Feng H, Zhuang W, Meng N, Ryder B (2017) CCLearner: a deep learning-based clone detection approach. In: Proceedings - 2017 IEEE international conference on software maintenance and evolution, ICSME 2017, pp 249–260. https://doi.org/10.1109/ICSME.2017.46

  • Li Y, Zemel R, Brockschmidt M, Tarlow D (2016) Gated graph sequence neural networks. In: 4th International Conference on Learning Representations, ICLR 2016. arXiv:1511.05493

  • Li Y, Gu C, Dullien T, Vinyals O, Kohli P (2019) Graph matching networks for learning the similarity of graph structured objects. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, PMLR, Proceedings of Machine Learning Research, vol 97, pp 3835–3845. http://proceedings.mlr.press/v97/li19d.html

  • Lin Z, Feng M, Dos Santos CN, Yu M, Xiang B, Zhou B, Bengio Y (2017) A structured self-attentive sentence embedding. In: 5th International Conference on Learning Representations, ICLR 2017. arXiv:1703.03130

  • Linares-Vásquez M, Mcmillan C, Poshyvanyk D, Grechanik M (2014) On using machine learning to automatically classify software applications into domain categories. Empirical Software Engineering 19(3):582–618. https://doi.org/10.1007/s10664-012-9230-z

  • Luong T, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. In: Màrquez L, Callison-Burch C, Su J, Pighin D, Marton Y (eds) Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, The Association for Computational Linguistics, pp 1412–1421. https://doi.org/10.18653/v1/d15-1166

  • Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR 2013, Workshop Track. arXiv:1301.3781

  • Mou L, Li G, Zhang L, Wang T, Jin Z (2016) Convolutional neural networks over tree structures for programming language processing. In: 30th AAAI Conference on Artificial Intelligence, AAAI 2016, pp 1287–1293. arXiv:1409.5718

  • Nafi KW, Kar TS, Roy B, Roy CK, Schneider KA (2019) CLCDSA: Cross language code clone detection using syntactical features and API documentation. Proceedings - 2019 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, pp 1026–1037. https://doi.org/10.1109/ASE.2019.00099

  • Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S (2017) graph2vec: learning distributed representations of graphs. CoRR abs/1707.05005. arXiv:1707.05005

  • Ragkhitwetsagul C, Krinke J (2019) Siamese: scalable and incremental code clone search via multiple code representations. Empirical Software Engineering 24(4):2236–2284. https://doi.org/10.1007/s10664-019-09697-7


  • Roy CK, Cordy JR (2008) NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. IEEE International Conference on Program Comprehension, pp 172–181. https://doi.org/10.1109/ICPC.2008.41

  • Saini V, Farmahinifarahani F, Lu Y, Baldi P, Lopes CV (2018) Oreo: detection of clones in the twilight zone. In: ESEC/FSE 2018 - Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp 354–365. https://doi.org/10.1145/3236024.3236026. arXiv:1806.05837

  • Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) Sourcerercc: Scaling code clone detection to big-code. 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp 1157–1168

  • Socher R, Pennington J, Huang EH, Ng AY, Manning CD (2011) Semi-supervised recursive autoencoders for predicting sentiment distributions. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL, pp 151–161, https://www.aclweb.org/anthology/D11-1014/

  • Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks. In: ACL-IJCNLP 2015, vol 1, pp 1556–1566. https://doi.org/10.3115/v1/p15-1150. arXiv:1503.00075

  • Tufano M, Watson C, Bavota G, Di Penta M, White M, Poshyvanyk D (2018) Deep learning similarities from different representations of source code. In: Proceedings - International Conference on Software Engineering, IEEE Computer Society, vol 18, pp 542–553. https://doi.org/10.1145/3196398.3196431

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems, vol 30, pp 5999–6009. arXiv:1706.03762

  • Wang W, Li G, Ma B, Xia X, Jin Z (2020a) Detecting code clones with graph neural network and flow-augmented abstract syntax tree. arXiv:2002.08653

  • Wang W, Li G, Shen S, Xia X, Jin Z (2020b) Modular tree network for source code representation learning. ACM Transactions on Software Engineering and Methodology 29(4):1–23. https://doi.org/10.1145/3409331

  • Wang Y, Wang K, Gao F, Wang L (2020c) Learning semantic program embeddings with graph interval neural network. Proceedings of the ACM on Programming Languages 4(OOPSLA):1–27

  • Wei HH, Li M (2017) Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In: IJCAI International Joint Conference on Artificial Intelligence, pp 3034–3040. https://doi.org/10.24963/ijcai.2017/423

  • White M, Tufano M, Vendome C, Poshyvanyk D (2016) Deep learning code fragments for code clone detection. ASE 2016 - Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pp 87–98. https://doi.org/10.1145/2970276.2970326

  • Zhang J, Wang X, Zhang H, Sun H, Wang K, Liu X (2019) A novel neural source code representation based on abstract syntax tree. In: Proceedings - International Conference on Software Engineering, ICSE 2019, pp 783–794. https://doi.org/10.1109/ICSE.2019.00086


Author information


Corresponding author

Correspondence to Hongliang Liang.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by Foutse Khomh, Gemma Catolino and Pasquale Salza.

This article belongs to the Topical Collection: Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE)


About this article


Cite this article

Xiao, D., Hang, D., Ai, L. et al. Path context augmented statement and network for learning programs. Empir Software Eng 27, 37 (2022). https://doi.org/10.1007/s10664-021-10098-y

