
Path context augmented statement and network for learning programs

Published in: Empirical Software Engineering

Abstract

Applying machine learning techniques to program analysis has attracted much attention. Recent research on detecting code clones and classifying code has shown that neural models based on abstract syntax trees (ASTs) can represent source code better than other approaches. However, existing AST-based approaches do not take into account the contextual information of a program, such as statement context. To address this issue, we propose path context, a novel approach to capturing the context of statements, and a path context augmented network (PCAN) for learning programs. We evaluate PCAN on code clone detection, source code classification, and method naming. The results show that, compared to state-of-the-art approaches, PCAN performs best on code clone detection and comparably on code classification and method naming.
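As a rough illustration of the idea of a statement's path context (this is a toy sketch, not the authors' implementation, which works on Java/C ASTs with learned embeddings), the snippet below walks a Python AST and records, for each statement, the chain of node types from the module root down to that statement:

```python
import ast

def statement_path_contexts(source: str):
    """For each statement node in the parsed source, collect the chain of
    AST node-type names from the root down to it -- a simple stand-in for
    a 'path context' situating a statement within its syntax tree."""
    tree = ast.parse(source)
    contexts = []

    def walk(node, path):
        for child in ast.iter_child_nodes(node):
            child_path = path + [type(node).__name__]
            if isinstance(child, ast.stmt):
                contexts.append((type(child).__name__, child_path))
            walk(child, child_path)

    walk(tree, [])
    return contexts

code = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
for stmt, path in statement_path_contexts(code):
    print(stmt, "<-", "/".join(path))
# e.g. the first return statement sits on the path Module/FunctionDef/If
```

Two syntactically identical `return` statements get different contexts here (one nested under `If`, one not), which is the kind of distinction a purely local AST encoding misses.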


Figures 1–8 (available in the full article)


Notes

  1. https://docs.oracle.com/javase/tutorial/java/nutsandbolts/expressions.html

  2. pycparser homepage, https://pypi.python.org/pypi/pycparser

  3. javaparser homepage, https://github.com/javaparser/javaparser

  4. We built our dataset by following the data preparation method in CDLH (Wei and Li, 2017) so that we could directly use and compare with their results.

  5. Astminer: https://github.com/JetBrains-Research/astminer

  6. https://github.com/HongliangLiang/pcan/

References

  • Ahmadi M, Farkhani RM, Williams R, Lu L (2021) Finding bugs using your own code: detecting functionally-similar yet inconsistent code. In: 30th USENIX Security Symposium (USENIX Security 21)

  • Allamanis M, Brockschmidt M, Khademi M (2018) Learning to represent programs with graphs. In: 6th International Conference on Learning Representations, ICLR 2018. arXiv:1711.00740

  • Alon U, Zilberstein M, Levy O, Yahav E (2018) A general path-based representation for predicting program properties. SIGPLAN Not 53(4):404–419. https://doi.org/10.1145/3296979.3192412


  • Alon U, Levy O, Brody S, Yahav E (2019a) Code2Seq: generating sequences from structured representations of code. In: 7th International Conference on Learning Representations, ICLR 2019. arXiv:1808.01400

  • Alon U, Zilberstein M, Levy O, Yahav E (2019b) Code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3(POPL):1–29. https://doi.org/10.1145/3290353. arXiv:1803.09473

  • Cai D, Lam W (2020) Graph transformer for graph-to-sequence learning. arXiv:1911.07470

  • Cho K, van Merrienboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. In: Wu D, Carpuat M, Carreras X, Vecchi EM (eds) Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014. Association for Computational Linguistics, pp 103–111. https://doi.org/10.3115/v1/W14-4012. https://www.aclweb.org/anthology/W14-4012/

  • Cummins C, Fisches ZV, Ben-Nun T, Hoefler T, Leather H (2020) ProGraML: graph-based deep learning for program optimization and analysis. arXiv:2003.10536

  • Falke R, Frenzel P, Koschke R (2008) Empirical evaluation of clone detection using syntax suffix trees. Empirical Software Engineering 13(6):601–643. https://doi.org/10.1007/s10664-008-9073-9

  • Fang C, Liu Z, Shi Y, Huang J, Shi Q (2020) Functional code clone detection with syntax and semantics fusion learning. In: ISSTA 2020. https://doi.org/10.1145/3395363.3397362

  • Fernandes P, Allamanis M, Brockschmidt M (2019) Structured neural summarization. In: 7th International Conference on Learning Representations, ICLR 2019. arXiv:1811.01824

  • Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. In: 34th International Conference on Machine Learning, ICML 2017, pp 2053–2070. arXiv:1704.01212

  • Goffi A, Gorla A, Mattavelli A, Pezzè M, Tonella P (2014) Search-based synthesis of equivalent method sequences. In: FSE 2014

  • Hellendoorn VJ, Sutton C, Singh R, Maniatis P, Bieber D (2019) Global relational models of source code. In: International conference on learning representations

  • Hindle A, Barr ET, Gabel M, Su Z, Devanbu PT (2016) On the naturalness of software. Commun ACM 59(5):122–131. https://doi.org/10.1145/2902362


  • Hu X, Li G, Xia X, Lo D, Jin Z (2020) Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering 25(3):2179–2217. https://doi.org/10.1007/s10664-019-09730-9

  • Jiang L, Misherghi G, Su Z, Glondu S (2007) Deckard: scalable and accurate tree-based detection of code clones. In: 29th International Conference on Software Engineering (ICSE'07), pp 96–105

  • Kamiya T, Kusumoto S, Inoue K (2002) Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Software Eng 28:654–670


  • Khandelwal U, He H, Qi P, Jurafsky D (2018) Sharp nearby, fuzzy far away: how neural language models use context. In: Proceedings of the 56th annual meeting of the association for computational linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pp 284–294. https://doi.org/10.18653/v1/P18-1027. https://www.aclweb.org/anthology/P18-1027

  • Kingma DP, Ba JL (2015) Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015. arXiv:1412.6980

  • Li L, Feng H, Zhuang W, Meng N, Ryder B (2017) CCLearner: a deep learning-based clone detection approach. In: Proceedings - 2017 IEEE international conference on software maintenance and evolution, ICSME 2017, pp 249–260. https://doi.org/10.1109/ICSME.2017.46

  • Li Y, Zemel R, Brockschmidt M, Tarlow D (2016) Gated graph sequence neural networks. In: 4th International Conference on Learning Representations, ICLR 2016. arXiv:1511.05493

  • Li Y, Gu C, Dullien T, Vinyals O, Kohli P (2019) Graph matching networks for learning the similarity of graph structured objects. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, PMLR, Proceedings of Machine Learning Research, vol 97, pp 3835–3845. http://proceedings.mlr.press/v97/li19d.html

  • Lin Z, Feng M, Dos Santos CN, Yu M, Xiang B, Zhou B, Bengio Y (2017) A structured self-attentive sentence embedding. In: 5th International Conference on Learning Representations, ICLR 2017. arXiv:1703.03130

  • Linares-Vásquez M, Mcmillan C, Poshyvanyk D, Grechanik M (2014) On using machine learning to automatically classify software applications into domain categories. Empirical Software Engineering 19(3):582–618. https://doi.org/10.1007/s10664-012-9230-z

  • Luong T, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. In: Màrquez L, Callison-Burch C, Su J, Pighin D, Marton Y (eds) Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, The Association for Computational Linguistics, pp 1412–1421. https://doi.org/10.18653/v1/d15-1166

  • Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR 2013, Workshop Track. arXiv:1301.3781

  • Mou L, Li G, Zhang L, Wang T, Jin Z (2016) Convolutional neural networks over tree structures for programming language processing. In: 30th AAAI Conference on Artificial Intelligence, AAAI 2016, pp 1287–1293. arXiv:1409.5718

  • Nafi KW, Kar TS, Roy B, Roy CK, Schneider KA (2019) CLCDSA: Cross language code clone detection using syntactical features and API documentation. Proceedings - 2019 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, pp 1026–1037. https://doi.org/10.1109/ASE.2019.00099

  • Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S (2017) graph2vec: learning distributed representations of graphs. CoRR abs/1707.05005. arXiv:1707.05005

  • Ragkhitwetsagul C, Krinke J (2019) Siamese: scalable and incremental code clone search via multiple code representations. Empirical Software Engineering 24(4):2236–2284. https://doi.org/10.1007/s10664-019-09697-7


  • Roy CK, Cordy JR (2008) NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. IEEE International Conference on Program Comprehension, pp 172–181. https://doi.org/10.1109/ICPC.2008.41

  • Saini V, Farmahinifarahani F, Lu Y, Baldi P, Lopes CV (2018) Oreo: detection of clones in the twilight zone. In: ESEC/FSE 2018 - Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp 354–365. https://doi.org/10.1145/3236024.3236026. arXiv:1806.05837

  • Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) Sourcerercc: Scaling code clone detection to big-code. 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp 1157–1168

  • Socher R, Pennington J, Huang EH, Ng AY, Manning CD (2011) Semi-supervised recursive autoencoders for predicting sentiment distributions. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL, pp 151–161, https://www.aclweb.org/anthology/D11-1014/

  • Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks. In: ACL-IJCNLP 2015, vol 1, pp 1556–1566. https://doi.org/10.3115/v1/p15-1150. arXiv:1503.00075

  • Tufano M, Watson C, Bavota G, Di Penta M, White M, Poshyvanyk D (2018) Deep learning similarities from different representations of source code. In: Proceedings - International Conference on Software Engineering, IEEE Computer Society, vol 18, pp 542–553. https://doi.org/10.1145/3196398.3196431

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems, vol 30, pp 5999–6009. arXiv:1706.03762

  • Wang W, Li G, Ma B, Xia X, Jin Z (2020a) Detecting code clones with graph neural network and flow-augmented abstract syntax tree. arXiv:2002.08653

  • Wang W, Li G, Shen S, Xia X, Jin Z (2020b) Modular tree network for source code representation learning. ACM Transactions on Software Engineering and Methodology 29(4):1–23. https://doi.org/10.1145/3409331

  • Wang Y, Wang K, Gao F, Wang L (2020c) Learning semantic program embeddings with graph interval neural network. Proceedings of the ACM on Programming Languages 4(OOPSLA):1–27

  • Wei HH, Li M (2017) Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In: IJCAI International Joint Conference on Artificial Intelligence, pp 3034–3040. https://doi.org/10.24963/ijcai.2017/423

  • White M, Tufano M, Vendome C, Poshyvanyk D (2016) Deep learning code fragments for code clone detection. ASE 2016 - Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pp 87–98. https://doi.org/10.1145/2970276.2970326

  • Zhang J, Wang X, Zhang H, Sun H, Wang K, Liu X (2019) A novel neural source code representation based on abstract syntax tree. In: Proceedings - International Conference on Software Engineering, ICSE 2019, pp 783–794. https://doi.org/10.1109/ICSE.2019.00086


Author information


Corresponding author

Correspondence to Hongliang Liang.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by Foutse Khomh, Gemma Catolino and Pasquale Salza.

This article belongs to the Topical Collection: Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE)


About this article


Cite this article

Xiao, D., Hang, D., Ai, L. et al. Path context augmented statement and network for learning programs. Empir Software Eng 27, 37 (2022). https://doi.org/10.1007/s10664-021-10098-y

