
Modeling source code in bimodal for program comprehension

  • Original Article
  • Published in Neural Computing and Applications (2024)

Abstract

Source code is an intermediary through which humans communicate with computer systems. It contains a large amount of domain knowledge that statistical models can learn, and this knowledge can in turn be used to build software engineering tools. We find that the functionality of source code depends on the programming-language-specific tokens, which build its base structure, while identifiers provide natural-language information. On this basis, the knowledge in source code can be learned more thoroughly when the code is modeled bimodally. This paper presents the bimodal composition language model (BCLM) for source code modeling and representation. We analyze the effectiveness of bimodal modeling, and the results show that the bimodal approach has great potential for source code modeling and program comprehension.
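
To make the bimodal split concrete, here is a minimal sketch that separates a snippet into a structural channel (reserved words and operators) and a natural-language channel (identifier sub-tokens). It illustrates the idea only and is not the authors' BCLM implementation: the function name, the sub-tokenization regex, and the use of Python's tokenize module are all assumptions.

```python
import io
import keyword
import re
import tokenize

def bimodal_split(source: str):
    """Split Python source into a structural channel and a natural-language channel."""
    structural, natural = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # Identifiers carry the natural-language signal: split camelCase
            # and snake_case names into lower-cased sub-tokens.
            natural.extend(
                p.lower()
                for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", tok.string)
            )
        elif tok.type in (tokenize.NAME, tokenize.OP):
            # Keywords and operators form the language-specific base structure.
            structural.append(tok.string)
    return structural, natural

code = "def getUserName(user_id):\n    return db.lookup(user_id)\n"
print(bimodal_split(code))
```

On the sample function, the structural channel receives tokens such as def, (, ), and return, while the natural-language channel receives sub-tokens such as get, user, and name, mirroring the two information sources the paper proposes to model jointly.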


Data availability

This article uses the publicly available CodeSearchNet dataset, which can be accessed at the following link: https://github.com/github/CodeSearchNet. Apart from this, no private dataset is used for evaluation in this paper.
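
As a pointer for readers, CodeSearchNet ships as gzipped JSON Lines shards whose records pair a function with its documentation. The sketch below reads one shard; the local path is an assumption, while field names such as language, func_name, and docstring follow the dataset's published schema.

```python
import gzip
import json

# Hypothetical local path to one shard of the CodeSearchNet corpus.
path = "python/final/jsonl/train/python_train_0.jsonl.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        # Each record pairs a function (code) with its natural-language
        # description (docstring) -- the bimodal pair modeled in the paper.
        print(example["language"], example["func_name"])
        print(example["docstring"][:80])
        break  # inspect a single example
```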

Notes

  1. https://nlp.stanford.edu/projects/glove/

  2. https://golang.org/ref/spec

  3. https://www.php.net/manual/en/reserved.php

  4. https://docs.oracle.com/javase/tutorial/java/nutsandbolts/_keywords.html

  5. https://docs.python.org/3/reference/lexical_analysis.html

  6. https://ruby-doc.org/docs/ruby-doc-bundle/Manual/man0-1.4/syntax.html

  7. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Language_Resources

  8. https://nlp.stanford.edu/projects/glove/
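
The language specifications in notes 2 through 7 supply the reserved-word lists that mark the structural channel. A minimal sketch of how such lists could be assembled and queried follows; only a small, illustrative subset of each language's keywords is shown, and the names are assumptions rather than the paper's code.

```python
# Illustrative subsets of the reserved words defined in the language
# specifications cited in notes 2-7; the real lists are longer.
RESERVED = {
    "go":         {"func", "return", "if", "else", "for", "range", "package"},
    "php":        {"function", "return", "if", "else", "foreach", "echo"},
    "java":       {"class", "public", "static", "void", "return", "if"},
    "python":     {"def", "return", "if", "else", "for", "import", "class"},
    "ruby":       {"def", "end", "return", "if", "else", "module", "class"},
    "javascript": {"function", "return", "if", "else", "var", "const"},
}

def is_structural(token: str, language: str) -> bool:
    """True if the token is a reserved word in the given language."""
    return token in RESERVED.get(language, set())

print(is_structural("func", "go"), is_structural("getUserName", "go"))
```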

Funding

This work is partially supported by grants from the National Natural Science Foundation of China (Nos. 62076046 and 62006130) and the Natural Science Foundation of Inner Mongolia (No. 2022MS06028). It is also supported by the National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian and by the Basic Scientific Research Fund of Inner Mongolia Directly Affiliated Colleges and Universities in 2022.

Author information

Corresponding author

Correspondence to Hongfei Lin.

Ethics declarations

Conflict of interest

All authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wen, D., Zhang, X., Diao, Y. et al. Modeling source code in bimodal for program comprehension. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09498-0
