
Modeling source code in bimodal for program comprehension

  • Original Article
  • Published in Neural Computing and Applications (2024)

Abstract

Source code is an intermediary through which humans communicate with computer systems. It contains a large amount of domain knowledge that statistical models can learn, and this knowledge can in turn be used to build software engineering tools. We find that the functionality of source code depends on the programming-language-specific tokens, which build its base structure, while identifiers provide natural-language information. On this basis, the knowledge in source code can be learned more thoroughly when the code is modeled bimodally. This paper presents the bimodal composition language model (BCLM) for source code modeling and representation. We analyze the effectiveness of bimodal modeling, and the results show that the bimodal approach has great potential for source code modeling and program comprehension.
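
To make the bimodal split concrete, here is a minimal sketch that separates a snippet into a structural channel (reserved words and operators) and a natural-language channel (identifier sub-tokens). It illustrates the idea only and is not the authors' BCLM implementation: the function name, the sub-tokenization regex, and the use of Python's tokenize module are all assumptions.

```python
import io
import keyword
import re
import tokenize

def bimodal_split(source: str):
    """Split Python source into a structural channel and a natural-language channel."""
    structural, natural = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # Identifiers carry the natural-language signal: split camelCase
            # and snake_case names into lower-cased sub-tokens.
            natural.extend(
                p.lower()
                for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", tok.string)
            )
        elif tok.type in (tokenize.NAME, tokenize.OP):
            # Keywords and operators form the language-specific base structure.
            structural.append(tok.string)
    return structural, natural

code = "def getUserName(user_id):\n    return db.lookup(user_id)\n"
print(bimodal_split(code))
```

On the sample function, the structural channel receives tokens such as def, (, ), and return, while the natural-language channel receives sub-tokens such as get, user, and name, mirroring the two information sources the paper proposes to model jointly.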


Data availability

This article uses the publicly available CodeSearchNet dataset, which can be accessed at the following link: https://github.com/github/CodeSearchNet. Apart from this, no private dataset is used for evaluation in this paper.
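
As a pointer for readers, CodeSearchNet ships as gzipped JSON Lines shards whose records pair a function with its documentation. The sketch below reads one shard; the local path is an assumption, while field names such as language, func_name, and docstring follow the dataset's published schema.

```python
import gzip
import json

# Hypothetical local path to one shard of the CodeSearchNet corpus.
path = "python/final/jsonl/train/python_train_0.jsonl.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        # Each record pairs a function (code) with its natural-language
        # description (docstring) -- the bimodal pair modeled in the paper.
        print(example["language"], example["func_name"])
        print(example["docstring"][:80])
        break  # inspect a single example
```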

Notes

  1. https://nlp.stanford.edu/projects/glove/

  2. https://golang.org/ref/spec

  3. https://www.php.net/manual/en/reserved.php

  4. https://docs.oracle.com/javase/tutorial/java/nutsandbolts/_keywords.html

  5. https://docs.python.org/3/reference/lexical_analysis.html

  6. https://ruby-doc.org/docs/ruby-doc-bundle/Manual/man0-1.4/syntax.html

  7. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Language_Resources

  8. https://nlp.stanford.edu/projects/glove/
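
The language specifications in notes 2 through 7 supply the reserved-word lists that mark the structural channel. A minimal sketch of how such lists could be assembled and queried follows; only a small, illustrative subset of each language's keywords is shown, and the names are assumptions rather than the paper's code.

```python
# Illustrative subsets of the reserved words defined in the language
# specifications cited in notes 2-7; the real lists are longer.
RESERVED = {
    "go":         {"func", "return", "if", "else", "for", "range", "package"},
    "php":        {"function", "return", "if", "else", "foreach", "echo"},
    "java":       {"class", "public", "static", "void", "return", "if"},
    "python":     {"def", "return", "if", "else", "for", "import", "class"},
    "ruby":       {"def", "end", "return", "if", "else", "module", "class"},
    "javascript": {"function", "return", "if", "else", "var", "const"},
}

def is_structural(token: str, language: str) -> bool:
    """True if the token is a reserved word in the given language."""
    return token in RESERVED.get(language, set())

print(is_structural("func", "go"), is_structural("getUserName", "go"))
```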

Funding

This work is partially supported by grants from the National Natural Science Foundation of China (Nos. 62076046 and 62006130) and the Natural Science Foundation of Inner Mongolia (No. 2022MS06028). It is also supported by the National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian and by the Basic Scientific Research Fund of Inner Mongolia Directly Affiliated Colleges and Universities in 2022.

Author information

Corresponding author

Correspondence to Hongfei Lin.

Ethics declarations

Conflict of interest

All authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wen, D., Zhang, X., Diao, Y. et al. Modeling source code in bimodal for program comprehension. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09498-0
