A graph-based code representation method to improve code readability classification

Mi, Qing; Zhan, Yi; Weng, Han; Bao, Qinghang; Cui, Longjie; Ma, Wei

doi:10.1007/s10664-023-10319-6

Published: 23 May 2023

Volume 28, article number 87 (2023)

Empirical Software Engineering

Qing Mi ORCID: orcid.org/0000-0001-5063-3189¹,
Yi Zhan¹,
Han Weng¹,
Qinghang Bao¹,
Longjie Cui¹ &
Wei Ma ORCID: orcid.org/0000-0001-9652-4260¹

Abstract

Context

Code readability is crucial for developers because it is closely tied to code maintenance and affects their work efficiency. Code readability classification assigns source code to one of several predefined levels according to its readability. Many code readability classification models have been proposed in existing studies, including deep learning networks that achieve relatively high accuracy.

Objective

In terms of representation, however, these methods do not effectively preserve the syntactic and semantic structure of the source code. To capture this structure, we propose a graph-based code representation method.

Method

First, the source code is parsed into a graph that contains its abstract syntax tree (AST) combined with control-flow and data-flow edges, which preserves the semantic and structural information. Next, we convert each graph node's source code and type information into vectors. Finally, we train a graph neural network model composed of Graph Convolutional Network (GCN), DMoNPooling, and k-dimensional Graph Neural Network (k-GNN) layers to extract these features from the program graph.
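The pipeline described above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the toy graph for the Java fragment `int y = x + 1; return y;` is built by hand rather than by a parser, the node labels, one-hot features, and fixed weights are hypothetical stand-ins for learned embeddings, and the DMoNPooling and k-GNN layers are replaced here by a plain mean readout for brevity. The only part taken from the literature is the GCN propagation rule of Kipf and Welling (2016), ReLU(D^-1/2 (A + I) D^-1/2 H W).

```python
import math

# Hypothetical toy graph for the Java fragment:  int y = x + 1; return y;
# Node labels and edge choices are illustrative, not real parser output.
NODES = ["Block", "VarDecl y", "BinaryExpr +", "Name x",
         "Literal 1", "Return", "Name y"]
AST_EDGES = [(0, 1), (0, 5), (1, 2), (2, 3), (2, 4), (5, 6)]
CONTROL_FLOW_EDGES = [(1, 5)]  # the declaration executes before the return
DATA_FLOW_EDGES = [(1, 6)]     # the value defined as y is used by `return y`


def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected view of the edges."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loops, i.e. A + I
    for src, dst in edges:
        a[src][dst] = 1.0
        a[dst][src] = 1.0
    degree = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(degree[i] * degree[j])
             for j in range(num_nodes)] for i in range(num_nodes)]


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def gcn_layer(a_hat, features, weight):
    """One GCN propagation step: ReLU(A_hat @ H @ W)."""
    out = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, v) for v in row] for row in out]


def mean_readout(features):
    """Average node vectors into a single graph-level vector."""
    n = len(features)
    return [sum(row[j] for row in features) / n
            for j in range(len(features[0]))]


# One-hot node features (7 x 7), a stand-in for learned node embeddings.
features = [[1.0 if i == j else 0.0 for j in range(len(NODES))]
            for i in range(len(NODES))]
# Fixed toy weights (7 x 2); a trained model would learn these.
weight = [[0.1 * (i + 1), -0.05 * (i + 1)] for i in range(len(NODES))]

edges = AST_EDGES + CONTROL_FLOW_EDGES + DATA_FLOW_EDGES
a_hat = normalized_adjacency(len(NODES), edges)
hidden = gcn_layer(a_hat, features, weight)
graph_vector = mean_readout(hidden)
```

In this sketch the control-flow and data-flow edges simply add extra neighbours, so a node such as `Name y` receives information from the declaration that defines it, which a plain AST would only connect through the shared `Block` ancestor. A classifier head would then map `graph_vector` to a readability class.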

Result

We evaluate our approach on the task of code readability classification using a Java dataset provided by Scalabrino et al. (2016). The results show that our method achieves accuracies of 72.5% and 88% in three-class and two-class classification, respectively.

Conclusion

We are the first to introduce graph-based representation into code readability classification. Our method outperforms state-of-the-art readability models, which suggests that the graph-based code representation method is effective in extracting syntactic and semantic information from source code, and ultimately improves code readability classification.

Data Availability

To aid reproducibility, we provide all the data and source code needed to replicate our findings. The complete replication package is available at https://github.com/swy0601/Graph-Representation.

References

Alawad DM, Panta M, Zibran MF, et al (2019) An empirical study of the relationships between code readability and software complexity. arXiv:1909.01760
Allamanis M, Barr ET, Sutton C (2014) Learning natural coding conventions. In: Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering
Allamanis M, Brockschmidt M, Khademi M (2018) Learning to represent programs with graphs. arXiv:1711.00740
Alon U, Zilberstein M, Levy O, et al (2018) A general path-based representation for predicting program properties. CoRR abs/1803.09544. http://arxiv.org/abs/1803.09544
Brunner E, Munzel U (2000) The nonparametric behrens-fisher problem: Asymptotic theory and a small-sample approximation. Biom J 42:17–25. https://doi.org/10.1002/(SICI)1521-4036(200001)42:1<17::AID-BIMJ17>3.0.CO;2-U
Buse R, Weimer W (2010) Learning a metric for code readability. Softw Eng IEEE Trans 36:546–558. https://doi.org/10.1109/TSE.2009.70
Cao S, Sun X, Bo L et al (2021) BGNN4VD: Constructing bidirectional graph neural-network for vulnerability detection. Inf Softw Technol 136:106576
Devlin J, Chang MW, Lee K, et al (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Dinella E, Dai H, Li Z, Naik M, Song L, Wang K (2020) Hoppity: Learning graph transformations to detect and fix bugs in programs. International Conference on Learning Representations. https://iclr.cc/virtual_2020/poster_SJeqs6EFvB.html
Dorn J (2012) A general software readability model. MCS Thesis available from (http://www.cs.virginia.edu/weimer/students/dorn-mcs-paper.pdf) 5:11–14
Fakhoury S, Roy D, Hassan SA, et al (2019) Improving source code readability: Theory and practice. In: 2019 IEEE/ACM 27th international conference on program comprehension (ICPC). pp 2–12
Feng Z, Guo D, Tang D, et al (2020) CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155
Fernandes P, Allamanis M, Brockschmidt M (2019) Structured neural summarization. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=H1ersoRqtm
Gilmer J, Schoenholz SS, Riley PF, et al (2017) Neural message passing for quantum chemistry. arXiv:1704.01212
Hindle A, Barr ET, Su Z, et al (2012) On the naturalness of software. In: 2012 34th International Conference on Software Engineering (ICSE), pp 837–847. https://doi.org/10.1109/ICSE.2012.6227135
Hu Z, Dong Y, Wang K, et al (2020) Heterogeneous graph transformer. In: Huang Y, King I, Liu T, et al (eds) WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020. ACM / IW3C2, pp 2704–2710. https://doi.org/10.1145/3366423.3380027
Johnson J, Lubo S, Yedla N, et al (2019) An empirical study assessing source code readability in comprehension. In: 2019 IEEE International conference on software maintenance and evolution (ICSME). pp 513–523
Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907
LeClair A, Haque S, Wu LL, et al (2020) Improved code summarization via a graph neural network. In: Proceedings of the 28th international conference on program comprehension
Lee T, Lee JB, In H (2013) A study of different coding styles affecting code readability. Int J Softw Eng Appl 7:413–422. https://doi.org/10.14257/ijseia.2013.7.5.36
Ling C, Huang J, Zhang H (2003) AUC: a statistically consistent and more discriminating measure than accuracy. In: Proc 18th Int’l joint conf artificial intelligence (IJCAI)
Li Y, Tarlow D, Brockschmidt M, et al (2016) Gated graph sequence neural networks. In: Bengio Y, LeCun Y (eds) 4th international conference on learning representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. http://arxiv.org/abs/1511.05493
Maddison CJ, Tarlow D (2014) Structured generative models of natural source code. ArXiv abs/1401.0514
Mannan UA, Ahmed I, Sarma A (2018) Towards understanding code readability and its impact on design quality. In: Proceedings of the 4th ACM SIGSOFT international workshop on NLP for software engineering. pp 18–21
Ma Y, Wang S, Aggarwal CC, et al (2019) Graph convolutional networks with eigenpooling. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. pp 723–731
Mi Q, Keung J, Xiao Y et al (2018) Improving code readability classification using convolutional neural networks. Inf Softw Technol 104. https://doi.org/10.1016/j.infsof.2018.07.006
Mi Q, Hao Y, Ou L, Ma W (2022a) Towards using visual, semantic and structural features to improve code readability classification. J Syst Softw 193:11. https://doi.org/10.1016/j.jss.2022.111454
Mi Q, Hao Y, Wu M, et al (2022b) An enhanced data augmentation approach to support multi-class code readability classification. In: International conference on software engineering and knowledge engineering
Morris C, Ritzert M, Fey M, et al (2019) Weisfeiler and Leman go neural: Higher-order graph neural networks. In: Proceedings of the AAAI conference on artificial intelligence. pp 4602–4609
Pantiuchina J, Lanza M, Bavota G (2018) Improving code: The (mis) perception of quality metrics. In: 2018 IEEE international conference on software maintenance and evolution (ICSME). pp 80–91
Piantadosi V, Fierro F, Scalabrino S et al (2020) How does code readability change during software evolution? Empir Softw Eng 25:5374–5412
Posnett D, Hindle A, Devanbu P (2011) A simpler model of software readability. In: Proceedings of the 8th working conference on mining software repositories. pp 73–82
Raychev V, Bielik P, Vechev MT (2016) Probabilistic model for code with decision trees. In: Visser E, Smaragdakis Y (eds) Proceedings of the 2016 ACM SIGPLAN international conference on object-oriented programming, systems, languages, and applications, OOPSLA 2016, part of SPLASH 2016, Amsterdam, The Netherlands, October 30 - November 4, 2016. ACM, pp 731–747. https://doi.org/10.1145/2983990.2984041
Scalabrino S, Linares-Vásquez M, Oliveto R et al (2018) A comprehensive model for code readability. J Softw Evol Process 30. https://doi.org/10.1002/smr.1958
Scalabrino S, Linares-Vásquez M, Poshyvanyk D, et al (2016) Improving code readability models with textual features. In: 2016 IEEE 24th International conference on program comprehension (ICPC), IEEE, pp 1–10
Sedano T (2016) Code readability testing, an empirical study. In: 2016 IEEE 29th International conference on software engineering education and training (CSEET). pp 111–117
Tsitsulin A, Palowitch J, Perozzi B, et al (2020) Graph clustering with graph neural networks. arXiv preprint arXiv:2006.16904
Vagavolu D, Swarna KC, Chimalakonda S (2021) A mocktail of source code representations. In: 2021 36th IEEE/ACM International conference on automated software engineering (ASE). pp 1296–1300
Wang X, Ji H, Shi C, et al (2019) Heterogeneous graph attention network. In: The world wide web conference. pp 2022–2032
Wang W, Li G, Ma B, et al (2020a) Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: Kontogiannis K, Khomh F, Chatzigeorgiou A, et al (eds) 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020. IEEE, pp 261–271. https://doi.org/10.1109/SANER48275.2020.9054857
Wang W, Zhang K, Li G, et al (2020b) Learning to represent programs with heterogeneous graphs. CoRR abs/2012.04188. https://arxiv.org/abs/2012.04188
Xia X, Bao L, Lo D et al (2017) Measuring program comprehension: A large-scale field study with professionals. IEEE Trans Softw Eng 44(10):951–976
Xu K, Hu W, Leskovec J, et al (2018a) How powerful are graph neural networks? CoRR abs/1810.00826. http://arxiv.org/abs/1810.00826
Xu K, Li C, Tian Y, et al (2018b) Representation learning on graphs with jumping knowledge networks. arXiv:1806.03536
Yamaguchi F, Golde N, Arp D, et al (2014) Modeling and discovering vulnerabilities with code property graphs. In: 2014 IEEE Symposium on security and privacy, SP 2014, Berkeley, CA, USA, May 18-21, 2014. IEEE Computer Society, pp 590–604, https://doi.org/10.1109/SP.2014.44
Zhang C, Song D, Huang C, et al (2019) Heterogeneous graph neural network. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. pp 793–803
Zhou Y, Liu S, Siow J, Du X, Liu Y (2019) Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Adv Neural Inf Process Syst 915:11

Acknowledgements

This work was supported by the GHfund B (20220202, ghfund202202028015) and the Spark Project of Beijing University of Technology (Project No. XH-2022-01-28).

Author information

Authors and Affiliations

Faculty of Information Technology, Beijing University of Technology, Beijing, China
Qing Mi, Yi Zhan, Han Weng, Qinghang Bao, Longjie Cui & Wei Ma

Corresponding author

Correspondence to Wei Ma.

Ethics declarations

Conflicts of Interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Simone Scalabrino, Rocco Oliveto, Felipe Ebert, Fernanda Madeiral, Fernando Castor.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mi, Q., Zhan, Y., Weng, H. et al. A graph-based code representation method to improve code readability classification. Empir Software Eng 28, 87 (2023). https://doi.org/10.1007/s10664-023-10319-6

Accepted: 13 March 2023
Published: 23 May 2023
DOI: https://doi.org/10.1007/s10664-023-10319-6
