Comparison of Image-Based and Text-Based Source Code Classification Using Deep Learning

Kiyak, Elife Ozturk; Cengiz, Ayse Betul; Birant, Kokten Ulas; Birant, Derya

doi:10.1007/s42979-020-00281-1

Original Research
Published: 14 August 2020

SN Computer Science, volume 1, article number 266 (2020)

Elife Ozturk Kiyak¹,
Ayse Betul Cengiz¹,
Kokten Ulas Birant² &
Derya Birant² (ORCID: orcid.org/0000-0003-3138-0432)

Abstract

Source code classification (SCC) is the task of assigning source code to categories according to a criterion such as functionality, programming language, or vulnerability. Many source code archives are organized by programming language, so that desired code fragments can be found easily by searching within the archive. However, having field experts organize these archives manually is labor intensive and impractical, because the amount of available source code grows quickly. This study therefore proposes new convolutional neural network (CNN) architectures for building source code classifiers that automatically identify the programming language of source code. This is the first study to compare the performance of deep learning algorithms for programming language identification on both image and text files. The experiments are performed on three source code datasets to identify eight programming languages: C, C++, C#, Go, Python, Ruby, Rust, and Java. The comparative results indicate that, although the text-based and image-based SCC approaches achieve similar and very high accuracies (> 93.5%), text-based classification performs significantly better in terms of execution time.
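To make the text-versus-image distinction concrete, the following is a minimal sketch of how one code snippet could be turned into a 1-D input (for a text-based classifier) and a 2-D input (for an image-based one). The abstract does not describe the authors' preprocessing, so the function names `encode_text` and `encode_image`, the sizes, and the character-grid stand-in for a rendered image are all assumptions made for illustration only.

```python
from typing import List

# Hypothetical helpers: not the paper's pipeline, only an illustration
# of the two input shapes a CNN classifier might consume.


def encode_text(snippet: str, max_len: int = 64) -> List[int]:
    """Text view: one integer per character, truncated or zero-padded to max_len."""
    codes = [min(ord(ch), 255) for ch in snippet[:max_len]]
    return codes + [0] * (max_len - len(codes))


def encode_image(snippet: str, rows: int = 8, cols: int = 16) -> List[List[int]]:
    """Image-like view: a rows x cols grid, one source line per row.

    A real image-based approach would use rendered pixels; a character
    grid is used here only to keep the sketch dependency-free.
    """
    grid = [[0] * cols for _ in range(rows)]
    for r, line in enumerate(snippet.splitlines()[:rows]):
        for c, ch in enumerate(line[:cols]):
            grid[r][c] = min(ord(ch), 255)
    return grid


snippet = "def add(a, b):\n    return a + b\n"
text_input = encode_text(snippet)     # 1-D sequence for a text-based model
image_input = encode_image(snippet)   # 2-D array for an image-based model
print(len(text_input), len(image_input), len(image_input[0]))
```

The 2-D representation carries more values per snippet than the 1-D one, which is one plausible reason an image-based classifier could take longer to run, although the abstract itself does not give the cause of the execution-time difference.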

Author information

Authors and Affiliations

The Graduate School of Natural and Applied Sciences, Dokuz Eylul University, 35390, Izmir, Turkey
Elife Ozturk Kiyak & Ayse Betul Cengiz
Department of Computer Engineering, Dokuz Eylul University, 35390, Izmir, Turkey
Kokten Ulas Birant & Derya Birant

Corresponding author

Correspondence to Derya Birant.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Deep learning approaches for data analysis: A practical perspective” guest edited by D. Jude Hemanth, Lipo Wang and Anastasia Angelopoulou.

About this article

Cite this article

Kiyak, E.O., Cengiz, A.B., Birant, K.U. et al. Comparison of Image-Based and Text-Based Source Code Classification Using Deep Learning. SN COMPUT. SCI. 1, 266 (2020). https://doi.org/10.1007/s42979-020-00281-1

Received: 31 March 2020
Accepted: 30 July 2020
Published: 14 August 2020
DOI: https://doi.org/10.1007/s42979-020-00281-1
