Skip to main content
Log in

Comparison of Image-Based and Text-Based Source Code Classification Using Deep Learning

  • Original Research
  • Published:
SN Computer Science Aims and scope Submit manuscript

Abstract

Source code classification (SCC) is a task to assign codes into different categories according to a criterion such as according to their functionalities, programming languages or vulnerabilities. Many source code archives are organized according to the programming languages, and thereby, the desired code fragments can be easily accessed by searching within the archive. However, manually organizing source code archives by field experts is labor intensive and impractical because of the fast-growing available source codes. Therefore, this study proposes new convolutional neural network (CNN) architectures to build source code classifiers that automatically identify programming languages from source codes. This is the first study in which the performances of deep learning algorithms on programming language identification are compared on both image and text files. In this study, the experiments are performed on three source code datasets to identify eight programming languages, including C, C++, C# , Go, Python, Ruby, Rust, and Java. The comparative results indicate that although text-based SCC and image-based SCC approaches achieve very high (\(> 93.5\%\)) and similar accuracies, text-based classification has significantly better performance in terms of execution time.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

References

  1. Alrashedy K, Dharmaretnam D, German DM, Srinivasan V, Gulliver TA. SCC++: predicting the programming language of questions and snippets of Stack Overflow. J Syst Softw. 2020;. https://doi.org/10.1016/j.jss.2019.110505.

    Article  Google Scholar 

  2. Zevin S, Holzem C. Machine learning based source code classification using syntax oriented features 2017. arXiv preprint arXiv:1703.07638.

  3. Gilda S. Source code classification using neural networks. In: 14th International Joint Conference on Computer Science and Software Engineering (JCSSE), IEEE; 2017. p. 1–6.

  4. Baquero JF, Camargo JE, Restrepo-Calle F, Aponte JH, González FA. Predicting the programming language: Extracting knowledge from stack overflow posts. In: Colombian Conference on Computing. Springer; 2017. p. 199–210.

  5. Ott J, Atchison A, Harnack P, Bergh A, Linstead E. A deep learning approach to identifying source code in images and video. In: 15th International Conference on Mining Software Repositories (MSR), IEEE/ACM; 2018. p. 376-386.

  6. Zhao D, Xing Z, Chen C, Xia X, Li G. ActionNet: vision-based workflow action recognition from programming screencasts.In: 41st International Conference on Software Engineering (ICSE), IEEE/ACM; 2019. p. 350–361.

  7. Oda Y, Fudaba H, Neubig G, Hata H, Sakti S, Toda T, Nakamura S. Learning to generate pseudo-code from source code using statistical machine translation. In: 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE/ACM; 2015. p. 574–584.

  8. Kuhn A, Ducasse S, Gírba T. Semantic clustering: identifying topics in source code. Inf Softw Technol. 2007;. https://doi.org/10.1016/j.infsof.2006.10.017.

    Article  Google Scholar 

  9. Darwish O, Maabreh M, Karajeh O, Alsinglawi B. Source codes classification using a modified instruction count pass. In: Workshops of the International Conference on Advanced Information Networking and Applications (WAINA), Springer; 2019. p. 897–906.

  10. Nguyen AT, Nguyen TN. Graph-based statistical language model for code. In : 37th IEEE International Conference on Software Engineering (ICSE), IEEE/ACM; vol 1; 2015. p.858–868.

  11. Phana AH, Chau PN, Nguyen ML, Bui LT. Automatically classifying source code using tree-based approaches. 2018;. https://doi.org/10.1016/j.datak.2017.07.003.

  12. Wilson W, Muteteke JJ, Li L. Automatic clustering of source code using self-organizing maps, In: Proceedings of 19th Annual Conference of SAIS. 2016; p. 1–5.

  13. Shi ST, Li M, Lo D, Thung F, Huo X. Automatic code review by learning the revision of source code. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 33; 2019. p. 4910–4917.

  14. Bandara U, Wijayarathna G. Source code author identification with unsupervised feature learning. Pattern Recognit Lett. 2013;. https://doi.org/10.1016/j.patrec.2012.10.027.

    Article  Google Scholar 

  15. Ying AT, Robillard MP. Code fragment summarization. In: Proceedings of the 9th Joint Meeting on Foundations of Software Engineering. 2013; p. 655–658.

  16. Alvares M, Marwala T, de Lima Neto FB. Application of computational intelligence for source code classification. In: Congress on Evolutionary Computation (CEC), IEEE; 2014. p. 895–902.

  17. Alrashedy K, Dharmaretnam D, German DM, Srinivasan V, Gulliver TA. SCC: Automatic classification of code snippets 2018. arXiv preprint arXiv:1809.07945

  18. Reyes J, Ramírez D, Paciello J, Automatic classification of source code archives by programming language: a deep learning approach. In: International Conference on Computational Science and Computational Intelligence (CSCI), IEEE; 2016. p. 514–519.

  19. Dam V, Kennedy J, Zaytsev V. Software language identification with natural language classifiers. In: 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol 01; 2016. p. 624–628.

  20. Khasnabish JN, Sodhi M, Deshmukh J, Srinivasaraghavan G. Detecting programming language from source code using bayesian learning techniques. In: International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer; 2014. p. 513–522.

  21. Klein D, Muuray K, Weber S. Algorithmic programming language identification 2011. arXiv preprint arXiv:1106.4064.

  22. Alahmadi M, Hassel J, Parajuli B, Haiduc S, Kumar P. Accurately predicting the location of code fragments in programming video tutorials using deep learning. In: Proceedings of the 14th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE), 2018. p. 2–11.

  23. Guo G, Zhang N. A survey on deep learning based face recognition. Comput Vis Image Underst. 2019;. https://doi.org/10.1016/j.cviu.2019.102805.

    Article  Google Scholar 

  24. Pastor-Pellicer J, Castro-Bleda MJ, España-Boquera S, Zamora-Martíez F. Handwriting recognition by using deep learning to extract meaningful features. AI Commun. 2019;. https://doi.org/10.3233/AIC-170562.

    Article  MathSciNet  Google Scholar 

  25. Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, Pietikainen M. Deep learning for generic object detection: a survey. Int J Comput Vis. 2020;. https://doi.org/10.1007/s11263-019-01247-4.

    Article  Google Scholar 

  26. Iwasaki R, Hasegawa T, Mori N, Matsumoto K. Relaxation method of convolutional neural networks for natural language processing. In: International Symposium on Distributed Computing and Artificial Intelligence. Springer; 2018. p.188–195.

  27. Gimenez M, Palanca J, Botti V. Semantic-based padding in convolutional neural networks for improving the performance in natural language processing. A case of study in sentiment analysis. Neurocomputing. 2020; https://doi.org/10.1016/j.neucom.2019.08.096.

  28. Gao H, Lin S, Li C, Yang Y. Application of hyperspectral image classification based on overlap pooling. Neural Process Lett. 2019;49:1335–54.

    Article  Google Scholar 

  29. Laks R. Image-based detection of programming languages. In: Github. 2018. https://github.com/rivol/programming-language-detection. Accessed 15 Nov 2019.

  30. Heres D. Programming language identification tool. In: Algorithmia. 2016. https://algorithmia.com/algorithms. Accessed 8 July 2020.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Derya Birant.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Deep learning approaches for data analysis: A practical perspective” guest edited by D. Jude Hemanth, Lipo Wang and Anastasia Angelopoulou.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kiyak, E.O., Cengiz, A.B., Birant, K.U. et al. Comparison of Image-Based and Text-Based Source Code Classification Using Deep Learning. SN COMPUT. SCI. 1, 266 (2020). https://doi.org/10.1007/s42979-020-00281-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s42979-020-00281-1

Keywords

Navigation