Abstract
This paper presents a novel approach for comparing software systems by calculating the robust Hausdorff distance between semantic source code embeddings of individual software components, i.e., methods. The proposed approach represents each software system as a set of vectors, where every vector is a semantic source code embedding of a particular method. The code embeddings are constructed from the abstract syntax trees of the methods with the help of attention-based neural network models that capture the semantics of the methods. Previous research has shown that comparing semantic source code embeddings can reveal semantic relationships between two methods. We utilize this characteristic to estimate the semantic similarity between two software systems by computing the robust Hausdorff distance between their sets of method embeddings. In the experiment, a pre-trained code2vec neural network model is used to create the source code vector representations of several open-source Java-based libraries. Several variations of the robust Hausdorff distance are evaluated. The results show that the proposed approach can effectively estimate semantic similarity in a way that reflects the libraries’ scopes, their evolution, and their individual parts (e.g., packages).
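As an illustration of the idea described in the abstract, the following is a minimal sketch of a set-to-set distance in the Hausdorff family. It is not the authors' implementation: the Euclidean metric, the choice of mean and median as robust aggregators, the function names, and the toy 2-D vectors standing in for method embeddings are all assumptions made for this example. The abstract does not state which robust variants or which underlying vector metric were evaluated.

```python
import math
from statistics import mean, median

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def directed_distance(A, B, aggregate=mean, metric=euclidean):
    """For every vector in A, find the distance to its nearest vector in B,
    then combine these nearest-neighbour distances with `aggregate`.
    aggregate=max gives the classic directed Hausdorff distance;
    mean or median give outlier-tolerant (robust) variants."""
    nearest = [min(metric(a, b) for b in B) for a in A]
    return aggregate(nearest)

def undirected_distance(A, B, aggregate=mean, metric=euclidean):
    """Symmetric distance: the larger of the two directed distances."""
    return max(directed_distance(A, B, aggregate, metric),
               directed_distance(B, A, aggregate, metric))

# Hypothetical toy data: each tuple stands in for one method embedding.
lib_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
lib_b = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]  # last vector is an outlier

classic = undirected_distance(lib_a, lib_b, aggregate=max)
robust_mean = undirected_distance(lib_a, lib_b, aggregate=mean)
robust_median = undirected_distance(lib_a, lib_b, aggregate=median)
print(classic, robust_mean, robust_median)
```

In this toy case the single outlying vector in `lib_b` dominates the classic (max-based) distance, while the mean-based and median-based variants are progressively less sensitive to it. This sensitivity is the usual motivation for preferring robust variants when comparing sets that may contain a few atypical elements.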
Acknowledgements
Conceptualization: all authors; Methodology: all authors; Formal analysis and investigation: all authors; Writing - original draft preparation: Sašo Karakatič and Tjaša Heričko; Writing - review and editing: all authors; Supervision: Sašo Karakatič.
Funding
This work was supported by the Slovenian Research Agency (Research Core Funding No. P2-0057).
Ethics declarations
Conflict of Interests
The authors declare no conflict of interest.
Additional information
Communicated by: Meiyappan Nagappan and Tim Menzies
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Inventing the Next Generation of Software Analytics
Appendices
Appendix A
The additional software libraries included in the extended experiments were the following.
- Genson (denoted as genson) – https://github.com/owlike/genson. It is a JSON-to-Java conversion library with full data-binding and streaming support.
- Apache Johnzon Core (johnzon) – https://github.com/apache/johnzon. The library provides an implementation of processing JSON in Java.
- JSON in Java (json java) – https://github.com/stleary/JSON-java. It is a lightweight library that implements JSON encoders and decoders in Java.
- JSON Library (json lib) – https://github.com/kordamp/json-lib. It is a Java library for transforming JSON.
- JSON Smart 2 (json smart) – https://github.com/netplex/json-smart-v2. It is a small JSON processing library. The first version of the library is based on the simple data mapping model provided by Json simple.
- JSON Iterator (jsoniter) – https://github.com/json-iterator/java. It is a Java library for parsing and manipulating JSON.
- JsonPath (jsonpath) – https://github.com/json-path/JsonPath. It is a library for reading JSON documents.
- Eclipse Yasson (yasson) – https://github.com/eclipse-ee4j/yasson. It is a Java framework that provides an implementation of data-binding between Java classes and JSON documents.
- EasyMock (easymock) – https://github.com/easymock/easymock. It is a Java library that provides a way to use mock objects in unit testing.
- JMock 2 Core (jmock) – https://github.com/jmock-developers/jmock-library. It is a mock object library used for test-driven development.
- Dom4j (dom4j) – https://github.com/dom4j/dom4j. It is a Java framework for processing XML.
- Jakarta XML Binding (jaxb) – https://github.com/eclipse-ee4j/jaxb-ri. The library provides mapping between XML and Java code.
- JDOM2 (jdom) – https://github.com/hunterhacker/jdom. It provides Java manipulation of XML.
- Apache XMLBeans (xmlbeans) – https://github.com/apache/xmlbeans. It is a Java-to-XML binding framework.
The basic source code, usage, and repository statistics of the additional software libraries used in the extended experiment are shown in Table 9.
The differences between the software library distances computed with test parts and without test parts in this larger set of 22 software libraries are similar to those observed on the smaller set of eight software libraries (ΔM_distances = 0.034, ΔSD_distances = 0.019; Wilcoxon signed-rank test, W = 149, p < 0.001).
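For readers unfamiliar with the test reported above, the following is a minimal sketch of how the rank sums behind a Wilcoxon signed-rank statistic can be computed for paired measurements. The paired values below are hypothetical and are not the paper's data, and the function name `wilcoxon_rank_sums` is introduced only for this example. Conventions differ on which rank sum is reported as W, so the sketch returns both sums and does not compute a p-value.

```python
def wilcoxon_rank_sums(x, y):
    """Return (W_plus, W_minus) for paired samples x and y.
    Zero differences are dropped; tied absolute differences
    receive the average of the ranks they span."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    # Indices of the differences, ordered by absolute size.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Extend j over a run of tied absolute differences.
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Hypothetical paired distances for five library pairs.
with_tests = [0.50, 0.62, 0.41, 0.75, 0.58]
without_tests = [0.47, 0.60, 0.45, 0.70, 0.52]

w_plus, w_minus = wilcoxon_rank_sums(with_tests, without_tests)
print(w_plus, w_minus)
```

The two sums always add up to n(n + 1)/2 for n non-zero differences, which is a quick sanity check on any implementation.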
Appendix B
The undirected distances between the software libraries from Table 10 are also presented in a scatter plot (Fig. 9) after multidimensional scaling. Note that json simple was not included in this visualization, as it is the most unusual of the included libraries and thus skewed the multidimensional scaling. The JSON libraries clearly cluster in the same region, as do the XML libraries and the testing libraries. The general-purpose libraries lie in the middle of the three clusters.
Cite this article
Karakatič, S., Miloševič, A. & Heričko, T. Software system comparison with semantic source code embeddings. Empir Software Eng 27, 70 (2022). https://doi.org/10.1007/s10664-022-10122-9
DOI: https://doi.org/10.1007/s10664-022-10122-9