SEED: Semantic Graph Based Deep Detection for Type-4 Clone

  • Conference paper
Reuse and Software Quality (ICSR 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13297)

Abstract

Type-4 clones are pairs of code snippets with similar semantics but different syntax, which makes them challenging for existing code clone detection techniques. Previous studies rely heavily on syntactic structures and textual tokens, which cannot precisely represent the semantic information of code and may introduce non-negligible noise into detection models. To overcome these limitations, we design a novel semantic graph-based deep detection approach called SEED. For a pair of code snippets, SEED constructs a semantic graph of each snippet from its intermediate representation, which captures code semantics more precisely than representations based on lexical and syntactic analysis. To accommodate the characteristics of Type-4 clones, the semantic graph focuses on operators and API calls rather than all tokens. SEED then generates feature vectors using a graph matching network and performs clone detection based on the similarity between the vectors. Extensive experiments show that our approach significantly outperforms two baseline approaches on two public datasets and one customized dataset. In particular, SEED outperforms the baseline methods by an average of 25.2% in F1-score. Our experiments demonstrate that SEED achieves state-of-the-art performance and can be useful for Type-4 clone detection in practice.
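
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The three-address IR format, the opcode names in `KEEP`, and the 0.8 threshold are all assumptions made for illustration, and a simple bag of node and edge labels stands in for the learned graph matching network. The sketch shows only the overall shape: filter the IR down to operators and API calls, link them by data flow, embed each graph as a vector, and compare the vectors.

```python
import math
from collections import Counter

# Hypothetical IR instruction: (destination, opcode, operands).
# Opcodes treated as "operators and API calls" (an assumption for this sketch).
KEEP = {"add", "sub", "mul", "div", "cmp", "call"}

def semantic_nodes(ir):
    """Keep only operator and call instructions; drop loads, stores, etc."""
    nodes = []
    for dest, op, args in ir:
        if op in KEEP:
            # For a call, the callee name (first operand) becomes part of the label.
            label = f"call:{args[0]}" if op == "call" else op
            nodes.append((dest, label, args))
    return nodes

def embed(ir):
    """Bag of node labels plus data-flow edge labels (stand-in for a learned embedding)."""
    nodes = semantic_nodes(ir)
    defined = {dest: i for i, (dest, _, _) in enumerate(nodes)}
    feats = Counter(label for _, label, _ in nodes)
    for j, (_, label, args) in enumerate(nodes):
        for arg in args:
            i = defined.get(arg)
            if i is not None and i != j:
                # Edge from the node producing `arg` to the node consuming it.
                feats[f"{nodes[i][1]}->{label}"] += 1
    return feats

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_clone(ir_a, ir_b, threshold=0.8):
    """Report a clone when the similarity reaches the (arbitrary) threshold."""
    return cosine(embed(ir_a), embed(ir_b)) >= threshold
```

In this toy setting, two snippets that differ only in variable names and in their surrounding loads or stores produce identical feature vectors, whereas snippets that use different operators and API calls share no features. A learned model would replace `embed` and could also capture similarity between graphs that are not structurally identical.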

Notes

  1. https://github.com/xzpxzp123123/SEED

  2. https://llvm.org/

  3. http://soot-oss.github.io/soot/

  4. https://github.com/piyush69/JCoffee

  5. http://poj.org/

  6. http://codeforces.com/

Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful comments. This work was substantially supported by the National Natural Science Foundation of China (Nos. 61872373 and 61872375).

Author information

Corresponding authors

Correspondence to Zhijie Jiang or Chenlin Huang.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Xue, Z., Jiang, Z., Huang, C., Xu, R., Huang, X., Hu, L. (2022). SEED: Semantic Graph Based Deep Detection for Type-4 Clone. In: Perrouin, G., Moha, N., Seriai, AD. (eds) Reuse and Software Quality. ICSR 2022. Lecture Notes in Computer Science, vol 13297. Springer, Cham. https://doi.org/10.1007/978-3-031-08129-3_8

  • DOI: https://doi.org/10.1007/978-3-031-08129-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08128-6

  • Online ISBN: 978-3-031-08129-3

  • eBook Packages: Computer Science, Computer Science (R0)
