Source Code Author Identification Method Combining Semantics and Statistical Features

Sun, Xu; Sun, Yutong; Kong, Leilei; Han, Yong; Ning, Hui

doi:10.1007/978-3-030-92632-8_14

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 107)

Included in the following conference series:

International Conference on Business Intelligence and Information Technology

Abstract

The globalization of information sharing has made copying ever easier. The spread of plagiarism has drawn wide attention in academic circles, and plagiarism detection has become an active research topic in recent years. Taking deep learning-based plagiarism detection modeling as its research object and improved detection performance as its objective, this paper studies the task of source code author identification in internal plagiarism detection. In this task, text features derived from a pre-trained language model can capture the semantic information of code fragments. However, such features do not adequately represent the complex statistical features computed over the full text. This paper therefore proposes a source code author identification method that combines semantic features with full-text statistical features to build a model of source code writing style and thereby identify the author accurately. Experimental results on the AI-SOCO dataset show that the proposed method outperforms models that use only statistical features or only semantic features.
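
The abstract does not specify the model architecture, so the following is only a minimal sketch of the general idea of fusing the two feature types, not the authors' implementation. It assumes hashed character-trigram frequencies plus a few layout ratios as the full-text statistical features, a hypothetical `toy_embed` function as a stand-in for a pre-trained language model embedding, and a nearest-centroid classifier over the concatenated vectors. All function names, dimensions, and the sample snippets are illustrative assumptions.

```python
import math
import re
import zlib
from collections import Counter
from typing import Callable, Dict, List

Vector = List[float]


def _bucket(text: str, dim: int) -> int:
    # Stable hash (the built-in hash() is salted per process for strings).
    return zlib.crc32(text.encode("utf-8")) % dim


def _l2_normalize(vec: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def statistical_features(code: str, dim: int = 64) -> Vector:
    """Full-text statistics: hashed character-trigram frequencies plus layout ratios."""
    vec = [0.0] * dim
    trigrams = Counter(code[i:i + 3] for i in range(len(code) - 2))
    total = sum(trigrams.values()) or 1
    for gram, count in trigrams.items():
        vec[_bucket(gram, dim)] += count / total
    lines = code.splitlines() or [""]
    n_chars = len(code) or 1
    layout = [
        sum(len(line) for line in lines) / (100.0 * len(lines)),  # mean line length / 100
        code.count(" ") / n_chars,   # space ratio
        code.count("\t") / n_chars,  # tab ratio (indentation habit)
        code.count("_") / n_chars,   # underscore ratio (naming habit)
    ]
    return vec + layout


def toy_embed(code: str, dim: int = 32) -> Vector:
    """Hypothetical stand-in for a pre-trained language model embedding."""
    vec = [0.0] * dim
    for token in re.findall(r"[A-Za-z_]\w*", code):
        vec[_bucket(token, dim)] += 1.0
    return vec


def fuse(code: str, embed: Callable[[str], Vector] = toy_embed) -> Vector:
    """Concatenate L2-normalized semantic and statistical feature vectors."""
    return _l2_normalize(embed(code)) + _l2_normalize(statistical_features(code))


def train_centroids(samples: Dict[str, List[str]]) -> Dict[str, Vector]:
    """Average the fused vectors of each author's code samples."""
    centroids: Dict[str, Vector] = {}
    for author, codes in samples.items():
        vectors = [fuse(code) for code in codes]
        centroids[author] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return centroids


def predict(code: str, centroids: Dict[str, Vector]) -> str:
    """Return the author whose centroid is most cosine-similar to the query."""
    query = fuse(code)
    return max(centroids, key=lambda author: _cosine(query, centroids[author]))


# Illustrative data only: two made-up authors with different formatting habits.
sample_a = "int main(){\n\tint myValue=0;\n\treturn myValue;\n}\n"
sample_b = "int main()\n{\n    int my_value = 0;\n    return my_value;\n}\n"
centroids = train_centroids({"author_a": [sample_a], "author_b": [sample_b]})
print(predict(sample_a, centroids))
```

Normalizing each part before concatenation keeps either feature family from dominating the similarity score purely because of its scale. In a realistic setup, `embed` would be replaced by a call into an actual pre-trained model and the centroid classifier by a trained one.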

Acknowledgment

This work is supported by the Social Science Foundation of Heilongjiang Province (No. 210120002).

Author information

Authors and Affiliations

Heilongjiang Institute of Technology, Jixi, Heilongjiang, China
Xu Sun & Yutong Sun
Foshan University, Foshan, Guangdong, China
Leilei Kong & Yong Han
Harbin Engineering University, Harbin, Heilongjiang, China
Hui Ning

Editor information

Editors and Affiliations

Information Technology Department, Cairo University, Giza, Egypt
Aboul Ella Hassanien
School of Computer and Information Engineering, Harbin University of Commerce, Harbin, Heilongjiang, China
Yaoqun Xu
School of Computer and Information Engineering, Harbin University of Commerce, Harbin, Heilongjiang, China
Zhijie Zhao
Department of Computer Science, Lakehead University, Thunder Bay, ON, Canada
Sabah Mohammed
School of Computer and Information Engineering, Harbin University of Commerce, Harbin, Heilongjiang, China
Zhipeng Fan

Cite this paper

Sun, X., Sun, Y., Kong, L., Han, Y., Ning, H. (2022). Source Code Author Identification Method Combining Semantics and Statistical Features. In: Hassanien, A.E., Xu, Y., Zhao, Z., Mohammed, S., Fan, Z. (eds) Business Intelligence and Information Technology. BIIT 2021. Lecture Notes on Data Engineering and Communications Technologies, vol 107. Springer, Cham. https://doi.org/10.1007/978-3-030-92632-8_14

DOI: https://doi.org/10.1007/978-3-030-92632-8_14
Published: 16 December 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92631-1
Online ISBN: 978-3-030-92632-8
eBook Packages: Intelligent Technologies and Robotics (R0)
