Semantic measure of plagiarism using a hierarchical graph model

Zhang, Tingting; Lee, Baozhen; Zhu, Qinghua

doi:10.1007/s11192-019-03204-x

Published: 19 August 2019

Scientometrics, Volume 121, pages 209–239 (2019)

Tingting Zhang¹,
Baozhen Lee² &
Qinghua Zhu¹


Abstract

Traditional plagiarism detection is based primarily on methods of character matching or topic similarity. Another promising methodology remains largely unexplored: employing deep mining to establish a contextual hierarchy among themes. This paper proposes a semantic approach to measuring the extent of plagiarism, based on a hierarchical graph model. The main innovations are as follows: (1) hierarchical extraction of topic feature terms and elucidation of a corresponding graph structure; (2) graph similarity calculation based on the maximum common subgraph. This semantic-measure method goes beyond semantic detection of topics to take into account the context of topic feature terms, as well as the hierarchical structure by which those topics are related. This contextual-hierarchical perspective should, in turn, improve the accuracy of plagiarism detection. In addition, by mining the implicit relationships between hierarchical feature terms, our method can detect plagiarized documents that share similar themes but use different topic words, a potential boon to plagiarism detection recall. In an experiment conducted on a dataset from the Chinese paper database CNKI, the semantic-measure method indeed demonstrates accuracy and recall superior to those achieved with current state-of-the-art methods.
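The second innovation, graph similarity based on the maximum common subgraph, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives neither the exact formula nor how hierarchy levels are weighted. The sketch assumes that every node is a unique feature term (so the maximum common subgraph reduces to the shared nodes and shared edges), that graph size is the number of nodes plus edges, and that similarity is normalized by the larger graph. The names `make_graph`, `common_subgraph`, `graph_size`, and `mcs_similarity` are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A graph is (nodes, undirected edges); each node is a feature term.
Graph = Tuple[Set[str], Set[FrozenSet[str]]]


def make_graph(term_pairs: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected term graph from pairs of related terms."""
    edges = {frozenset(pair) for pair in term_pairs}
    nodes = {term for edge in edges for term in edge}
    return nodes, edges


def common_subgraph(g1: Graph, g2: Graph) -> Graph:
    """Assumption: node labels are unique, so the maximum common
    subgraph is simply the shared nodes plus the shared edges."""
    return g1[0] & g2[0], g1[1] & g2[1]


def graph_size(g: Graph) -> int:
    """Assumed size measure: number of nodes plus number of edges."""
    return len(g[0]) + len(g[1])


def mcs_similarity(g1: Graph, g2: Graph) -> float:
    """Size of the common subgraph divided by the size of the larger graph."""
    largest = max(graph_size(g1), graph_size(g2))
    if largest == 0:
        return 0.0
    return graph_size(common_subgraph(g1, g2)) / largest


# Two toy documents that share most of their term structure.
doc_a = make_graph([("plagiarism", "detection"),
                    ("detection", "semantics"),
                    ("semantics", "graph")])
doc_b = make_graph([("plagiarism", "detection"),
                    ("detection", "semantics"),
                    ("semantics", "topic")])
print(round(mcs_similarity(doc_a, doc_b), 3))
```

In this toy case the two graphs share three terms and two edges out of seven elements each, so the score is 5/7. A hierarchical variant would presumably compare graphs level by level and combine the scores, but the abstract does not say how.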


Acknowledgements

The author acknowledges support from Project No. 71673122, funded by the National Natural Science Foundation of China, and from the Nanjing University Innovation and Creative Program for PhD Candidates (CXCY17-09).

Author information

Authors and Affiliations

School of Engineering Management, Nanjing University, Nanjing, 210009, China
Tingting Zhang & Qinghua Zhu
School of Information Engineering, Nanjing Audit University, Nanjing, 210031, China
Baozhen Lee

Corresponding author

Correspondence to Qinghua Zhu.

About this article

Cite this article

Zhang, T., Lee, B. & Zhu, Q. Semantic measure of plagiarism using a hierarchical graph model. Scientometrics 121, 209–239 (2019). https://doi.org/10.1007/s11192-019-03204-x

Received: 06 January 2019
Published: 19 August 2019
Issue Date: October 2019
DOI: https://doi.org/10.1007/s11192-019-03204-x

