Abstract
Traditional plagiarism detection relies primarily on character matching or topic similarity. A promising alternative remains largely unexplored: using deep mining to establish a contextual hierarchy among themes. This paper proposes a semantic approach to measuring the extent of plagiarism, based on a hierarchical graph model. The main innovations are (1) hierarchical extraction of topic feature terms and construction of a corresponding graph structure, and (2) graph similarity calculation based on the maximum common subgraph. The semantic-measure method goes beyond semantic detection of topics to take into account the context of topic feature terms, as well as the hierarchical structure by which those topics are related. This contextual-hierarchical perspective should, in turn, improve the accuracy of plagiarism detection. In addition, by mining the implicit relationships between hierarchical feature terms, the method can detect plagiarized documents that share similar themes but use different topic words, which should improve recall. In an experiment on a dataset from the Chinese paper database CNKI, the semantic-measure method achieves higher accuracy and recall than current state-of-the-art methods.
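To make the idea of a similarity score based on the maximum common subgraph concrete, the following is a minimal Python sketch, not the implementation used in the paper. It rests on several assumptions that the abstract does not state: each document is represented as a directed graph whose nodes are uniquely labelled topic feature terms and whose edges run from a broader term to a narrower one; the size of a graph is counted as nodes plus edges; and the score is normalized by the size of the larger graph. Because node labels are assumed unique, the common subgraph reduces to a set intersection; with repeated or unlabelled nodes the problem is computationally hard in general and needs a dedicated search. The function names `common_subgraph` and `mcs_similarity` are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (broader term, narrower term); hypothetical representation


def common_subgraph(
    nodes_a: Set[str], edges_a: Set[Edge],
    nodes_b: Set[str], edges_b: Set[Edge],
) -> Tuple[Set[str], Set[Edge]]:
    """Return the shared nodes and edges of two graphs with unique node labels.

    Assumption: labels are unique within each graph, so the largest common
    subgraph is simply the intersection of the node sets and edge sets.
    """
    shared_nodes = nodes_a & nodes_b
    shared_edges = {
        (parent, child)
        for (parent, child) in edges_a & edges_b
        if parent in shared_nodes and child in shared_nodes
    }
    return shared_nodes, shared_edges


def mcs_similarity(
    nodes_a: Set[str], edges_a: Set[Edge],
    nodes_b: Set[str], edges_b: Set[Edge],
) -> float:
    """Size of the common subgraph divided by the size of the larger graph.

    Assumption: graph size is counted as nodes plus edges.
    """
    shared_nodes, shared_edges = common_subgraph(nodes_a, edges_a, nodes_b, edges_b)
    size_common = len(shared_nodes) + len(shared_edges)
    size_larger = max(len(nodes_a) + len(edges_a), len(nodes_b) + len(edges_b))
    return size_common / size_larger if size_larger else 0.0


# Two toy term hierarchies that share a root and two of their branches.
doc_a_nodes = {"plagiarism", "detection", "graph", "similarity"}
doc_a_edges = {("plagiarism", "detection"), ("plagiarism", "graph"), ("graph", "similarity")}
doc_b_nodes = {"plagiarism", "detection", "graph", "topic"}
doc_b_edges = {("plagiarism", "detection"), ("plagiarism", "graph"), ("graph", "topic")}

score = mcs_similarity(doc_a_nodes, doc_a_edges, doc_b_nodes, doc_b_edges)
print(f"{score:.3f}")
```

In this toy case the two graphs share three nodes and two edges out of seven elements each, so the score is 5/7. A score built this way rewards documents whose terms are related in the same hierarchical arrangement, not merely documents that mention the same terms.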
Acknowledgements
The author acknowledges support from Project No. 71673122, funded by the National Natural Science Foundation of China, and from the Nanjing University Innovation and Creative Program for PhD candidate (CXCY17-09).
Cite this article
Zhang, T., Lee, B. & Zhu, Q. Semantic measure of plagiarism using a hierarchical graph model. Scientometrics 121, 209–239 (2019). https://doi.org/10.1007/s11192-019-03204-x
DOI: https://doi.org/10.1007/s11192-019-03204-x