Fuzzy Semantic-Based Similarity and Big Data for Detecting Multilingual Plagiarism in Arabic Documents

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 915)


Intelligent monolingual plagiarism is already a complicated, fuzzy process; adding translation turns it into a cross-language problem and obfuscates it further, which poses difficulties for current plagiarism detection methods. Multilingual plagiarism can be more complicated than simple copy, translate, and paste: it is defined as the unacknowledged reuse of a text involving its translation from one natural language to another without proper reference to the original source. Before detection, several NLP techniques are used to characterize the input texts (tokenization, stop-word removal, part-of-speech tagging, and text segmentation). In this paper, fuzzy semantic similarity between words is studied using the WordNet-based similarity measures of Wu & Palmer and Lin. A common problem in any data processing system is efficient large-scale text comparison. Fuzzy semantic similarity for revealing dishonest practices in Arabic documents is especially demanding, owing to the complexity of the Arabic language, the growing number of publications, and the rate of suspicious documents that may be sources of plagiarism. To remedy this, vague concepts and fuzzy techniques are used in a big data environment. The work is carried out in parallel using Apache Hadoop with its distributed file system HDFS and the MapReduce programming model. The proposed approach was evaluated on 400 English and Arabic cases from different sources (news, articles, tweets, and academic works), of which 25% are machine-translated plagiarism cases and 75% are translated cases (machine and human) with a proportion of obfuscated plagiarism, e.g. handmade paraphrases and back-translation. Experimental verification and comparison show that the results and running time of Fuzzy-WuP are better than those of Fuzzy-Lin. Results are evaluated using three measures: precision, recall, and F-measure.
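As an illustration of two ingredients named in the abstract, the following is a minimal sketch of the standard Wu & Palmer formula, sim(a, b) = 2 * depth(lcs) / (depth(a) + depth(b)), and of precision, recall, and F-measure. It is not the paper's implementation: the toy taxonomy `PARENT`, the function names, and the counts used in the usage note are hypothetical, and a real system would query WordNet rather than a hand-written tree.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy taxonomy (child -> parent), standing in for WordNet.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}


def path_to_root(node: str) -> List[str]:
    """Return the chain node, parent, ..., root."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(node: str) -> int:
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wup_similarity(a: str, b: str) -> float:
    """Wu & Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the
    # least common subsumer (the deepest common ancestor in a tree).
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and balanced F-measure from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

In this toy tree, `wup_similarity("dog", "cat")` is higher than `wup_similarity("dog", "car")` because the first pair shares the deeper ancestor `animal`. Calling `precision_recall_f(8, 2, 2)` with made-up counts shows how the three evaluation measures relate.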


Keywords: CLPD, Fuzzy sets, Semantic similarity, Hadoop, HDFS, MapReduce



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. LMACS Laboratory, Mathematics Department, Faculty of Sciences and Techniques, Sultan Moulay Slimane University, Beni-Mellal, Morocco
  2. TIAD Laboratory, Computer Sciences Department, Faculty of Sciences and Techniques, Sultan Moulay Slimane University, Beni-Mellal, Morocco
