An Innovative Similarity Measure for Sentence Plagiarism Detection

  • Conference paper
Computational Science and Its Applications – ICCSA 2016 (ICCSA 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9790)

Abstract

We propose and experimentally assess Semantic Word Error Rate (SWER), an innovative similarity measure for sentence plagiarism detection. SWER introduces a complex approach based on latent semantic analysis, which is capable of outperforming competitor methods in plagiarism detection accuracy. We describe the principles and functionalities of SWER, and we complement our analytical contribution with a significant preliminary experimental analysis. The derived results are promising and confirm the goodness of our proposal.

Author information

Corresponding author

Correspondence to Alfredo Cuzzocrea.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Augello, A., Cuzzocrea, A., Pilato, G., Spiccia, C., Vassallo, G. (2016). An Innovative Similarity Measure for Sentence Plagiarism Detection. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2016. ICCSA 2016. Lecture Notes in Computer Science, vol 9790. Springer, Cham. https://doi.org/10.1007/978-3-319-42092-9_42

  • DOI: https://doi.org/10.1007/978-3-319-42092-9_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-42091-2

  • Online ISBN: 978-3-319-42092-9

  • eBook Packages: Computer Science, Computer Science (R0)
