An Innovative Similarity Measure for Sentence Plagiarism Detection

Augello, Agnese; Cuzzocrea, Alfredo; Pilato, Giovanni; Spiccia, Carmelo; Vassallo, Giorgio

doi:10.1007/978-3-319-42092-9_42

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9790)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

We propose and experimentally assess Semantic Word Error Rate (SWER), an innovative similarity measure for sentence plagiarism detection. SWER introduces an approach based on latent semantic analysis that is capable of outperforming competing methods in plagiarism detection accuracy. We describe the principles and functionality of SWER, and we complement this analytical contribution with a preliminary experimental analysis. The results are promising and confirm the effectiveness of our proposal.
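
The abstract does not say how SWER is computed, so the following is only an illustrative sketch of the general idea its name suggests, not the authors' method. It computes a classic word error rate (word-level Levenshtein distance divided by the reference length) and lets the substitution cost shrink when two words are semantically close. The function names, the cost formula `1 - cosine`, and the toy two-dimensional word vectors are all assumptions made for this example; in practice the vectors might come from a latent semantic analysis model.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def substitution_cost(w1, w2, vectors):
    """Hypothetical cost: 0 for identical words, 1 - cosine for known words, else 1."""
    if w1 == w2:
        return 0.0
    if w1 in vectors and w2 in vectors:
        return 1.0 - max(0.0, cosine(vectors[w1], vectors[w2]))
    return 1.0


def weighted_wer(reference, hypothesis, vectors=None):
    """Word-level edit distance with weighted substitutions, normalised by reference length."""
    vectors = vectors or {}
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the cost of aligning the first i-1 reference words with j hypothesis words.
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, ref_word in enumerate(ref, start=1):
        curr = [float(i)]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1.0,       # delete a reference word
                curr[j - 1] + 1.0,   # insert a hypothesis word
                prev[j - 1] + substitution_cost(ref_word, hyp_word, vectors),
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy vectors, invented for illustration only.
toy_vectors = {"car": (0.9, 0.1), "automobile": (0.8, 0.2)}
plain = weighted_wer("the car is fast", "the automobile is fast")
semantic = weighted_wer("the car is fast", "the automobile is fast", toy_vectors)
print(plain, semantic)
```

With no vectors the function reduces to ordinary word error rate, so a one-word paraphrase counts as a full error; with vectors, a near-synonym substitution costs much less, which is the kind of behaviour a plagiarism detector would want when wording is changed but meaning is kept.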

Author information

Authors and Affiliations

Istituto di Calcolo e Reti ad Alte Prestazioni (ICAR), Italian National Research Council (CNR), Palermo, Italy
Agnese Augello, Alfredo Cuzzocrea, Giovanni Pilato & Carmelo Spiccia
University of Trieste, Trieste, Italy
Alfredo Cuzzocrea
Università degli Studi di Palermo, Palermo, Italy
Giorgio Vassallo

Corresponding author

Correspondence to Alfredo Cuzzocrea.

Editor information

Editors and Affiliations

University of Perugia, Perugia, Italy
Osvaldo Gervasi
University of Basilicata, Potenza, Italy
Beniamino Murgante
Covenant University, Ota, Nigeria
Sanjay Misra
University of Minho, Braga, Portugal
Ana Maria A.C. Rocha
Polytechnic University, Bari, Italy
Carmelo M. Torre
Monash University, Clayton, Victoria, Australia
David Taniar
Kyushu Sangyo University, Fukuoka, Japan
Bernady O. Apduhan
Saint Petersburg State University, Saint Petersburg, Russia
Elena Stankova
Beijing University of Posts and Telecommunications, Beijing, China
Shangguang Wang

About this paper

Cite this paper

Augello, A., Cuzzocrea, A., Pilato, G., Spiccia, C., Vassallo, G. (2016). An Innovative Similarity Measure for Sentence Plagiarism Detection. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2016. ICCSA 2016. Lecture Notes in Computer Science, vol 9790. Springer, Cham. https://doi.org/10.1007/978-3-319-42092-9_42

DOI: https://doi.org/10.1007/978-3-319-42092-9_42
Published: 01 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42091-2
Online ISBN: 978-3-319-42092-9
eBook Packages: Computer Science, Computer Science (R0)
