Vector Representation of Words for Plagiarism Detection Based on String Matching

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10274)

Abstract

Plagiarism detection in documents requires an appropriate definition of document similarity and an efficient way to compute it. This paper evaluates the validity of using vector representations of words to define document similarity, in terms of both processing time and accuracy of plagiarism detection. It proposes a plagiarism detection algorithm based on the score vector weighted by vector representations of words. The score vector between two documents gives, for every possible gap between the starting positions of the documents, the number of matches between corresponding words. The score vector and its weighted version can be computed efficiently using convolutions. Two types of vector representation of words are evaluated with the proposed algorithm: randomly generated vectors, and a distributed representation generated from training data by a neural network-based method. The experimental results show that using the weighted score vector instead of the normal one reduces the processing time at the cost of a slight decrease in accuracy, and that the randomly generated vector representation is more suitable for the algorithm than the distributed representation in terms of the tradeoff between processing time and accuracy.
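To make the score vector concrete, the following is a minimal sketch, not the authors' implementation. It assumes documents are lists of word tokens and that the weight of an aligned word pair is the inner product of the two word vectors; the function names (`fft`, `correlate`, `score_vector`, `weighted_score_vector`, `random_embedding`) and the choice of random vectors with entries ±1/sqrt(dim) are illustrative assumptions. Index `i` of each returned list corresponds to the gap `i - (len(pattern) - 1)`.

```python
import cmath
import math
import random


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def correlate(x, y):
    """Return c with c[g + len(y) - 1] = sum_j x[g + j] * y[j] for every gap g."""
    n, m = len(x), len(y)
    size = 1
    while size < n + m:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - n))
    # Reversing y turns the correlation into an ordinary convolution.
    fy = fft([complex(v) for v in reversed(y)] + [0j] * (size - m))
    conv = fft([p * q for p, q in zip(fx, fy)], invert=True)
    return [v.real / size for v in conv[:n + m - 1]]


def score_vector(text, pattern):
    """Exact match counts per gap: one correlation per distinct shared word."""
    total = [0.0] * (len(text) + len(pattern) - 1)
    for w in set(text) & set(pattern):  # words in only one document never match
        x = [1.0 if t == w else 0.0 for t in text]
        y = [1.0 if p == w else 0.0 for p in pattern]
        for i, v in enumerate(correlate(x, y)):
            total[i] += v
    return [round(v) for v in total]


def weighted_score_vector(text, pattern, embed, dim):
    """Sum of inner products of aligned word vectors: one correlation per dimension."""
    total = [0.0] * (len(text) + len(pattern) - 1)
    for d in range(dim):
        x = [embed[w][d] for w in text]
        y = [embed[w][d] for w in pattern]
        for i, v in enumerate(correlate(x, y)):
            total[i] += v
    return total


def random_embedding(words, dim, seed=0):
    """Assumed scheme: entries are +-1/sqrt(dim), so identical words score 1."""
    rng = random.Random(seed)
    s = 1.0 / math.sqrt(dim)
    return {w: [rng.choice((-s, s)) for _ in range(dim)] for w in sorted(set(words))}


text = "the quick brown fox jumps over the lazy dog".split()
pattern = "quick brown cat jumps".split()
exact = score_vector(text, pattern)
embed = random_embedding(text + pattern, dim=64)
approx = weighted_score_vector(text, pattern, embed, dim=64)
```

In this sketch the exact score vector needs one correlation per distinct shared word, while the weighted version needs one per vector dimension, which suggests why the weighted form can be faster when the vocabulary is much larger than the dimension. With random vectors the weighted score only approximates the match count, since mismatched words contribute small nonzero inner products. A production version would use a library FFT rather than the recursive one shown here.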

Keywords

Text processing; Plagiarism detection; Document similarity; Score vector; Vector representation of words

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Fujitsu Laboratories, Kawasaki, Japan
  2. Kyushu University, Fukuoka, Japan
  3. Kyushu Institute of Information Sciences, Dazaifu, Fukuoka, Japan
