Filtering Documents for Plagiarism Detection

  • Conference paper
Discovery Science (DS 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11198)

Abstract

Plagiarism detection requires efficient methods. This paper proposes a fast and scalable method for detecting “copy and paste”-type plagiarism in documents. Detecting this type of plagiarism requires comprehensive matching of ordered word occurrences, which demands either a long processing time or a large database. The author improved the scalability of an existing fast method based on the fast Fourier transform by applying the idea of frequency-domain filtering. He evaluated the effect of this improvement on the accuracy of the plagiarism detection method and achieved an effective trade-off between accuracy and the required database size.
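
To make "matching of ordered word occurrences" with the fast Fourier transform concrete, the following is a minimal sketch of the classical FFT-based match-count computation (in the style of Fischer and Paterson), not the author's implementation. The function names `fft`, `convolve`, and `match_counts` are hypothetical, and the per-word indicator encoding is an assumption chosen for illustration. For every alignment of a short pattern against a longer text, it counts how many words agree position by position.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two.

    With invert=True the result is unnormalised (the caller divides by n).
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def convolve(x, y):
    """Linear convolution of two real sequences via zero-padded FFTs."""
    out_len = len(x) + len(y) - 1
    size = 1
    while size < out_len:
        size *= 2
    fx = fft(list(x) + [0.0] * (size - len(x)))
    fy = fft(list(y) + [0.0] * (size - len(y)))
    product = [u * v for u, v in zip(fx, fy)]
    inverse = fft(product, invert=True)
    return [v.real / size for v in inverse[:out_len]]


def match_counts(text, pattern):
    """For each alignment i, count positions j with text[i + j] == pattern[j].

    text and pattern are sequences of hashable symbols (characters or words).
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        raise ValueError("pattern must be non-empty and no longer than text")
    totals = [0.0] * (n - m + 1)
    # Assumption: one 0/1 indicator vector per distinct pattern symbol.
    for symbol in set(pattern):
        t = [1.0 if s == symbol else 0.0 for s in text]
        # Reversing the pattern turns convolution into cross-correlation.
        p = [1.0 if s == symbol else 0.0 for s in reversed(pattern)]
        conv = convolve(t, p)
        for i in range(n - m + 1):
            totals[i] += conv[i + m - 1]
    # Round away floating-point error from the FFT.
    return [int(round(v)) for v in totals]


words = "the cat sat on the mat".split()
query = "the dog sat".split()
print(match_counts(words, query))
```

An alignment whose count is close to the pattern length suggests a copied passage. The sketch runs one convolution per distinct word, which is only practical for small vocabularies; the frequency-domain filtering step described in the abstract is not reproduced here, because the abstract does not specify it.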

Author information

Corresponding author

Correspondence to Kensuke Baba.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Baba, K. (2018). Filtering Documents for Plagiarism Detection. In: Soldatova, L., Vanschoren, J., Papadopoulos, G., Ceci, M. (eds) Discovery Science. DS 2018. Lecture Notes in Computer Science, vol 11198. Springer, Cham. https://doi.org/10.1007/978-3-030-01771-2_23

  • DOI: https://doi.org/10.1007/978-3-030-01771-2_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01770-5

  • Online ISBN: 978-3-030-01771-2

  • eBook Packages: Computer Science, Computer Science (R0)