Abstract
Plagiarism detection requires efficient methods. This paper proposes a fast and scalable method for detecting “copy and paste”-type plagiarism in documents. Detecting this type of plagiarism requires comprehensive matching of ordered word occurrences, which in turn demands either a long processing time or a large database. The author improved the scalability of an existing fast method based on the fast Fourier transform by applying frequency-domain filtering. He evaluated the effect of this improvement on the accuracy of the plagiarism detection method, and achieved an effective trade-off between accuracy and the required database size.
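As a rough illustration of what FFT-based matching of ordered word occurrences can look like, the sketch below computes the classical match-count (string matching with mismatches) score: for every alignment of a short word sequence against a longer one, it counts the positions where the words agree. This is an assumption about the general family of methods the abstract refers to, not the author's implementation, and it does not include the frequency-domain filtering step. The names `fft`, `correlate`, and `match_counts` are hypothetical.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two.

    With invert=True the result is the unnormalized inverse transform.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def correlate(text_bits, pattern_bits):
    """Sliding dot product of pattern_bits over text_bits, computed via FFT."""
    n, m = len(text_bits), len(pattern_bits)
    size = 1
    while size < n + m:  # pad so circular convolution equals linear convolution
        size *= 2
    a = fft(list(text_bits) + [0] * (size - n))
    # Reversing the pattern turns convolution into correlation.
    b = fft(list(reversed(pattern_bits)) + [0] * (size - m))
    product = fft([x * y for x, y in zip(a, b)], invert=True)
    # Divide by size to normalize the inverse; keep only full overlaps.
    return [product[k].real / size for k in range(m - 1, n)]


def match_counts(text, pattern):
    """counts[i] = number of j with text[i + j] == pattern[j]."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    counts = [0.0] * (n - m + 1)
    # One correlation of 0/1 indicator vectors per distinct pattern symbol.
    for symbol in set(pattern):
        t = [1 if w == symbol else 0 for w in text]
        p = [1 if w == symbol else 0 for w in pattern]
        for i, c in enumerate(correlate(t, p)):
            counts[i] += c
    return [round(c) for c in counts]


text = "the cat sat on the mat".split()
pattern = "the mat".split()
print(match_counts(text, pattern))  # [1, 0, 0, 0, 2]
```

A score equal to the pattern length (here 2, at offset 4) marks a verbatim copy; lower scores indicate partial overlap. The cost per distinct symbol is one FFT-sized correlation, which is why a large vocabulary makes the naive version expensive in time or storage.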
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Baba, K. (2018). Filtering Documents for Plagiarism Detection. In: Soldatova, L., Vanschoren, J., Papadopoulos, G., Ceci, M. (eds) Discovery Science. DS 2018. Lecture Notes in Computer Science, vol 11198. Springer, Cham. https://doi.org/10.1007/978-3-030-01771-2_23
DOI: https://doi.org/10.1007/978-3-030-01771-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01770-5
Online ISBN: 978-3-030-01771-2
eBook Packages: Computer Science, Computer Science (R0)