A Detection of the Most Influential Documents
Abstract
This work results from ongoing research on semantic compression and robust algorithms applicable to plagiarism detection. The article briefly describes the Sentence Hashing Algorithm for Plagiarism Detection (SHAPD) and compares it with the available alternatives that use frame structures for subsequence detection. The core of the publication is devoted to applying SHAPD to the task of discovering the most influential documents in a corpus. The experiments were carried out on multiple datasets that differ in structure and content, and the observations gathered during them are summarised in the article. The experiments allowed the authors to verify their initial hypothesis that it is possible to single out the most important documents in a corpus by capturing the citation relations among them.
Keywords
Longest Common Subsequence; Common Frame; Local Discontinuity; Plagiarism Detection; Related Work Section