Experiments with Filtered Detection of Similar Academic Papers

HaCohen-Kerner, Yaakov; Tayeb, Aharon

doi:10.1007/978-3-642-33185-5_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7557)

Included in the following conference series:

International Conference on Artificial Intelligence: Methodology, Systems, and Applications

Abstract

In this research, we investigate the efficient detection of similar academic papers. Given a specific paper and a corpus of academic papers, most of the papers in the corpus are first filtered out using a fast filter method. Then, 47 methods (baseline methods and combinations of them) are applied to detect similar papers; 34 of these are variants of new methods. These 34 methods are divided into three new method sets: rare words, combinations of at least two methods, and comparisons between portions of the papers. Some of the 34 heuristic methods achieve better results than previous heuristic methods when compared with the results of the “Full Fingerprint” (FF) method, an expensive method that served as an expert. Moreover, the new methods run much faster than the FF method. The most interesting finding is a method called CWA(1), which computes the frequency of rare words that appear only once in both compared papers. This method was found to be an efficient measure for checking whether two papers are similar.
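
The rare-word idea behind CWA(1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify tokenization or how the frequency is normalized, so the tokenizer, the function names (`tokenize`, `once_words`, `shared_once_word_score`), and the normalization by the first paper's once-only words are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens (assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def once_words(text):
    """Return the set of words that occur exactly once in the text."""
    counts = Counter(tokenize(text))
    return {word for word, n in counts.items() if n == 1}


def shared_once_word_score(paper_a, paper_b):
    """Fraction of paper_a's once-only words that are also once-only in paper_b.

    Hypothetical normalization: the abstract does not say how the
    frequency is normalized, so dividing by paper_a's count is an assumption.
    """
    once_a = once_words(paper_a)
    once_b = once_words(paper_b)
    if not once_a:
        return 0.0
    return len(once_a & once_b) / len(once_a)


if __name__ == "__main__":
    a = "the cat sat on the mat quietly"
    b = "a cat sat quietly near a door"
    print(shared_once_word_score(a, b))
```

Under these assumptions, a higher score means the two papers share more words that each uses only once, which is the kind of signal the abstract attributes to CWA(1). In a filtered setting, such a score would be computed only for the papers that survive the fast filter.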

Author information

Authors and Affiliations

Dept. of Computer Science, Jerusalem College of Technology, 91160, Jerusalem, Israel
Yaakov HaCohen-Kerner & Aharon Tayeb

Editor information

Editors and Affiliations

School of Computer Science, University of Manchester, Oxford Road, M13 9PL, Manchester, UK
Allan Ramsay
Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, 2 Acad. G. Bonchev, 1113, Sofia, Bulgaria
Gennady Agre

About this paper

Cite this paper

HaCohen-Kerner, Y., Tayeb, A. (2012). Experiments with Filtered Detection of Similar Academic Papers. In: Ramsay, A., Agre, G. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2012. Lecture Notes in Computer Science, vol 7557. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33185-5_1

DOI: https://doi.org/10.1007/978-3-642-33185-5_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33184-8
Online ISBN: 978-3-642-33185-5
eBook Packages: Computer Science (R0)
