Experiments with Filtered Detection of Similar Academic Papers

  • Conference paper
Artificial Intelligence: Methodology, Systems, and Applications (AIMSA 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7557)

Abstract

In this research, we investigate the efficient detection of similar academic papers. Given a specific paper and a corpus of academic papers, most of the papers in the corpus are first filtered out using a fast filter method. Then, 47 methods (baseline methods and combinations of them) are applied to detect similar papers; 34 of these are variants of new methods. The 34 methods are divided into three new method sets: rare words, combinations of at least two methods, and comparisons between portions of the papers. Some of the 34 heuristic methods achieve better results than previous heuristic methods when compared with the results of the “Full Fingerprint” (FF) method, an expensive method that served as an expert. Moreover, the new methods run much faster than the FF method. The most interesting finding is a method called CWA(1), which computes the frequency of rare words that appear only once in both compared papers. This method was found to be an efficient measure for checking whether two papers are similar.
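
The abstract does not give the exact definition of CWA(1), so the following is only a rough sketch of the idea under stated assumptions: a word counts as "rare" if it occurs exactly once in a paper, and the score is the share of such words that the two papers have in common, normalized by the smaller of the two sets. The function names (`once_words`, `cwa1_score`), the tokenization, and the normalization are all hypothetical choices made for illustration, not the authors' implementation.

```python
import re
from collections import Counter


def once_words(text: str) -> set[str]:
    """Return the words that occur exactly once in `text`.

    Assumption: tokens are lowercase alphabetic runs; the paper may
    tokenize and define "rare" differently.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {word for word, count in counts.items() if count == 1}


def cwa1_score(paper_a: str, paper_b: str) -> float:
    """Hypothetical CWA(1)-style score in the range [0, 1].

    Counts the words that appear exactly once in BOTH papers and divides
    by the size of the smaller once-only set (an assumed normalization).
    """
    once_a = once_words(paper_a)
    once_b = once_words(paper_b)
    smaller = min(len(once_a), len(once_b))
    if smaller == 0:
        return 0.0  # no once-only words to compare
    return len(once_a & once_b) / smaller


if __name__ == "__main__":
    a = "winnowing selects fingerprints fingerprints from hashed shingles"
    b = "winnowing uses hashed shingles shingles"
    print(cwa1_score(a, b))
```

Because each paper is reduced to a set of once-only words, a comparison costs one pass over each text plus a set intersection, which is consistent with the abstract's claim that such methods are far cheaper than a full fingerprint comparison.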


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

HaCohen-Kerner, Y., Tayeb, A. (2012). Experiments with Filtered Detection of Similar Academic Papers. In: Ramsay, A., Agre, G. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2012. Lecture Notes in Computer Science, vol 7557. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33185-5_1

  • DOI: https://doi.org/10.1007/978-3-642-33185-5_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33184-8

  • Online ISBN: 978-3-642-33185-5

  • eBook Packages: Computer Science, Computer Science (R0)
