Copy detection in Chinese documents using Ferret
The Ferret copy detector has been used since 2001 to find plagiarism in large collections of students’ coursework in English. This article reports on extending its application to Chinese, with experiments on corpora of coursework collected from two Chinese universities. Our experiments show that Ferret can find both artificially constructed plagiarism and actually occurring, previously undetected plagiarism. We discuss issues of representation, focus on the effectiveness of a sub-symbolic approach, and show that Ferret does not need to find word boundaries first.
Keywords: Chinese processing, Copy detection, Ferret, Plagiarism, Word definition
Dr. JunPeng Bao’s work at the University of Hertfordshire, UK, as a Visiting International Fellow is sponsored by the Royal Society. The authors would like to thank James Malcolm and Wei Ji for their help in preparing this paper.