Language Resources and Evaluation, Volume 40, Issue 3–4, pp 357–365

Copy detection in Chinese documents using Ferret

  • Jun Peng Bao
  • Caroline Lyon
  • Peter C. R. Lane


The Ferret copy detector has been used since 2001 to find plagiarism in large collections of students’ coursework in English. This article reports on extending its application to Chinese, with experiments on corpora of coursework collected from two Chinese universities. Our experiments show that Ferret can find both artificially constructed plagiarism and actually occurring, previously undetected plagiarism. We discuss issues of representation, focus on the effectiveness of a sub-symbolic approach, and show that Ferret does not need to find word boundaries first.
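
The claim that copy detection can proceed without first finding word boundaries can be illustrated with a small sketch. It assumes, for illustration only, that documents are compared on overlapping three-character sequences and scored with a Jaccard-style resemblance (shared trigrams divided by all distinct trigrams). The function names `char_trigrams` and `resemblance` are hypothetical, and this is not the Ferret implementation.

```python
def char_trigrams(text: str) -> set[str]:
    """Return the set of overlapping 3-character sequences, ignoring whitespace.

    Assumption for this sketch: raw characters are the unit, so no word
    segmentation step is needed for Chinese text.
    """
    chars = [c for c in text if not c.isspace()]
    return {"".join(chars[i:i + 3]) for i in range(len(chars) - 2)}


def resemblance(a: str, b: str) -> float:
    """Jaccard-style score: shared trigrams / all distinct trigrams."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    union = ta | tb
    # Two texts shorter than three characters share no trigrams; score them 0.
    return len(ta & tb) / len(union) if union else 0.0


if __name__ == "__main__":
    doc1 = "我们研究中文文档的抄袭检测方法"
    doc2 = "本文研究中文文档的抄袭检测问题"
    print(round(resemblance(doc1, doc2), 3))
```

A pair of documents with an unusually high score would then be flagged for a human to inspect; any threshold is a tuning choice and is not taken from the article.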


Keywords: Chinese processing, Copy detection, Ferret, Plagiarism, Word definition



Dr. Jun Peng Bao is sponsored by the Royal Society as a Visiting International Fellow at the University of Hertfordshire, UK. The authors would like to thank James Malcolm and Wei Ji for their help in preparing this paper.



Copyright information

© Springer Science+Business Media B.V. 2007

Authors and Affiliations

  • Jun Peng Bao (1)
  • Caroline Lyon (2), corresponding author
  • Peter C. R. Lane (2)
  1. Department of Computer Science & Technology, Xi’an Jiaotong University, Xi’an, China
  2. School of Computer Science, University of Hertfordshire, Hatfield, UK
