Language Resources and Evaluation, Volume 45, Issue 1, pp 63–82

Intrinsic plagiarism analysis

Abstract

Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world where a reference collection is given. This article investigates whether plagiarism can be detected by a computer program when no reference collection can be provided, e.g., when the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta learning approach of Koppel and Schler, termed “unmasking”, can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.
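To make the first three stages of such an analysis chain concrete, the following is a minimal sketch and not the method evaluated in the article. It assumes fixed-size word chunks, a toy three-feature style model, and a simple z-score outlier test standing in for a real one-class classifier; the meta learning (unmasking) step is omitted. All function names, the function-word list, and the threshold are hypothetical choices made for illustration.

```python
import re
from statistics import mean, pstdev

# Hypothetical, deliberately tiny function-word list (an assumption, not from the article).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")


def chunk_words(text, size=50):
    """Split text into consecutive chunks of `size` words (the last may be shorter)."""
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def style_vector(words):
    """Toy style model: mean word length, type-token ratio, function-word rate."""
    tokens = [re.sub(r"\W+", "", w).lower() for w in words]
    tokens = [t for t in tokens if t]
    if not tokens:
        return [0.0, 0.0, 0.0]
    avg_len = mean(len(t) for t in tokens)
    ttr = len(set(tokens)) / len(tokens)
    fw_rate = sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)
    return [avg_len, ttr, fw_rate]


def outlier_scores(vectors):
    """Score each chunk by its mean absolute z-score against the whole document."""
    columns = list(zip(*vectors))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # fall back to 1.0 if a feature is constant
    return [
        mean(abs(x - mu) / sd for x, mu, sd in zip(vec, mus, sigmas))
        for vec in vectors
    ]


def suspicious_chunks(text, size=50, threshold=1.5):
    """Return indices of chunks whose style deviates strongly from the document average."""
    chunks = chunk_words(text, size)
    scores = outlier_scores([style_vector(c) for c in chunks])
    return [i for i, score in enumerate(scores) if score > threshold]


if __name__ == "__main__":
    plain = "the cat sat in the sun and the dog ran "
    odd = ("notwithstanding heterogeneous epistemological considerations "
           "multidimensional phenomenological interpretations necessitate "
           "comprehensive reconceptualisation ")
    print(suspicious_chunks(plain * 3 + odd + plain * 2, size=10))
```

The z-score test treats the document's average style as the single known class, which only makes sense if most of the text stems from one author. The chunk size and threshold here are illustrative and would need tuning on real data.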

Keywords

Plagiarism detection, Authorship verification, Stylometry, One-class classification

References

  1. Argamon, S., Šarić, M., & Stein, S. S. (2003). Style mining of electronic messages for multiple authorship discrimination: First results. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 475–480). New York, NY, USA: ACM. ISBN 1-58113-737-0. doi: 10.1145/956750.956805.
  2. Bernstein, Y., & Zobel, J. (2004). A scalable system for identifying co-derivative documents. In A. Apostolico & M. Melucci (Eds.), Proceedings of the string processing and information retrieval symposium (SPIRE) (pp. 55–67). Padova, Italy: Springer. Published as LNCS 3246.
  3. Brin, S., Davis, J., & Garcia-Molina, H. (1995). Copy detection mechanisms for digital documents. In SIGMOD ’95 (pp. 398–409). New York, NY, USA: ACM Press. ISBN 0-89791-731-6.
  4. Broder, A. Z., Eiron, N., Fontoura, M., Herscovici, M., Lempel, R., McPherson, J., et al. (2006). Indexing shared content in information retrieval systems. In EDBT ’06 (pp. 313–330).
  5. Chall, J. S., & Dale, E. (1995). Readability revisited: The new Dale–Chall readability formula. Cambridge, MA: Brookline Books.
  6. Chaski, C. E. (2005). Who’s at the keyboard? Authorship attribution in digital evidence investigations. IJDE, 4(1), 1–14.
  7. Chawla, N. V., Bowyer, K. W., & Kegelmeyer, P. W. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
  8. Choi, F. Y. Y. (2000). Advances in domain independent linear text segmentation. In Proceedings of the first conference on North American chapter of the association for computational linguistics (pp. 26–33). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
  9. Dale, E., & Chall, J. S. (1948). A formula for predicting readability. Educational Research Bulletin, 27, 11–20.
  10. Finkel, R. A., Zaslavsky, A., Monostori, K., & Schmidt, H. (2002). Signature extraction for overlap detection in documents. In Proceedings of the 25th Australian conference on Computer science (pp. 59–64). Australian Computer Society, Inc. ISBN 0-909925-82-8.
  11. Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32, 221–233.
  12. Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. In Proceedings of the 25th VLDB conference Edinburgh, Scotland (pp. 518–529).
  13. Graham, N., Hirst, G., & Marthi, B. (2005). Segmenting a document by stylistic character. Natural Language Engineering, 11(4), 397–415. Supersedes August 2003 workshop version.
  14. Gunning, R. (1952). The technique of clear writing. New York: McGraw-Hill.
  15. Henzinger, M. (2006). Finding near-duplicate web pages: A large-scale evaluation of algorithms. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 284–291). New York, NY, USA: ACM Press. ISBN 1-59593-369-7. doi: 10.1145/1148170.1148222.
  16. Hilton, M. L., & Holmes, D. I. (1993). An assessment of cumulative sum charts for authorship attribution. Literary and Linguistic Computing, 8(2), 73–80.
  17. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504–507.
  18. Hoad, T. C., & Zobel, J. (2003). Methods for identifying versioned and plagiarised documents. Journal of the American Society for Information Science and Technology, 54(3), 203–215.
  19. Holmes, D. I. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13(3), 111–117. doi: 10.1093/llc/13.3.111.
  20. Honore, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2), 172–177.
  21. Indyk, P., & Motwani, R. (1998). Approximate nearest neighbor: Towards removing the curse of dimensionality. In Proceedings of the 30th symposium on theory of computing (pp. 604–613).
  22. Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233–334. ISSN 1554-0669. doi: 10.1561/1500000005.
  23. Kacmarcik, G., & Gamon, M. (2006). Obfuscating document stylometry to preserve author anonymity. In Proceedings of the COLING/ACL on main conference poster sessions (pp. 444–451). Morristown, NJ, USA: Association for Computational Linguistics.
  24. Kincaid, J., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel. Research branch report 8–75. Millington, TN: Naval Technical Training US Naval Air Station.
  25. Kjell, B., Woods Addison, W., & Frieder, O. (1994). Discrimination of authorship using visualization. Information Processing and Management, 30(1), 141–150. ISSN 0306-4573. doi: 10.1016/0306-4573(94)90029-9.
  26. Kleinberg, J. (1997). Two algorithms for nearest-neighbor search in high dimensions. In STOC ’97: Proceedings of the twenty-ninth annual ACM symposium on theory of computing.
  27. Koppel, M., & Schler, J. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of IJCAI’03 workshop on computational approaches to style analysis and synthesis. Acapulco, Mexico.
  28. Koppel, M., & Schler, J. (2004a). Authorship verification as a one-class classification problem. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning (pp. 62). New York, NY, USA: ACM. ISBN 1-58113-828-5. doi: 10.1145/1015330.1015448.
  29. Koppel, M., & Schler, J. (2004b). Authorship verification as a one-class classification problem. In Proceedings of the 21st international conference on machine learning. Banff, Canada: ACM Press.
  30. Koppel, M., Schler, J., Argamon, S., & Messeri, E. (2006). Authorship attribution with thousands of candidate authors. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 659–660). New York, NY, USA: ACM. ISBN 1-59593-369-7. doi: 10.1145/1148170.1148304.
  31. Koppel, M., Schler, J., & Bonchek-Dokow, E. (2007). Measuring differentiability: Unmasking pseudonymous authors. Journal of Machine Learning Research, 8, 1261–1276. ISSN 1533-7928.
  32. Koppel, M., Schler, J., & Argamon, S. (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1), 9–26.
  33. Malyutov, M. B. (2006). Authorship attribution of texts: A review. Lecture Notes in Computer Science, 2063, 362–380.
  34. Manevitz, L. M., & Yousef, M. (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 2, 139–154.
  35. Mansfield, J. S. (2004). Textbook plagiarism in psy101 general psychology: incidence and prevention. In Proceedings of the 18th annual conference on undergraduate teaching of psychology: Ideas and innovations. New York, USA: SUNY Farmingdale.
  36. Meyer zu Eissen, S., & Stein, B. (2004). Genre classification of web pages: User study and feasibility analysis. In S. Biundo, T. Frühwirth, & G. Palm (Eds.), KI 2004: Advances in artificial intelligence, vol. 3228 LNAI of Lecture Notes in artificial intelligence (pp. 256–269). Berlin Heidelberg New York: Springer. ISSN 0302-9743.
  37. Meyer zu Eissen, S., & Stein, B. (2006). Intrinsic plagiarism detection. In M. Lalmas, A. MacFarlane, S. M. Rüger, A. Tombros, T. Tsikrika, & A. Yavlinsky (Eds.), Proceedings of the European conference on information retrieval (ECIR 2006), vol. 3936 of Lecture Notes in Computer Science (pp. 565–569). New York: Springer. ISBN 3-540-33347-9.
  38. Meyer zu Eissen, S., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H. J. Lenz (Eds.), Advances in data analysis (pp. 359–366). New York: Springer. ISBN 978-3-540-70980-0.
  39. Morton, A. Q., & Michaelson, S. (1990). The qsum plot. Technical report, University of Edinburgh.
  40. Mosteller, F., & Wallace, D. L. (1964). Inference and disputed authorship: Federalist papers. Reading, MA: Addison-Wesley Educational Publishers Inc. ISBN 0201048655.
  41. Pavelec, D., Oliveira, L. S., Justino, E. J. R., & Batista, L. V. (2008). Using conjunctions and adverbs for author verification. Journal of UCS, 14(18), 2967–2981.
  42. Potthast, M., Eiselt, A., Stein, B., Barrón-Cedeño, A., & Rosso, P. (Eds.). (2009). Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia. PAN Plagiarism Corpus 2009 (PAN-PC-09). http://www.webis.de/research/corpora.
  43. Rätsch, G., Mika, S., Schölkopf, B., & Müller, K.-R. (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1184–1199. ISSN 0162-8828. doi: 10.1109/TPAMI.2002.1033211.
  44. Reynar, J. C. (1998). Topic segmentation: Algorithms and applications. Ph.D. thesis, University of Pennsylvania.
  45. Rudman, J. (1997). The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31, 351–365.
  46. Russell, S. J., & Norvig, P. (1995). Artificial intelligence: A modern approach. Englewood Cliffs, NJ: Prentice-Hall.
  47. Sanderson, C., & Guenter, S. (2006a). On authorship attribution via markov chains and sequence kernels. In Pattern recognition, 2006. ICPR 2006. 18th international conference on (vol. 3, pp. 437–440). doi: 10.1109/ICPR.2006.899.
  48. Sanderson, C., & Guenter, S. (2006b). Short text authorship attribution via sequence kernels, markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (pp. 482–491). URL http://acl.ldc.upenn.edu/W/W06/W06-1657.pdf.
  49. Stamatatos, E. (2007). Author identification using imbalanced and limited training texts. In A. M. Tjoa & R. R. Wagner (Eds.), 18th international conference on database and expert systems applications (DEXA 07) (pp. 237–241). IEEE, September 2007. ISBN 0-7695-2932-1. doi:  10.1109/DEXA.2007.37.
  50. Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of American Society for Information Science & Technology, 60(3), 538–556. ISSN 1532-2882. doi: 10.1002/asi.v60:3.
  51. Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2001). Computer-based authorship attribution without lexical measures. Computers and the Humanities, 35, 193–214.
  52. Stefik, M. (1995). Introduction to knowledge systems. San Mateo, CA, USA: Morgan Kaufmann.
  53. Stein, B. (2005). Fuzzy-fingerprints for text-based information retrieval. In K. Tochtermann & H. Maurer (Eds.), Proceedings of the 5th international conference on knowledge management (I-KNOW 05), Graz, Journal of Universal Computer Science (pp. 572–579). Know-Center.
  54. Stein, B. (2007). Principles of hash-based text retrieval. In C. Clarke, N. Fuhr, N. Kando, W. Kraaij, & A. de Vries (Eds.), 30th annual international ACM SIGIR conference (pp. 527–534). ACM, July 2007. ISBN 978-1-59593-597-7.
  55. Stein, B., & Meyer zu Eissen, S. (2007). Intrinsic plagiarism analysis with meta learning. In B. Stein, M. Koppel, & E. Stamatatos (Eds.), SIGIR workshop on plagiarism analysis, authorship identification, and near-duplicate detection (PAN 07) (pp. 45–50). CEUR-WS.org, July 2007. URL http://ceur-ws.org/Vol-276.
  56. Stein, B., & Meyer zu Eissen, S. (2007). Topic-identifikation: Formalisierung, analyse und neue Verfahren. KI—Künstliche Intelligenz, 3, 16–22. ISSN 0933-1875. URL http://www.kuenstliche-intelligenz.de/index.php?id=7758.
  57. Stein, B., Lipka, N., & Meyer zu Eissen, S. (2008). Meta analysis within authorship verification. In A. M. Tjoa & R. R. Wagner (Eds.), 19th international conference on database and expert systems applications (DEXA 08) (pp. 34–39). IEEE, September 2008. ISBN 978-0-7695-3299-8. doi: 10.1109/DEXA.2008.20.
  58. Surdulescu, R. (2004). Verifying authorship. Final project report CS391L, University of Texas at Austin.
  59. Tax, D. M. J. (2001). One-class classification. Ph.D. thesis, Technische Universiteit Delft.
  60. Tax, D. M. J., & Duin, R. P. W. (2001). Combining one-class classifiers. In Proceedings of the second international workshop on multiple classifier systems (pp. 299–308). New York: Springer. ISBN 3-540-42284-6.
  61. Tweedie, F. J., & Baayen, H. R. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352. doi: 10.1023/A:1001749303137.
  62. van Halteren, H. (2004). Linguistic profiling for author recognition and verification. In ACL ’04: Proceedings of the 42nd annual meeting on association for computational linguistics (pp. 199). Morristown, NJ, USA: Association for Computational Linguistics. doi: 10.3115/1218955.1218981.
  63. van Halteren, H. (2007). Author verification by linguistic profiling: An exploration of the parameter space. ACM Transactions on Speech and Language Processing, 4(1), 1. ISSN 1550-4875. doi: 10.1145/1187415.1187416.
  64. Yang, H., & Callan, J. P. (2006). Near-duplicate detection by instance-level constrained clustering. In E. N. Efthimiadis, S. Dumais, D. Hawking, & K. Järvelin (Eds.), SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 421–428). ISBN 1-59593-369-7.
  65. Yule, G. (1944). The statistical study of literary vocabulary. Cambridge: Cambridge University Press.
  66. Zheng, R., Li, J., Chen, H., & Huang, Z. (2006). A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3), 378–393. doi: 10.1002/asi.20316.
  67. Zipf, G. K. (1932). Selective studies and the principle of relative frequency in language.

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Faculty of Media, Media Systems, Bauhaus-Universität Weimar, Weimar, Germany
