Identification of Original Document by Using Textual Similarities
When two documents share similar content, whether accidentally or intentionally, it is usually unknown which of the two is the original source of that content. This knowledge can be crucial for charging or acquitting someone of plagiarism, for establishing the provenance of a document, or, in the case of sensitive information, for making sure that the source of the information can be relied upon. Our system identifies the original document using the idea that pieces of text written by the same author resemble each other more than they resemble text written by different authors. Given a pair of documents with shared content, our system compares the shared part with the remaining text in each document, treating them as bags of words. When there is no reference text by one of the authors to compare against, our system makes predictions based on the similarity of the shared content to just one of the documents.
Keywords: original document, bag-of-words, document provenance, plagiarism
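The comparison described in the abstract can be illustrated with a minimal sketch. The abstract states only that the shared part and the remaining text are treated as bags of words. The cosine similarity measure, the tokenizer, and the function names (`bag_of_words`, `cosine_similarity`, `guess_original`) used here are assumptions made for illustration, not the paper's actual method.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count bags (assumed measure)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag is similar to nothing
    return dot / (norm_a * norm_b)


def guess_original(shared, remainder_a, remainder_b):
    """Guess which document the shared passage originated in.

    The shared passage is compared with the non-shared text of each
    document; the document whose remaining text is more similar is
    taken to be the original.
    """
    shared_bag = bag_of_words(shared)
    sim_a = cosine_similarity(shared_bag, bag_of_words(remainder_a))
    sim_b = cosine_similarity(shared_bag, bag_of_words(remainder_b))
    if sim_a == sim_b:
        return "undecided", sim_a, sim_b
    return ("A" if sim_a > sim_b else "B"), sim_a, sim_b
```

For the case with only one reference text, the same similarity score could be compared against a threshold instead of against a second score. The abstract does not say how that decision is made, so no threshold is suggested here.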