Efficient privacy-preserving similar document detection

Murugesan, Mummoorthy; Jiang, Wei; Clifton, Chris; Si, Luo; Vaidya, Jaideep

doi:10.1007/s00778-009-0175-9

Regular Paper
Published: 16 January 2010

The VLDB Journal, Volume 19, pages 457–475 (2010)

Mummoorthy Murugesan¹,
Wei Jiang²,
Chris Clifton¹,
Luo Si¹ &
Jaideep Vaidya³

Abstract

Similar document detection plays an important role in many applications, such as file management, copyright protection, plagiarism prevention, and duplicate submission detection. State-of-the-art protocols assume that the contents of files stored on a server (or multiple servers) are directly accessible. This assumption makes such protocols unsuitable for any environment where the documents themselves are sensitive and cannot be openly read, and it rules out practical applications such as detecting plagiarized documents between two conferences, where submissions are confidential. We propose novel protocols to detect similar documents between two entities whose documents cannot be openly shared with each other. The similarity measure can be a simple cosine similarity computed on entire documents or on document fragments, the latter enabling detection of partial copying. We conduct extensive experiments to show the practical value of the proposed protocols. While the proposed base protocols are much more efficient than solutions based on general secure multiparty computation, they are still slow for large document sets. We therefore investigate a clustering-based approach that significantly reduces the running time and achieves over 90% accuracy in our experiments. This makes secure similar document detection both practical and feasible.
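
As a rough illustration of the similarity measure named in the abstract, the following sketch computes cosine similarity on whole documents and on document fragments. It works on plaintext only and does not implement the privacy-preserving protocols or the clustering step. The whitespace tokenization, the raw term-frequency weighting, the fixed fragment size, and all function names are assumptions made for this example, not details taken from the article.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def fragments(text, size):
    """Split a document into consecutive fragments of `size` tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def max_fragment_similarity(doc_a, doc_b, size=4):
    """Highest cosine similarity over all fragment pairs (hypothetical helper)."""
    frags_a = [term_vector(f) for f in fragments(doc_a, size)]
    frags_b = [term_vector(f) for f in fragments(doc_b, size)]
    return max(
        (cosine_similarity(fa, fb) for fa in frags_a for fb in frags_b),
        default=0.0,
    )
```

Comparing fragments rather than whole documents lets a copied passage stand out even when the rest of the two documents differs, which is the intuition behind detecting partial copying.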

Author information

Authors and Affiliations

Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN, 47907, USA
Mummoorthy Murugesan, Chris Clifton & Luo Si
Department of Computer Science, Missouri University of Science and Technology, 311 Computer Science Building, 500 W. 15th St., Rolla, MO, 65409, USA
Wei Jiang
MSIS Department, Rutgers University, 1 Washington Park, Newark, NJ, 07102, USA
Jaideep Vaidya

Corresponding author

Correspondence to Jaideep Vaidya.

Additional information

Portions of this work were supported by the Air Force Office of Scientific Research under grant MURI award FA9550-08-1-0265 and by the National Science Foundation under grants IIS-0746830 and CNS-0746943.

About this article

Cite this article

Murugesan, M., Jiang, W., Clifton, C. et al. Efficient privacy-preserving similar document detection. The VLDB Journal 19, 457–475 (2010). https://doi.org/10.1007/s00778-009-0175-9

Received: 25 March 2009
Revised: 14 November 2009
Accepted: 19 November 2009
Published: 16 January 2010
Issue Date: August 2010
DOI: https://doi.org/10.1007/s00778-009-0175-9
