The VLDB Journal, Volume 19, Issue 4, pp 457–475

Efficient privacy-preserving similar document detection

  • Mummoorthy Murugesan
  • Wei Jiang
  • Chris Clifton
  • Luo Si
  • Jaideep Vaidya
Regular Paper

Abstract

Similar document detection plays an important role in many applications, such as file management, copyright protection, plagiarism prevention, and duplicate submission detection. State-of-the-art protocols assume that the contents of files stored on a server (or multiple servers) are directly accessible. This assumption makes such protocols unsuitable for environments where the documents themselves are sensitive and cannot be openly read, and it rules out practical applications such as detecting plagiarized documents between two conferences, where submissions are confidential. We propose novel protocols to detect similar documents between two entities whose documents cannot be openly shared with each other. The similarity measure can be a simple cosine similarity on entire documents or on document fragments, the latter enabling detection of partial copying. We conduct extensive experiments to show the practical value of the proposed protocols. While the proposed base protocols are much more efficient than solutions based on general secure multiparty computation, they are still slow for large document sets. We therefore investigate a clustering-based approach that significantly reduces the running time and achieves over 90% accuracy in our experiments. This makes secure similar document detection both practical and feasible.
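To make the similarity measure concrete, the following is a minimal plaintext sketch of cosine similarity on whole documents and on document fragments. It is not the privacy-preserving protocol proposed in the paper: both documents are visible to the code, whereas the protocols compute the measure without openly sharing documents. The whitespace tokenizer, raw term counts, fixed-size fragments, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Map each lowercased whitespace token to its count (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)


def fragments(text, size):
    """Split a document into consecutive fragments of `size` tokens (assumed scheme)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def max_fragment_similarity(doc_a, doc_b, size):
    """Highest cosine similarity over all pairs of fragments from the two documents."""
    vecs_a = [term_vector(f) for f in fragments(doc_a, size)]
    vecs_b = [term_vector(f) for f in fragments(doc_b, size)]
    return max((cosine(a, b) for a in vecs_a for b in vecs_b), default=0.0)
```

Comparing fragments rather than whole documents lets a copied passage stand out even when the surrounding text differs, which is why the fragment variant can flag partial copying.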

Keywords

  • Privacy
  • Information retrieval

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • Mummoorthy Murugesan (1)
  • Wei Jiang (2)
  • Chris Clifton (1)
  • Luo Si (1)
  • Jaideep Vaidya (3)

  1. Department of Computer Science, Purdue University, West Lafayette, USA
  2. Department of Computer Science, Missouri University of Science and Technology, Rolla, USA
  3. MSIS Department, Rutgers University, Newark, USA
