Data Fingerprinting with Similarity Digests

  • Vassil Roussev
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 337)


State-of-the-art techniques for data fingerprinting are based on randomized feature selection, an approach pioneered by Rabin in 1981. This paper proposes a new, statistical approach for selecting fingerprinting features. The approach relies on entropy estimates and a sizeable empirical study to pick out the features that are most likely to be unique to a data object and, therefore, least likely to trigger false positives. The paper also describes the implementation of a tool (sdhash) and the results of an evaluation study. The results demonstrate that the approach works consistently across different types of data, and that its compact footprint allows the digests of targets in excess of 1 TB to be queried in memory.
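
To make the idea of entropy-guided feature selection concrete, the following is a minimal sketch, not the sdhash implementation. It treats every fixed-width sliding window of bytes as a candidate feature, estimates the empirical Shannon entropy of each window, and keeps the windows whose estimate falls inside a chosen band. The function names, the 64-byte window width, and the entropy band are all illustrative assumptions; the abstract does not specify them.

```python
import math
from collections import Counter


def shannon_entropy(window: bytes) -> float:
    """Empirical Shannon entropy of the byte values in `window`, in bits per byte."""
    if not window:
        return 0.0
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def candidate_features(data: bytes, width: int = 64, lo: float = 2.0, hi: float = 5.5):
    """Yield (offset, entropy) for each sliding window with entropy in [lo, hi].

    `width`, `lo` and `hi` are hypothetical parameters chosen for illustration
    only; they are not taken from the paper.
    """
    for off in range(len(data) - width + 1):
        h = shannon_entropy(data[off:off + width])
        if lo <= h <= hi:
            yield off, h


if __name__ == "__main__":
    # 64 identical bytes followed by 64 distinct bytes: entropy rises from
    # 0 bits at offset 0 to 6 bits at offset 64 as the window slides.
    sample = b"a" * 64 + bytes(range(64))
    for off, h in candidate_features(sample):
        print(off, round(h, 3))
```

A real tool would rank the surviving candidates further, for example against empirical statistics gathered from a reference corpus, before inserting them into a compact digest. This sketch stops at the entropy estimate.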


Keywords: Data fingerprinting, similarity digests, fuzzy hashing



Copyright information

© International Federation for Information Processing 2010

Authors and Affiliations

  • Vassil Roussev, University of New Orleans, New Orleans, USA
