Managing Terabyte-Scale Investigations with Similarity Digests

  • Vassil Roussev
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 383)


The relentless increase in storage capacity and decrease in storage cost present an escalating challenge for digital forensic investigations: current forensic technologies are not designed to scale to the degree necessary to process the ever-increasing volumes of digital evidence. This paper describes a similarity-digest-based approach that scales up the task of finding related digital artifacts in massive data sets. The results show that, on commodity multi-core computing systems, digests can be generated at rates exceeding those of cryptographic hashes. In addition, querying the digest of a large (1 TB) target for traces of a small file can be completed in less than one second with very high precision and recall rates.


Keywords: similarity digests, data fingerprinting, hashing


Copyright information

© IFIP International Federation for Information Processing 2012

Authors and Affiliations

  • Vassil Roussev, University of New Orleans, New Orleans, USA
