Managing Terabyte-Scale Investigations with Similarity Digests
The relentless increase in storage capacity and decrease in storage cost present an escalating challenge for digital forensic investigations: current forensic technologies are not designed to scale to the degree needed to process the ever-increasing volumes of digital evidence. This paper describes a similarity-digest-based approach that scales up the task of finding related digital artifacts in massive data sets. The results show that digests can be generated at rates exceeding those of cryptographic hashes on commodity multi-core computing systems. In addition, querying the digest of a large (1 TB) target for the (trace) presence of a small file can be completed in less than one second with very high precision and recall rates.
Keywords: Similarity digests, data fingerprinting, hashing