Using Approximate Matching to Reduce the Volume of Digital Data
Digital forensic investigators frequently have to search for relevant files in massive digital corpora – a task often compared to finding a needle in a haystack. To address this challenge, investigators typically apply cryptographic hash functions to identify known files. However, cryptographic hashing only allows the detection of files that exactly match the known file hash values or fingerprints. This paper demonstrates the benefits of using approximate matching to locate relevant files. The experiments described in this paper used three test images of Windows XP, Windows 7 and Ubuntu 12.04 systems to evaluate fingerprint-based comparisons. The results reveal that approximate matching can improve file identification – in one case, increasing the identification rate from 1.82% to 23.76%.
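The difference between exact and approximate matching can be illustrated with a minimal sketch. It is not the ssdeep algorithm and not the method evaluated in the paper: the standard-library `difflib.SequenceMatcher` stands in for a real approximate matching tool, and the function names `exact_match` and `similarity`, as well as the sample data, are hypothetical and chosen only for illustration.

```python
import hashlib
from difflib import SequenceMatcher


def exact_match(a: bytes, b: bytes) -> bool:
    """Return True only if the SHA-256 digests of both inputs are identical."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def similarity(a: bytes, b: bytes) -> float:
    """Return a similarity score in [0, 1].

    Assumption: difflib is used here as a simple stand-in for an
    approximate matching tool; it compares the raw bytes directly and
    does not produce or compare fingerprints the way ssdeep does.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical "known file" and a variant that differs in a single word.
known = b"The quick brown fox jumps over the lazy dog. " * 20
variant = known.replace(b"lazy", b"idle", 1)

print("exact match with itself: ", exact_match(known, known))
print("exact match with variant:", exact_match(known, variant))
print("similarity to variant:   ", round(similarity(known, variant), 3))
```

A cryptographic hash treats the slightly modified variant as a completely different file, whereas the similarity score stays close to 1. Real approximate matching tools compare compact fingerprints rather than whole files, which is what makes them practical on large corpora.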
Keywords: File identification, approximate matching, ssdeep