Hash-Based File Content Identification Using Distributed Systems
A major challenge in digital forensics is the handling of very large amounts of data. Since forensic investigators often have to analyze several terabytes of data in a single case, efficient and effective tools for automatic data identification and filtering are required. A common data identification technique is to match the cryptographic hashes of files with hashes stored in blacklists and whitelists in order to identify contraband and harmless content, respectively. However, blacklists and whitelists are never complete and they miss most of the files encountered in investigations. Also, cryptographic hash matching fails when file content is altered even very slightly. This paper analyzes several distributed systems for their ability to support file content identification. A framework is presented for automated file content identification that searches for file hashes and collects, aggregates and presents the search results. Experiments demonstrate that the framework can provide identifying information for 26% of the test files from their hashed content, helping reduce the workload of forensic investigators.
Keywords: File content identification, hash values, P2P networks, search engines
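The hash-matching technique described in the abstract can be illustrated with a minimal sketch. It is not the paper's framework: the choice of SHA-1, the function names `sha1_hex` and `classify`, and the sample byte strings are all assumptions made for illustration. The sketch shows a lookup against a blacklist and a whitelist, and how a one-byte change to the content defeats an exact cryptographic match.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a hex string (algorithm is an assumption)."""
    return hashlib.sha1(data).hexdigest()


def classify(data: bytes, blacklist: set[str], whitelist: set[str]) -> str:
    """Classify content by exact hash lookup (hypothetical helper)."""
    digest = sha1_hex(data)
    if digest in blacklist:
        return "blacklisted"
    if digest in whitelist:
        return "whitelisted"
    # Neither list contains the hash: the file remains unidentified.
    return "unknown"


# Hypothetical sample content standing in for real files.
known_bad = b"example contraband payload"
known_good = b"example operating system file"

blacklist = {sha1_hex(known_bad)}
whitelist = {sha1_hex(known_good)}

print(classify(known_bad, blacklist, whitelist))
print(classify(known_good, blacklist, whitelist))
# Appending a single byte changes the digest, so the exact match fails.
print(classify(known_bad + b" ", blacklist, whitelist))
```

Files that fall into the "unknown" case are the ones for which the abstract proposes searching additional sources by hash.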