Hash-Based File Content Identification Using Distributed Systems

  • York Yannikos
  • Jonathan Schluessler
  • Martin Steinebach
  • Christian Winter
  • Kalman Graffi
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 410)


A major challenge in digital forensics is the handling of very large amounts of data. Since forensic investigators often have to analyze several terabytes of data in a single case, efficient and effective tools for automatic data identification and filtering are required. A common data identification technique is to match the cryptographic hashes of files with hashes stored in blacklists and whitelists in order to identify contraband and harmless content, respectively. However, blacklists and whitelists are never complete and they miss most of the files encountered in investigations. Also, cryptographic hash matching fails when file content is altered even very slightly. This paper analyzes several distributed systems for their ability to support file content identification. A framework is presented for automated file content identification that searches for file hashes and collects, aggregates and presents the search results. Experiments demonstrate that the framework can provide identifying information for 26% of the test files from their hashed content, helping reduce the workload of forensic investigators.
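The hash-matching technique described in the abstract, and its sensitivity to small changes, can be illustrated with a minimal sketch. This is not the authors' framework: the choice of SHA-256, the function names `sha256_hex` and `classify`, and the sample byte strings are all assumptions made for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def classify(data: bytes, blacklist: set, whitelist: set) -> str:
    """Label content by exact hash lookup; hypothetical helper, not from the paper."""
    digest = sha256_hex(data)
    if digest in blacklist:
        return "blacklisted"
    if digest in whitelist:
        return "whitelisted"
    # Content missing from both lists cannot be identified this way.
    return "unknown"


# Illustrative stand-ins for real file contents.
known_bad = b"example contraband payload"
known_good = b"example operating system file"

blacklist = {sha256_hex(known_bad)}
whitelist = {sha256_hex(known_good)}

print(classify(known_bad, blacklist, whitelist))
print(classify(known_good, blacklist, whitelist))
# Appending a single byte changes the digest, so the lookup no longer matches.
print(classify(known_bad + b"\x00", blacklist, whitelist))
```

The last call shows both limitations named above: the altered content matches neither list, so an exact-match lookup returns no identifying information for it.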


File content identification, hash values, P2P networks, search engines



Copyright information

© IFIP International Federation for Information Processing 2013

Authors and Affiliations

  • York Yannikos (1)
  • Jonathan Schluessler (2)
  • Martin Steinebach (1)
  • Christian Winter (1)
  • Kalman Graffi (3)

  1. Fraunhofer Institute for Secure Information Technology, Darmstadt, Germany
  2. Vector Informatik, Stuttgart, Germany
  3. University of Dusseldorf, Dusseldorf, Germany
