
IFIP International Conference on Digital Forensics

DigitalForensics 2014: Advances in Digital Forensics X, pp. 133–147

Similarity Hashing Based on Levenshtein Distances


  • Frank Breitinger
  • Georg Ziroff
  • Steffen Lange
  • Harald Baier

Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 433)

Abstract

It is increasingly common in forensic investigations to use automated pre-processing techniques to reduce the massive volumes of data that are encountered. This is usually accomplished by comparing file fingerprints (typically cryptographic hashes) against existing databases. In addition to finding exact matches of cryptographic hashes, it is necessary to find approximate matches corresponding to similar files, such as different versions of a given file.

This paper presents a new stand-alone similarity hashing approach called saHash, which has a modular design and operates in linear time. saHash is almost as fast as SHA-1 and more efficient than other approaches for approximate matching. The similarity hashing algorithm uses four sub-hash functions, each producing its own hash value. The four sub-hashes are concatenated to produce the final hash value. This modularity enables sub-hash functions to be added or removed, e.g., if an exploit for a sub-hash function is discovered. Given the hash values of two byte sequences, saHash returns a lower bound on the number of Levenshtein operations between the two byte sequences as their similarity score. The robustness of saHash is verified by comparing it with other approximate matching approaches such as sdhash.
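
The idea of deriving a lower bound on the Levenshtein distance from compact per-file summaries can be illustrated with a minimal sketch. The abstract does not name the four sub-hash functions, so the two used below (input length and byte histogram) are illustrative stand-ins, not the saHash sub-hashes; the names toy_hash and lower_bound are hypothetical. The sketch relies only on a general fact: each insertion, deletion or substitution raises at most one byte count by one and lowers at most one byte count by one, so the larger of the total surplus and the total deficit between two histograms can never exceed the true edit distance.

```python
from collections import Counter


def levenshtein(a: bytes, b: bytes) -> int:
    """Exact Levenshtein distance (quadratic time), used only as a reference."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def toy_hash(data: bytes):
    """Hypothetical two-part digest: (length, byte histogram).

    Stand-in for a concatenation of sub-hashes; computed in one linear pass.
    """
    return (len(data), Counter(data))


def lower_bound(h1, h2) -> int:
    """Lower bound on the Levenshtein distance, computed from digests only."""
    len1, freq1 = h1
    len2, freq2 = h2
    length_bound = abs(len1 - len2)
    # Counter subtraction keeps only positive differences.
    surplus = sum((freq2 - freq1).values())  # byte counts that must increase
    deficit = sum((freq1 - freq2).values())  # byte counts that must decrease
    return max(length_bound, surplus, deficit)


if __name__ == "__main__":
    a, b = b"kitten", b"sitting"
    print("lower bound:", lower_bound(toy_hash(a), toy_hash(b)))
    print("exact distance:", levenshtein(a, b))
```

The bound is computed from the digests alone, without access to the original byte sequences, which is what makes this style of comparison cheap. It can be loose: two inputs that are permutations of each other have identical histograms and therefore a bound of zero.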

Keywords

  • Fuzzy hashing
  • similarity digest
  • Levenshtein distance



Author information

Authors and Affiliations

  1. Darmstadt University of Applied Sciences, Darmstadt, Germany

    Frank Breitinger, Georg Ziroff, Steffen Lange & Harald Baier

  2. Center for Advanced Security Research Darmstadt, Darmstadt, Germany

    Frank Breitinger & Harald Baier


Editor information

Editors and Affiliations

  1. Air Force Institute of Technology, Wright-Patterson Air Force Base, 45433-7765, OH, USA

    Gilbert Peterson

  2. University of Tulsa, 74104-3189, Tulsa, OK, USA

    Sujeet Shenoi


Copyright information

© 2014 IFIP International Federation for Information Processing

About this paper

Cite this paper

Breitinger, F., Ziroff, G., Lange, S., Baier, H. (2014). Similarity Hashing Based on Levenshtein Distances. In: Peterson, G., Shenoi, S. (eds) Advances in Digital Forensics X. DigitalForensics 2014. IFIP Advances in Information and Communication Technology, vol 433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44952-3_10

  • DOI: https://doi.org/10.1007/978-3-662-44952-3_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-44951-6

  • Online ISBN: 978-3-662-44952-3

  • eBook Packages: Computer Science (R0)
