Class-Aware Similarity Hashing for Data Classification

  • Vassil Roussev
  • Golden Richard III
  • Lodovico Marziale
Part of the IFIP — The International Federation for Information Processing book series (IFIPAICT, volume 285)

Abstract

This paper introduces “class-aware similarity hashes” or “classprints,” which are an outgrowth of recent work on similarity hashing. The approach builds on the notion of context-based hashing to create a framework for identifying data types based on content and for building characteristic similarity hashes for individual data items that can be used for correlation. The principal benefits are that data classification can be fully automated and that a priori knowledge of the underlying data is not necessary beyond the availability of a suitable training set.

Keywords

Similarity hashing, class-aware similarity hashing, classprints

Copyright information

© IFIP International Federation for Information Processing 2008

Authors and Affiliations

  • Vassil Roussev (1)
  • Golden Richard III (1, 2)
  • Lodovico Marziale (1)

  1. The University of New Orleans, New Orleans, USA
  2. Digital Forensics Solutions, LLC, New Orleans, USA
