Class-Aware Similarity Hashing for Data Classification
This paper introduces “class-aware similarity hashes” or “classprints,” which are an outgrowth of recent work on similarity hashing. The approach builds on the notion of context-based hashing to create a framework for identifying data types based on content and for building characteristic similarity hashes for individual data items that can be used for correlation. The principal benefits are that data classification can be fully automated and that a priori knowledge of the underlying data is not necessary beyond the availability of a suitable training set.
Keywords: Similarity hashing, class-aware similarity hashing, classprints